DATAWAREHOUSING
WHY Data Warehousing?
Data warehousing is mainly done for reporting purposes. All the historical data is put
into a Data Warehouse, which can be thought of as a very large database. Later on, reports
are generated out of this Data Warehouse to analyze the business.
What is difference between Enterprise Data Warehouse (EDW) and a Data Mart?
An EDW consists of all the information associated with the entire organization. For example, it
will contain information about all the departments (say Finance, Human Resources,
Marketing, Sales etc).
Whereas a Data Mart ONLY contains the data that is specific to one department (say only
Finance).
Data Warehousing Tools
ETL Tools
ETL stands for Extraction, Transformation and Loading. Tools that extract data from
different data sources (SQL Server, Oracle, Flat Files, Sybase etc) and load it into a Data Warehouse
are known as ETL tools. Some popular ETL tools in the market are Informatica, Ab Initio and
DataStage.
Reporting Tools
Reporting tools are used to generate the reports out of the information (data) stored in the
Data warehouse. Some popular reporting tools in the market are Business Objects, Cognos,
Microstrategy etc.
Data Modeling
A Data Warehouse is based on Fact and Dimension tables. Establishing the relationships between
the various Fact table(s) and Dimension tables is called “Data Modeling”. A Fact table contains the
numeric data that is needed in reports, e.g. revenue, sales etc. A Fact table also contains information
about every Dimension table it is related to. This means the FACT table has all the Dimension
Keys as Foreign Keys.
Data Modeling is of two types:
1. Star Schema Design:
Dimension tables surround the Fact table. Data is in de-normalized form.
2. Snow Flakes Schema Design:
Dimension tables surround the Fact table. Data is in normalized form. A Dimension table may be
further split into sub-dimension tables.
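For illustration only, a star schema of this kind can be sketched in SQL as below. The table and column names (SALES_FACT, DIM_DATE, DIM_PRODUCT and their columns) are hypothetical; the sketch simply shows how the Fact table carries the Dimension keys as Foreign Keys alongside the numeric measures.
-- Minimal star schema sketch (hypothetical names):
CREATE TABLE DIM_DATE    (DATE_KEY    INTEGER PRIMARY KEY, CALENDAR_DATE DATE);
CREATE TABLE DIM_PRODUCT (PRODUCT_KEY INTEGER PRIMARY KEY, PRODUCT_NAME  VARCHAR(50));
CREATE TABLE SALES_FACT (
    DATE_KEY     INTEGER NOT NULL REFERENCES DIM_DATE (DATE_KEY),        -- Dimension key as Foreign Key
    PRODUCT_KEY  INTEGER NOT NULL REFERENCES DIM_PRODUCT (PRODUCT_KEY),  -- Dimension key as Foreign Key
    SALES_AMOUNT DECIMAL(12,2),                                          -- numeric measure used in reports
    QUANTITY     INTEGER
);
In a Snow Flakes design, DIM_PRODUCT could in turn reference a sub-dimension table such as DIM_PRODUCT_CATEGORY.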
Informatica Tool Installation
1. Install Oracle.
2. Install Informatica Client Tools.
3. Install Informatica Server.
While installing the Informatica Server, give the keys for all databases, give a name for the
Repository (and a user name and password), give the TCP/IP port number (4001), and choose
the Oracle version.
The ODBC Driver For Oracle is “Merant-32 bit for Oracle”.
INFORMATICA
About PowerCenter and PowerMart
Welcome to PowerMart and PowerCenter, Informatica’s integrated suite of software
products that deliver an open, scalable solution addressing the complete life cycle for data
warehouse and analytic application development. Both PowerMart and PowerCenter
combine the latest technology enhancements for reliably managing data repositories and
delivering information resources in a timely, usable manner.
The metadata repository coordinates and drives a variety of core functions including
extraction, transformation, loading, and management. The Informatica Server can extract
large volumes of data from multiple platforms, handle complex transformations, and
support high-speed loads. PowerMart and PowerCenter can simplify and accelerate the
process of moving data warehouses from development to test to full production.
The following software features differ between PowerMart and PowerCenter:
If You Are Using PowerCenter
With PowerCenter, you receive all product functionality, including the ability to register
multiple servers, share metadata across repositories, and partition data.
A PowerCenter license lets you create a single repository that you can configure as a global
repository, the core component of a data warehouse.
When this guide mentions a PowerCenter Server, it is referring to an Informatica Server with
a PowerCenter license.
If You Are Using PowerMart
This version of PowerMart includes all features except distributed metadata, multiple
registered servers, and data partitioning. Also, the various options available with
PowerCenter (such as PowerCenter Integration Server for BW, PowerConnect for IBM
DB2, PowerConnect for IBM MQSeries, PowerConnect for SAP R/3, PowerConnect for
Siebel, and PowerConnect for PeopleSoft) are not available with PowerMart.
When this guide mentions a PowerMart Server, it is referring to an Informatica Server with a
PowerMart license.
Informatica Client Tools:
Designer
Server Manager
Repository Manager
Informatica Server Tools:
1. Informatica Server
Load Manager Process and Data Transformation Manager Process
The Load Manager is the primary Informatica Server process. It performs the following
tasks:
 Manages session and batch scheduling.
 Locks the session and reads session properties.
 Reads the parameter file.
 Expands the server and session variables and parameters.
 Verifies permissions and privileges.
 Validates source and target code pages.
 Creates the session log file.
 Creates the Data Transformation Manager (DTM) process, which executes the
session.
The Data Transformation Manager (DTM) process executes the session.
DESIGNER
The Designer has five tools to help you build mappings and mapplets so you can specify
how to move and transform data between sources and targets. The Designer helps you
create source definitions, target definitions, and transformations to build your mappings.
The Designer allows you to work with multiple tools at one time and to work in multiple
folders and repositories at the same time. It also includes windows so you can view folders,
repository objects, and tasks.
Designer Tools
The Designer provides the following five tools:
Source Analyzer.
Used to import or create source definitions for flat file (Fixed-width and delimited
flat files), XML, COBOL, ERP, and relational sources (tables, views, and synonyms).
Warehouse Designer.
Used to import or create target definitions.
Transformation Developer.
Used to create reusable transformations.
Mapplet Designer.
Used to create mapplets.
Mapping Designer.
Used to create mappings.
What is a Transformation?
A transformation is a repository object that generates, modifies, or passes data. The
Designer provides a set of transformations that perform specific functions. For example, an
Aggregator transformation performs calculations on groups of data.
Transformations in a mapping represent the operations the Informatica Server performs on
the data. Data passes into and out of transformations through ports that you connect in a
mapping or mapplet.
Transformations can be active or passive. An active transformation can change the
number of rows that pass through it, such as a Filter transformation that removes rows that
do not meet the configured filter condition. A passive transformation does not change the
number of rows that pass through it, such as an Expression transformation that performs a
calculation on data and passes all rows through the transformation.
Transformations can be connected to the data flow, or they can be unconnected. An
unconnected transformation is not connected to other transformations in the mapping. It is
called within another transformation, and returns a value to that transformation.
Table 8-1 provides a brief description of each transformation:
Table 8-1. Transformation Descriptions
Advanced External Procedure (Active/Connected): Calls a procedure in a shared library or in the COM layer of Windows NT.
Aggregator (Active/Connected): Performs aggregate calculations.
ERP Source Qualifier (Active/Connected): Represents the rows that the Informatica Server reads from an ERP source when it runs a session.
Expression (Passive/Connected): Calculates a value.
External Procedure (Passive/Connected or Unconnected): Calls a procedure in a shared library or in the COM layer of Windows NT.
Filter (Active/Connected): Filters records.
Input (Passive/Connected): Defines mapplet input rows. Available only in the Mapplet Designer.
Joiner (Active/Connected): Joins records from different databases or flat file systems.
Lookup (Passive/Connected or Unconnected): Looks up values.
Normalizer (Active/Connected): Normalizes records, including those read from COBOL sources.
Output (Passive/Connected): Defines mapplet output rows. Available only in the Mapplet Designer.
Rank (Active/Connected): Limits records to a top or bottom range.
Router (Active/Connected): Routes data into multiple transformations based on a group expression.
Sequence Generator (Passive/Connected): Generates primary keys.
Source Qualifier (Active/Connected): Represents the rows that the Informatica Server reads from a relational or flat file source when it runs a session.
Stored Procedure (Passive/Connected or Unconnected): Calls a stored procedure.
Update Strategy (Active/Connected): Determines whether to insert, delete, update, or reject records.
XML Source Qualifier (Passive/Connected): Represents the rows that the Informatica Server reads from an XML source when it runs a session.
Overview Of Transformations:
1. Aggregator
The Aggregator transformation allows you to perform aggregate calculations, such as
averages and sums. The Aggregator transformation is unlike the Expression transformation,
in that you can use the Aggregator transformation to perform calculations on groups. The
Expression transformation permits you to perform calculations on a row-by-row basis only.
When using the transformation language to create aggregate expressions, you can use
conditional clauses to filter records, providing more flexibility than SQL language.
The Informatica Server performs aggregate calculations as it reads, and stores necessary data
group and row data in an aggregate cache.
After you create a session that includes an Aggregator transformation, you can enable the
session option, Incremental Aggregation. When the Informatica Server performs
incremental aggregation, it passes new source data through the mapping and uses historical
cache data to perform new aggregation calculations incrementally.
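As a rough point of comparison only, the group-level calculation an Aggregator performs corresponds to a SQL GROUP BY query like the sketch below; the table and column names are hypothetical, and the conditional clause is illustrated with a CASE expression.
-- Rough SQL analogue of an Aggregator grouped by STORE_ID (hypothetical names):
SELECT STORE_ID,
       SUM(SALES_AMOUNT) AS TOTAL_SALES,                                     -- aggregate per group
       SUM(CASE WHEN QUANTITY > 0 THEN SALES_AMOUNT ELSE 0 END) AS NET_SALES -- conditional aggregation
FROM   SALES_FACT
GROUP  BY STORE_ID;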
2. Filter
The Filter transformation provides the means for filtering rows in a mapping. You pass all
the rows from a source transformation through the Filter transformation, and then enter a
filter condition for the transformation. All ports in a Filter transformation are input/output,
and only rows that meet the condition pass through the Filter transformation.
In some cases, you need to filter data based on one or more conditions before writing it to
targets. For example, if you have a human resources data warehouse containing information
about current employees, you might want to filter out employees who are part-time and
hourly.
The mapping in Figure 18-1 passes the rows from a human resources table that contains
employee data through a Filter transformation. The filter only allows rows through for
employees that make salaries of $30,000 or higher.
3. Joiner
While a Source Qualifier transformation can join data originating from a common source
database, the Joiner transformation joins two related heterogeneous sources residing in
different locations or file systems. The combination of sources can be varied. You can use
the following sources:
a) Two relational tables existing in separate databases
b) Two flat files in potentially different file systems
c) Two different ODBC sources
d) Two instances of the same XML source
e) A relational table and a flat file source
f) A relational table and an XML source
You use the Joiner transformation to join two sources with at least one matching port. The
Joiner transformation uses a condition that matches one or more pairs of ports between the
two sources.
For example, you might want to join a flat file with in-house customer IDs and a
relational database table that contains user-defined customer IDs. You could import the flat
file into a temporary database table, and then perform the join in the database. However, if
you use the Joiner transformation, there is no need to import or create temporary tables.
If two relational sources contain keys, then a Source Qualifier transformation can easily join
the sources on those keys. Joiner transformations typically combine information from two
different sources that do not have matching keys, such as flat file sources. The Joiner
transformation allows you to join sources that contain binary data.
The Joiner transformation supports the following join types, which you set in the Properties
tab:
1. Normal (Default)
2. Master Outer
3. Detail Outer
4. Full Outer
4. Source Qualifier
When you add a relational or a flat file source definition to a mapping, you need to connect
it to a Source Qualifier transformation. The Source Qualifier represents the records that the
Informatica Server reads when it runs a session.
You can use the Source Qualifier to perform the following tasks:
a) Join data originating from the same source database
You can join two or more tables with primary-foreign key relationships by linking
the sources to one Source Qualifier.
b) Filter records when the Informatica Server reads source data
If you include a filter condition, the Informatica Server adds a WHERE clause to the
default query.
c) Specify an outer join rather than the default inner join
If you include a user-defined join, the Informatica Server replaces the join
information specified by the metadata in the SQL query.
d) Specify sorted ports
If you specify a number for sorted ports, the Informatica Server adds an ORDER
BY clause to the default SQL query.
e) Select only distinct values from the source
If you choose Select Distinct, the Informatica Server adds a SELECT DISTINCT statement
to the default SQL query.
f) Create a custom query to issue a special SELECT statement for the Informatica Server to
read source data
For example, you might use a custom query to perform aggregate calculations or
execute a stored procedure.
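A hedged sketch of how these options change the query the Informatica Server issues is shown below; the table and column names are hypothetical, and the exact SQL generated depends on the source database.
-- Default query (sketch):
SELECT CUSTOMER_ID, CUSTOMER_NAME, UPDATE_DATE FROM CUSTOMERS;
-- With a source filter, sorted ports, and Select Distinct (sketch, Oracle-style date syntax):
SELECT DISTINCT CUSTOMER_ID, CUSTOMER_NAME, UPDATE_DATE
FROM   CUSTOMERS
WHERE  UPDATE_DATE >= TO_DATE('01/01/2000', 'MM/DD/YYYY')   -- filter condition becomes a WHERE clause
ORDER  BY CUSTOMER_ID, CUSTOMER_NAME;                        -- sorted ports become an ORDER BY clause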
5. Stored Procedure
A Stored Procedure transformation is an important tool for populating and maintaining
databases. Database administrators create stored procedures to automate time-consuming
tasks that are too complicated for standard SQL statements.
A stored procedure is a precompiled collection of Transact-SQL statements and
optional flow control statements, similar to an executable script. Stored procedures are
stored and run within the database. You can run a stored procedure with the EXECUTE
SQL statement in a database client tool, just as you can run SQL statements. Unlike standard
SQL, however, stored procedures allow user-defined variables, conditional statements, and
other powerful programming features.
Not all databases support stored procedures, and database implementations vary widely on
their syntax. You might use stored procedures to:
a) Drop and recreate indexes.
b) Check the status of a target database before moving records into it.
c) Determine if enough space exists in a database.
d) Perform a specialized calculation.
Database developers and programmers use stored procedures for various tasks within
databases, since stored procedures allow greater flexibility than SQL statements. Stored
procedures also provide error handling and logging necessary for mission critical tasks.
Developers create stored procedures in the database using the client tools provided with the
database.
The stored procedure must exist in the database before creating a Stored Procedure
transformation, and the stored procedure can exist in a source, target, or any database with a
valid connection to the Informatica Server.
You might use a stored procedure to perform a query or calculation that you would
otherwise make part of a mapping. For example, if you already have a well-tested stored
procedure for calculating sales tax, you can perform that calculation through the stored
procedure instead of recreating the same calculation in an Expression transformation.
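A minimal sketch of such a sales tax procedure is shown below in Transact-SQL style; the procedure, parameter, and value names are hypothetical, and the exact syntax varies by database.
-- Hedged Transact-SQL sketch of a sales tax procedure (hypothetical names):
CREATE PROCEDURE calc_sales_tax
    @amount   DECIMAL(12,2),
    @tax_rate DECIMAL(5,4),
    @tax      DECIMAL(12,2) OUTPUT
AS
BEGIN
    SELECT @tax = @amount * @tax_rate   -- return the computed tax through the output parameter
END
-- Example call from a database client tool:
DECLARE @tax DECIMAL(12,2)
EXECUTE calc_sales_tax 100.00, 0.0825, @tax OUTPUT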
6. Sequence Generator
The Sequence Generator transformation generates numeric values. You can use the
Sequence Generator to create unique primary key values, replace missing primary keys, or
cycle through a sequential range of numbers.
The Sequence Generator transformation is a connected transformation. It contains two
output ports that you can connect to one or more transformations. The Informatica Server
generates a value each time a row enters a connected transformation, even if that value is not
used. When NEXTVAL is connected to the input port of another transformation, the
Informatica Server generates a sequence of numbers. When CURRVAL is connected to the
input port of another transformation, the Informatica Server generates the NEXTVAL value
plus one.
You can make a Sequence Generator reusable, and use it in multiple mappings. You might
reuse a Sequence Generator when you perform multiple loads to a single target.
For example, if you have a large input file that you separate into three sessions running in
parallel, you can use a Sequence Generator to generate primary key values. If you use
different Sequence Generators, the Informatica Server might accidentally generate duplicate
key values. Instead, you can use the same reusable Sequence Generator for all three sessions
to provide a unique value for each target row.
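The behaviour is comparable to a database sequence. As a rough analogue only, an Oracle-style sketch (sequence name hypothetical) would be:
-- Rough database analogue of NEXTVAL (Oracle-style, hypothetical name):
CREATE SEQUENCE order_key_seq START WITH 1 INCREMENT BY 1;
SELECT order_key_seq.NEXTVAL FROM dual;   -- each call returns the next value in the sequence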
7. Rank
The Rank transformation allows you to select only the top or bottom rank of data. You can
use a Rank transformation to return the largest or smallest numeric value in a port or group.
You can also use a Rank transformation to return the strings at the top or the bottom of a
session sort order. During the session, the Informatica Server caches input data until it can
perform the rank calculations.
The Rank transformation differs from the transformation functions MAX and MIN,
in that it allows you to select a group of top or bottom values, not just one value. For
example, you can use Rank to select the top 10 salespersons in a given territory. Or, to
generate a financial report, you might also use a Rank transformation to identify the three
departments with the lowest expenses in salaries and overhead. While the SQL language
provides many functions designed to handle groups of data, identifying top or bottom strata
within a set of rows is not possible using standard SQL functions.
You connect all ports representing the same row set to the transformation. Only the rows
that fall within that rank, based on some measure you set when you configure the
transformation, pass through the Rank transformation. You can also write expressions to
transform data or perform calculations.
Figure 22-1 shows a mapping that passes employee data from a human resources table
through a Rank transformation. The Rank only passes the rows for the top 10 highest paid
employees to the next transformation.
8. Look Up
Use a Lookup transformation in your mapping to look up data in a relational table, view, or
synonym. Import a lookup definition from any relational database to which both the
Informatica Client and Server can connect. You can use multiple Lookup transformations in
a mapping.
The Informatica Server queries the lookup table based on the lookup ports in the
transformation. It compares Lookup transformation port values to lookup table column
values based on the lookup condition. Use the result of the lookup to pass to other
transformations and the target.
You can use the Lookup transformation to perform many tasks, including:
a) Get a related value:
For example, if your source table includes employee ID, but you want to include the
employee name in your target table to make your summary data easier to read.
b) Perform a calculation:
Many normalized tables include values used in a calculation, such as gross sales per
invoice or sales tax, but not the calculated value (such as net sales).
c) Update slowly changing dimension tables:
You can use a Lookup transformation to determine whether records already exist in
the target.
You can configure the Lookup transformation to perform different types of lookups.
You can configure the transformation to be connected or unconnected, cached or uncached:
a) Connected or unconnected:
Connected and unconnected transformations receive input and send output in
different ways.
b) Cached or uncached:
Sometimes you can improve session performance by caching the lookup table. If you
cache the lookup table, you can choose to use a dynamic or static cache. By default, the
lookup cache remains static and does not change during the session. With a dynamic cache,
the Informatica Server inserts rows into the cache during the session. Informatica
recommends that you cache the target table as the lookup. This enables you to look up
values in the target and insert them if they do not exist.
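For an uncached lookup, the query issued for each row is conceptually similar to the hedged sketch below; the lookup table, its columns, and the :source_emp_id placeholder are hypothetical. With caching enabled, the same lookup data is read once into the cache instead of being queried row by row.
-- Conceptual per-row lookup query (hypothetical names):
SELECT EMP_NAME, DEPT_ID
FROM   EMPLOYEE_DIM
WHERE  EMP_ID = :source_emp_id;   -- the lookup condition compares a transformation port to a lookup table column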
9. Expression
You can use the Expression transformation to calculate values in a single row before you
write to the target. For example, you might need to adjust employee salaries, concatenate
first and last names, or convert strings to numbers. You can use the Expression
transformation to perform any non-aggregate calculations. You can also use the Expression
transformation to test conditional statements before you output the results to target tables or
other transformations.
Note: To perform calculations involving multiple rows, such as sums or averages, use the
Aggregator transformation. Unlike the Expression transformation, the Aggregator allows
you to group and sort data. For details, see Aggregator Transformation.
10. Router
A Router transformation is similar to a Filter transformation because both transformations
allow you to use a condition to test data. A Filter transformation tests data for one condition
and drops the rows of data that do not meet the condition. However, a Router
transformation tests data for one or more conditions and gives you the option to route rows
of data that do not meet any of the conditions to a default output group.
If you need to test the same input data based on multiple conditions, use a Router
Transformation in a mapping instead of creating multiple Filter transformations to perform
the same task. The Router transformation is more efficient when you design a mapping and
when you run a session. For example, to test data based on three conditions, you only need
one Router transformation instead of three filter transformations to perform this task.
Likewise, when you use a Router transformation in a mapping, the Informatica Server
processes the incoming data only once. When you use multiple Filter transformations in a
mapping, the Informatica Server processes the incoming data for each transformation.
11. Update Strategy
When you design your data warehouse, you need to decide what type of information to store
in targets. As part of your target table design, you need to determine whether to maintain all
the historic data or just the most recent changes.
For example, you might have a target table, T_CUSTOMERS that contains customer data.
When a customer address changes, you may want to save the original address in the table,
instead of updating that portion of the customer record. In this case, you would create a new
record containing the updated address, and preserve the original record with the old
customer address. This illustrates how you might store historical information in a target
table. However, if you want the T_CUSTOMERS table to be a snapshot of current
customer data, you would update the existing customer record and lose the original address.
The model you choose constitutes your update strategy, how to handle changes to existing
records. In PowerMart and PowerCenter, you set your update strategy at two different levels:
a) Within a session: When you configure a session, you can instruct the Informatica Server to
either treat all records in the same way (for example, treat all records as inserts), or use
instructions coded into the session mapping to flag records for different database operations.
b) Within a mapping: Within a mapping, you use the Update Strategy transformation to flag
records for insert, delete, update, or reject.
2. SERVER MANAGER
The Informatica Server moves data from sources to targets based on mapping and session
metadata stored in a repository.
What is a Mapping?
A mapping is a set of source and target definitions linked by transformation objects that
define the rules for data transformation.
What is a Session?
A session is a set of instructions that describes how and when to move data from sources to
targets.
Use the Designer to import source and target definitions into the repository and to build
mappings.
Use the Server Manager to create and manage sessions and batches, and to monitor and stop
the Informatica Server.
When a session starts, the Informatica Server retrieves mapping and session metadata from
the repository to extract data from the source, transform it, and load it into the target.
More about a Session
A session is a set of instructions that tells the Informatica Server how and when to move
data from sources to targets. You create and maintain sessions in the Server Manager.
When you create a session, you enter general information such as the session name, session
schedule, and the Informatica Server to run the session. You can also select options to
execute pre-session shell commands, send post-session email, and FTP source and target
files. Using session properties, you can also override parameters established in the mapping,
such as source and target location, source and target type, error tracing levels, and
transformation attributes. For details on server activity while executing a session, see
Understanding the Server Architecture.
You can group sessions into a batch. The Informatica Server can run the sessions in a batch
in sequential order, or start them concurrently. Some batch settings override session settings.
Once you create a session, you can use either the Server Manager or the command line
program pmcmd to start or stop the session. You can also use the Server Manager to monitor,
edit, schedule, abort, copy, and delete the session.
What is a Batch?
Batches provide a way to group sessions for either serial or parallel execution by the
Informatica Server. There are two types of batches:
a) Sequential: Runs sessions one after the other.
b) Concurrent: Runs sessions at the same time.
You might create a sequential batch if you have sessions with source-target dependencies
that you want to run in a specific order. You might also create a concurrent batch if you
have several independent sessions you need scheduled at the same time. You can place them
all in one batch, then schedule the batch as needed instead of scheduling each individual
session.
You can create, edit, start, schedule, and stop batches with the Server Manager. However,
you cannot copy or abort batches. With pmcmd, you can start and stop batches.
3) REPOSITORY MANAGER
The Informatica repository is a relational database that stores information, or metadata, used
by the Informatica Server and Client tools. Metadata can include information such as
mappings describing how to transform source data, sessions indicating when you want the
Informatica Server to perform the transformations, and connect strings for sources and
targets. The repository also stores administrative information such as usernames and
passwords, permissions and privileges, and product version.
You create and maintain the repository with the Repository Manager client tool. With the
Repository Manager, you can also create folders to organize metadata and groups to organize
users.
The Informatica repository is an integral part of a data mart. A data mart includes the
following components:
a) Targets: The data mart includes one or more databases or flat file systems that store the
information used for decision support.
b) A server engine: Every data mart needs some kind of server application that reads,
transforms, and writes data to targets. In traditional data warehouses, this server application
consists of COBOL or SQL code you write to perform these operations. In PowerMart and
PowerCenter, you use a single server application that runs on UNIX or Windows NT to
read, transform, and write data.
c) Metadata: Designing a data mart involves writing and storing a complex set of
instructions. You need to know where to get data (sources), how to change it, and where to
write the information (targets). PowerMart and PowerCenter call this set of instructions
metadata. Each piece of metadata (for example, the description of a source table in an
operational database) can contain comments about it.
d) A repository: The place where you store the metadata is called a repository. The more
sophisticated your repository, the more complex and detailed metadata you can store in it.
PowerMart and PowerCenter use a relational database as the repository.
IMPROVING MAPPING PERFORMANCE - TIPS
1. Aggregator Transformation
You can use the following guidelines to optimize the performance of an Aggregator
transformation.
a) Use Sorted Input to decrease the use of aggregate caches:
The Sorted Input option reduces the amount of data cached during the session and
improves session performance. Use this option with the Source Qualifier Number of Sorted
Ports option to pass sorted data to the Aggregator transformation.
b) Limit connected input/output or output ports:
Limit the number of connected input/output or output ports to reduce the amount
of data the Aggregator transformation stores in the data cache.
c) Filter before aggregating:
If you use a Filter transformation in the mapping, place the transformation before
the Aggregator transformation to reduce unnecessary aggregation.
2. Filter Transformation
The following tips can help filter performance:
a) Use the Filter transformation early in the mapping:
To maximize session performance, keep the Filter transformation as close as
possible to the sources in the mapping. Rather than passing rows that you plan to discard
through the mapping, you can filter out unwanted data early in the flow of data from sources
to targets.
b) Use the Source Qualifier to filter:
The Source Qualifier transformation provides an alternate way to filter rows. Rather
than filtering rows from within a mapping, the Source Qualifier transformation filters rows
when read from a source. The main difference is that the source qualifier limits the row set
extracted from a source, while the Filter transformation limits the row set sent to a target. Since a
source qualifier reduces the number of rows used throughout the mapping, it provides better
performance.
However, the source qualifier only lets you filter rows from relational sources, while
the Filter transformation filters rows from any type of source. Also, note that since it runs in
the database, you must make sure that the source qualifier filter condition only uses standard
SQL. The Filter transformation can define a condition using any statement or
transformation function that returns either a TRUE or FALSE value.
3. Joiner Transformation
The following tips can help improve session performance:
a) Perform joins in a database:
Performing a join in a database is faster than performing a join in the session. In
some cases, this is not possible, such as joining tables from two different databases or flat
file systems. If you want to perform a join in a database, you can use the following options:
 Create a pre-session stored procedure to join the tables in a database before
running the mapping.
 Use the Source Qualifier transformation to perform the join.
b) Designate as the master source the source with the smaller number of records:
For optimal performance and disk storage, designate the master source as the source
with the lower number of rows. With a smaller master source, the data cache is smaller, and
the search time is shorter.
4. LookUp Transformation
Use the following tips when you configure the Lookup transformation:
a) Add an index to the columns used in a lookup condition:
If you have privileges to modify the database containing a lookup table, you can
improve performance for both cached and uncached lookups. This is important for very
large lookup tables. Since the Informatica Server needs to query, sort, and compare values in
these columns, the index needs to include every column used in a lookup condition.
b) Place conditions with an equality operator (=) first:
If a Lookup transformation specifies several conditions, you can improve lookup
performance by placing all the conditions that use the equality operator first in the list of
conditions that appear under the Condition tab.
c) Cache small lookup tables:
Improve session performance by caching small lookup tables. The result of the
Lookup query and processing is the same, regardless of whether you cache the lookup table
or not.
d) Join tables in the database:
If the lookup table is on the same database as the source table in your mapping and
caching is not feasible, join the tables in the source database rather than using a Lookup
transformation.
e) Unselect the cached lookup option in the Lookup transformation if there is no lookup
override. This improves session performance.
MAPPING VARIABLES
1. Go to Mappings Tab, Click Parameters and Variables Tab, Create a NEW port as below.
Name: $$LastRunTime
Type: Variable
Datatype: date/time
Precision: 19
Scale: 0
Aggregation: Max
Give an Initial Value. For example 1/1/1900.
2. In an EXPRESSION Transformation, create a variable port as below:
SetLastRunTime (date/time) = SETVARIABLE ($$LastRunTime, SESSSTARTTIME)
3. Go to the SOURCE QUALIFIER Transformation,
Click the Properties Tab. In the Source Filter area, ENTER the following expression:
UpdateDateTime (any Date column from source) >= '$$LastRunTime'
AND
UpdateDateTime < '$$$SessStartTime'
Handle Nulls in DATE
iif(isnull(AgedDate),to_date('1/1/1900','MM/DD/YYYY'),trunc(AgedDate,'DAY'))
LOOK UP AND UPDATE STRATEGY EXPRESSION
First, declare a Look Up condition in Look Up Transformation.
For example,
EMPID_IN (column coming from source) = EMPID (column in target table)
Second, drag and drop these two columns into UPDATE Strategy Transformation.
Check the value coming from the source (EMPID_IN) against the column in the target table
(EMPID). If both are equal, this means that the record already exists in the target, so we
need to update the record (DD_UPDATE). Otherwise we insert the record coming from the source
into the target (DD_INSERT). See below for the UPDATE Strategy expression.
IIF ((EMPID_IN = EMPID), DD_UPDATE, DD_INSERT)
NOTE: The Update Strategy expression should always be based on the Primary keys of the
target table.
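As a plain SQL illustration only, the same insert-or-update logic against the target could be sketched as below; the staging and target table names are hypothetical, and in a mapping this work is done by the Lookup and Update Strategy transformations rather than by SQL.
-- Update rows whose key already exists in the target (hypothetical names):
UPDATE target_emp
SET    emp_name = (SELECT s.emp_name FROM stage_emp s WHERE s.empid = target_emp.empid)
WHERE  EXISTS (SELECT 1 FROM stage_emp s WHERE s.empid = target_emp.empid);
-- Insert rows whose key does not yet exist in the target:
INSERT INTO target_emp (empid, emp_name)
SELECT s.empid, s.emp_name
FROM   stage_emp s
WHERE  NOT EXISTS (SELECT 1 FROM target_emp t WHERE t.empid = s.empid);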
EXPRESSION TRANSFORMATION
1.
IIF (ISNULL (ServiceOrderDateValue1),
TO_DATE ('1/1/1900','MM/DD/YYYY'), TRUNC
(ServiceOrderDateValue1,'DAY'))
2.
IIF (ISNULL (NpaNxxId1) or LENGTH (RTRIM (NpaNxxId1))=0 or TO_NUMBER
(NpaNxxId1) <= 0,'UNK', NpaNxxId1)
3.
IIF (ISNULL (InstallMethodId),0,InstallMethodId)
4.
DATE_DIFF (TRUNC (O_ServiceOrderDateValue), TRUNC (O_ServiceOrderDateValue),
'DD')
FILTER CONDITION
To pass only NOT NULL AND NOT SPACES values THROUGH the TRANSFORMATION:
IIF (ISNULL (LENGTH (RTRIM (LTRIM (ADSLTN)))), 0, LENGTH (RTRIM (LTRIM (ADSLTN)))) > 0
SECOND FILTER CONDITION [Pass only NOT NULL values FROM the FILTER]:
iif (isnull (USER_NAME), FALSE, TRUE)
PERFORMANCE TIPS IN GENERAL
Most of the gains in performance derive from good database design, thorough query
analysis, and appropriate indexing. The largest performance gains can be realized by
establishing a good database design.
1. Update Table Statistics in database.
SYBASE SYNTAX:
update all statistics table_name
Adaptive Server’s cost-based optimizer uses statistics about the tables, indexes, and columns
named in a query to estimate query costs. It chooses the access method that the optimizer
determines has the least cost. But this cost estimate cannot be accurate if statistics are not
accurate. Some statistics, such as the number of pages or rows in a table, are updated during
query processing. Other statistics, such as the histograms on
columns, are only updated when you run the update statistics command or when indexes are
created.
If you are having problems with a query performing slowly, and seek help from Technical
Support or a Sybase news group on the Internet, one of the first questions you are likely to be
asked is “Did you run update statistics?” You can use the optdiag command (IN SYBASE)
to see the time update statistics was last run for each column on which statistics exist.
NOTE:
 Running the update statistics commands requires system resources. Like other
maintenance tasks, it should be scheduled at times when load on the server is light. In
particular, update statistics requires table scans or leaf-level scans of indexes, may increase
I/O contention, may use the CPU to perform sorts, and uses the data and procedure caches.
Use of these resources can adversely affect queries running on the server if you run update
statistics at times when usage is high. In addition, some update statistics commands require
shared locks, which can block updates.
• Dropping an index does not drop the statistics for the index, since the optimizer can use
column-level statistics to estimate costs, even when no index exists. If you want to remove
the statistics after dropping an index, you must explicitly delete them with delete statistics.
• Truncating a table does not delete the column-level statistics in sysstatistics. In many cases,
tables are truncated and the same data is reloaded. Since truncate table does not delete the
column-level statistics, there is no need to run update statistics after the table is reloaded, if
the data is the same. If you reload the table with data that has a different distribution of key
values, you need to run update statistics.
• You can drop and re-create indexes without affecting the index statistics, by specifying 0
for the number of steps in the with statistics clause to create index. This create index
command does not affect the statistics in sysstatistics (IN SYBASE):
Create index title_id_ix on titles (title_id) with statistics using 0 values
This allows you to re-create an index without overwriting statistics that have been edited
with optdiag.
• If two users attempt to create an index on the same table, with the same columns, at the
same time, one of the commands may fail due to an attempt to enter a duplicate key value in
sysstatistics.
2. Create Indexes on KEY fields. Keep Index statistics up to date.
NOTE: If data modification performance is poor, you may have too many indexes. While
indexes favor “select operations”, they slow down “data modifications”.
ABOUT INDEXES
Indexes are the most important physical design element in improving database performance:
• Indexes help prevent table scans. Instead of reading hundreds of data pages, a few index
pages and data pages can satisfy many queries.
• For some queries, data can be retrieved from a nonclustered index without ever accessing
the data rows.
• Clustered indexes can randomize data inserts, avoiding insert “hot spots” on the last page
of a table.
• Indexes can help avoid sorts, if the index order matches the order of columns in an order
by clause.
In addition to their performance benefits, indexes can enforce the uniqueness of data.
Indexes are database objects that can be created for a table to speed direct access to specific
data rows. Indexes store the values of the key(s) that were named when the index was
created, and logical pointers to the data pages or to other index pages.
Adaptive Server (SYBASE) provides two types of indexes:
• Clustered indexes, where the table data is physically stored in the order of the keys on the
index:
• For allpages-locked tables, rows are stored in key order on pages, and pages are linked in
key order.
• For data-only-locked tables, indexes are used to direct the storage of data on rows and
pages, but strict key ordering is not maintained.
• Nonclustered indexes, where the storage order of data in the table is not related to index
keys.
You can create only one clustered index on a table because there is only one possible
physical ordering of the data rows. You can create up to 249 nonclustered indexes per table.
A table that has no clustered index is called a “heap”.
3. Drop and Re-create the Indexes that hurt performance.
Drop Indexes (In Pre-Session) before inserting data AND Re-Create Indexes (In Post-Session) after data is inserted.
NOTE: With indexes, inserting data is slower.
Drop indexes that hurt performance. If an application performs data modifications during
the day and generates reports at night, you may want to drop some indexes in the morning
and re-create them at night. Drop indexes during periods when frequent updates occur and
rebuild them before periods when frequent selects occur.
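A hedged sketch of the pre- and post-load SQL follows; the index and table names are hypothetical, and DROP INDEX syntax differs slightly by database (Oracle takes only the index name, while Sybase and SQL Server take table.index).
-- Pre-session (before the load), Oracle-style:
DROP INDEX t_customers_ix;
-- Post-session (after the load):
CREATE INDEX t_customers_ix ON t_customers (customer_id);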
4. Also you can improve performance by:
 Using transaction log thresholds to automate log dumps and “avoid running out of space”.
 Using thresholds for space monitoring in data segments.
 Using partitions to speed loading of data.
5. To tune the SQL Query
We can use “Parallel Hints” in the SELECT statement of the SQL query. Also, when joining,
use the table with the largest number of rows last; in other words, use the table with the smaller
number of rows as the MASTER source. Queries that contain ORDER BY or GROUP BY clauses may
also benefit from creating an index on the ORDER BY or GROUP BY columns. Once you
optimize the query, use the SQL override option to take full advantage of these
modifications.
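An Oracle-style sketch of such an override query is shown below; the hint degree, table, and column names are hypothetical.
-- Hedged SQL override with a parallel hint (Oracle-style, hypothetical names):
SELECT /*+ PARALLEL(s, 4) */
       s.order_id, s.order_date, s.amount
FROM   sales_detail s
ORDER  BY s.order_date;   -- an index on order_date can help the ORDER BY / GROUP BY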
6. Registering Multiple Servers
Performance can also be increased by registering multiple servers which point to the same
repository.
Other methods to Improve Performance
Optimizing the Target Database
If your session writes to a flat file target, you can optimize session performance by writing to
a flat file target that is local to the Informatica Server.
If your session writes to a relational target, consider performing the following tasks to
increase performance:
 Drop indexes and key constraints.
 Increase checkpoint intervals.
 Use bulk loading.
 Use external loading.
 Turn off recovery.
 Increase database network packet size.
 Optimize Oracle target databases.
Dropping Indexes and Key Constraints
When you define key constraints or indexes in target tables, you slow the loading of data to
those tables. To improve performance, drop indexes and key constraints before running
your session. You can rebuild those indexes and key constraints after the session completes.
If you decide to drop and rebuild indexes and key constraints on a regular basis, you
can create pre- and post-load stored procedures to perform these operations each time you
run the session.
Note: To optimize performance, use constraint-based loading only if necessary.
Increasing Checkpoint Intervals
The Informatica Server performance slows each time it waits for the database to perform a
checkpoint. To increase performance, consider increasing the database checkpoint interval.
When you increase the database checkpoint interval, you increase the likelihood that the
database performs checkpoints as necessary, when the size of the database log file reaches its
limit.
Bulk Loading on Sybase and Microsoft SQL Server
You can use bulk loading to improve the performance of a session that inserts a large
amount of data to a Sybase or Microsoft SQL Server database. Configure bulk loading on
the Targets dialog box in the session properties.
When bulk loading, the Informatica Server bypasses the database log, which speeds
performance. Without writing to the database log, however, the target database cannot
perform rollback. As a result, the Informatica Server cannot perform recovery of the session.
Therefore, you must weigh the importance of improved session performance against the
ability to recover an incomplete session.
If you have indexes or key constraints on your target tables and you want to enable
bulk loading, you must drop the indexes and constraints before running the session. After
the session completes, you can rebuild them. If you decide to use bulk loading with the
session on a regular basis, you can create pre- and post-load stored procedures to drop and
rebuild indexes and key constraints.
For other databases, even if you configure the bulk loading option, Informatica
Server ignores the commit interval mentioned and commits as needed.
External Loading on Teradata, Oracle, and Sybase IQ
You can use the External Loader session option to integrate external loading with a session.
If you have a Teradata target database, you can use the Teradata external loader utility to
bulk load target files.
If your target database runs on Oracle, you can use the Oracle SQL*Loader utility to
bulk load target files. When you load data to an Oracle database using a partitioned session,
you can increase performance if you create the Oracle target table with the same number of
partitions you use for the session.
If your target database runs on Sybase IQ, you can use the Sybase IQ external loader
utility to bulk load target files. If your Sybase IQ database is local to the Informatica Server
on your UNIX system, you can increase performance by loading data to target tables directly
from named pipes. Use pmconfig to enable the SybaseIQLocaltoPMServer option. When
you enable this option, the Informatica Server loads data directly from named pipes rather
than writing to a flat file for the Sybase IQ external loader.
Increasing Database Network Packet Size
You can increase the network packet size in the Informatica Server Manager to reduce target
bottleneck. For Sybase and Microsoft SQL Server, increase the network packet size to 8K-16K. For Oracle, increase the network packet size in tnsnames.ora and listener.ora. If you
increase the network packet size in the Informatica Server configuration, you also need to
configure the database server network memory to accept larger packet sizes.
Optimizing Oracle Target Databases
If your target database is Oracle, you can optimize the target database by checking the
storage clause, space allocation, and rollback segments.
When you write to an Oracle database, check the storage clause for database objects.
Make sure that tables are using large initial and next values. The database should also store
table and index data in separate tablespaces, preferably on different disks.
When you write to Oracle target databases, the database uses rollback segments
during loads. Make sure that the database stores rollback segments in appropriate
tablespaces, preferably on different disks. The rollback segments should also have
appropriate storage clauses.
You can optimize the Oracle target database by tuning the Oracle redo log. The
Oracle database uses the redo log to log loading operations. Make sure that redo log size and
buffer size are optimal. You can view redo log properties in the init.ora file.
If your Oracle instance is local to the Informatica Server, you can optimize
performance by using IPC protocol to connect to the Oracle database. You can set up
Oracle database connection in listener.ora and tnsnames.ora.
Improving Performance at mapping level
Optimizing Datatype Conversions
Forcing the Informatica Server to make unnecessary datatype conversions slows
performance.
For example, if your mapping moves data from an Integer column to a Decimal column,
then back to an Integer column, the unnecessary datatype conversion slows performance.
Where possible, eliminate unnecessary datatype conversions from mappings.
Some datatype conversions can improve system performance. Use integer values in place of
other datatypes when performing comparisons using Lookup and Filter transformations.
For example, many databases store U.S. zip code information as a Char or Varchar datatype.
If you convert your zip code data to an Integer datatype, the lookup database stores the zip
code 94303-1234 as 943031234. This helps increase the speed of the lookup comparisons
based on zip code.
Optimizing Lookup Transformations
If a mapping contains a Lookup transformation, you can optimize the lookup. Some of the
things you can do to increase performance include caching the lookup table, optimizing the
lookup condition, or indexing the lookup table.
Caching Lookups
If a mapping contains Lookup transformations, you might want to enable lookup caching. In
general, you want to cache lookup tables that need less than 300MB.
When you enable caching, the Informatica Server caches the lookup table and queries the
lookup cache during the session. When this option is not enabled, the Informatica Server
queries the lookup table on a row-by-row basis. You can increase performance using a
shared or persistent cache:
Shared cache. You can share the lookup cache between multiple transformations. You can
share an unnamed cache between transformations in the same mapping. You can share a
named cache between transformations in the same or different mappings.
Persistent cache. If you want to save and reuse the cache files, you can configure the
transformation to use a persistent cache. Use this feature when you know the lookup table
does not change between session runs. Using a persistent cache can improve performance
because the Informatica Server builds the memory cache from the cache files instead of
from the database.
Reducing the Number of Cached Rows
Use the Lookup SQL Override option to add a WHERE clause to the default SQL
statement. This allows you to reduce the number of rows included in the cache.
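A hedged sketch of such an override follows; the lookup table, its columns, and the STATUS filter are hypothetical.
-- Default lookup query (sketch):
SELECT EMP_ID, EMP_NAME FROM EMPLOYEE_DIM;
-- Overridden to cache only the rows the session actually needs (sketch):
SELECT EMP_ID, EMP_NAME FROM EMPLOYEE_DIM WHERE STATUS = 'ACTIVE';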
Optimizing the Lookup Condition
If you include more than one lookup condition, place the conditions with an equal sign first
to optimize lookup performance.
Indexing the Lookup Table
The Informatica Server needs to query, sort, and compare values in the lookup condition
columns. The index needs to include every column used in a lookup condition. You can
improve performance for both cached and uncached lookups:
Cached lookups. You can improve performance by indexing the columns in the lookup
ORDER BY. The session log contains the ORDER BY statement.
Uncached lookups. Because the Informatica Server issues a SELECT statement for each
row passing into the Lookup transformation, you can improve performance by indexing the
columns in the lookup condition.
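A hedged sketch of the kind of index that supports a lookup follows; the index, table, and column names are hypothetical, and the columns listed should be every column used in the lookup condition (or in the lookup ORDER BY for cached lookups).
-- Index the lookup condition columns (hypothetical names):
CREATE INDEX employee_dim_lkp_ix ON EMPLOYEE_DIM (EMP_ID, DEPT_ID);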
Improving Performance at Repository level
Tuning Repository Performance
The PowerMart and PowerCenter repository has more than 80 tables and almost all tables
use one or more indexes to speed up queries. Most databases keep and use column
distribution statistics to determine which index to use to execute SQL queries optimally.
Database servers do not update these statistics continuously.
In frequently-used repositories, these statistics can become outdated very quickly and SQL
query optimizers may choose a less than optimal query plan. In large repositories, the impact
of choosing a sub-optimal query plan can affect performance drastically. Over time, the
repository becomes slower and slower.
To optimize SQL queries, you might update these statistics regularly. The frequency of
updating statistics depends on how heavily the repository is used. Updating statistics is done
table by table. The database administrator can create scripts to automate the task.
You can use the following information to generate scripts to update distribution statistics.
Note: All PowerMart/PowerCenter repository tables and index names begin with “OPB_”.
Oracle Database
You can generate scripts to update distribution statistics for an Oracle repository.
To generate scripts for an Oracle repository:
1. Run the following queries:
select 'analyze table ', table_name, ' compute statistics;' from user_tables where
table_name like 'OPB_%'
select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes
where INDEX_NAME like 'OPB_%'
This produces an output like the following:
'ANALYZETABLE'  TABLE_NAME         'COMPUTESTATISTICS;'
--------------  -----------------  --------------------
analyze table   OPB_ANALYZE_DEP    compute statistics;
analyze table   OPB_ATTR           compute statistics;
analyze table   OPB_BATCH_OBJECT   compute statistics;
2. Save the output to a file.
3. Edit the file and remove all the headers.
Headers are like the following:
'ANALYZETABLE'  TABLE_NAME         'COMPUTESTATISTICS;'
--------------  -----------------  --------------------
4. Run this as an SQL script. This updates repository table statistics.
Microsoft SQL Server
You can generate scripts to update distribution statistics for a Microsoft SQL Server
repository.
To generate scripts for a Microsoft SQL Server repository:
1. Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This produces an output like the following:
name
-----------------  ------------------
update statistics  OPB_ANALYZE_DEP
update statistics  OPB_ATTR
update statistics  OPB_BATCH_OBJECT
2. Save the output to a file.
3. Edit the file and remove the header information.
Headers are like the following:
name
-----------------  ------------------
4. Add a go at the end of the file.
5. Run this as a sql script. This updates repository table statistics.
Improving Performance at Session level
Optimizing the Session
Once you optimize your source database, target database, and mapping, you can focus on
optimizing the session. You can perform the following tasks to improve overall
performance:
 Run concurrent batches.
 Partition sessions.
 Reduce error tracing.
 Remove staging areas.
 Tune session parameters.
Table 19-1 lists the settings and values you can use to improve session performance:
Table 19-1. Session Tuning Parameters
Setting                Default Value              Suggested Minimum Value   Suggested Maximum Value
DTM Buffer Pool Size   12,000,000 bytes [12 MB]   6,000,000 bytes           128,000,000 bytes
Buffer block size      64,000 bytes [64 KB]       4,000 bytes               128,000 bytes
Index cache size       1,000,000 bytes            1,000,000 bytes           12,000,000 bytes
Data cache size        2,000,000 bytes            2,000,000 bytes           24,000,000 bytes
Commit interval        10,000 rows                N/A                       N/A
Decimal arithmetic     Disabled                   N/A                       N/A
Tracing Level          Normal                     Terse                     N/A
How to correct and load the rejected files when the session completes
During a session, the Informatica Server creates a reject file for each target instance in the
mapping. If the writer or the target rejects data, the Informatica Server writes the rejected
row into the reject file. By default, the Informatica Server creates reject files in the
$PMBadFileDir server variable directory.
The reject file and session log contain information that helps you determine the cause of the
reject. You can correct reject files and load them to relational targets using the Informatica
reject loader utility. The reject loader also creates another reject file for the data that the
writer or target rejects during reject loading.
Complete the following tasks to load reject data into the target:
 Locate the reject file.
 Correct bad data.
 Run the reject loader utility.
NOTE: You cannot load rejected data into a flat file target.
After you locate a reject file, you can read it using a text editor that supports the reject file
code page.
Reject files contain rows of data rejected by the writer or the target database. Though the
Informatica Server writes the entire row in the reject file, the problem generally centers on
one column within the row. To help you determine which column caused the row to be
rejected, the Informatica Server adds row and column indicators to give you more
information about each column:
 Row indicator. The first column in each row of the reject file is the row indicator. The
numeric indicator tells whether the row was marked for insert, update, delete, or reject.
 Column indicator. Column indicators appear after every column of data. The
alphabetical character indicators tell whether the data was valid, overflow, null, or truncated.
The following sample reject file shows the row and column indicators:
3,D,1,D,,D,0,D,1094945255,D,0.00,D,-0.00,D
0,D,1,D,April,D,1997,D,1,D,-1364.22,D,-1364.22,D
0,D,1,D,April,D,2000,D,1,D,2560974.96,D,2560974.96,D
3,D,1,D,April,D,2000,D,0,D,0.00,D,0.00,D
0,D,1,D,August,D,1997,D,2,D,2283.76,D,4567.53,D
0,D,3,D,December,D,1999,D,1,D,273825.03,D,273825.03,D
0,D,1,D,September,D,1997,D,1,D,0.00,D,0.00,D
Row Indicators
The first column in the reject file is the row indicator. The number listed as the row
indicator tells the writer what to do with the row of data.
Table 15-1 describes the row indicators in a reject file:
Table 15-1. Row Indicators in Reject File
Row Indicator   Meaning   Rejected By
0               Insert    Writer or target
1               Update    Writer or target
2               Delete    Writer or target
3               Reject    Writer
If a row indicator is 3, the writer rejected the row because an update strategy expression
marked it for reject.
If a row indicator is 0, 1, or 2, either the writer or the target database rejected the row. To
narrow down the reason why rows marked 0, 1, or 2 were rejected, review the column
indicators and consult the session log.
Column Indicators
After the row indicator is a column indicator, followed by the first column of data, and
another column indicator. Column indicators appear after every column of data and define
the type of the data preceding it.
Table 15-2 describes the column indicators in a reject file:
Table 15-2. Column Indicators in Reject File
D - Valid data. Writer treats as: Good data. The writer passes it to the target database. The
target accepts it unless a database error occurs, such as finding a duplicate key.
O - Overflow. Numeric data exceeded the specified precision or scale for the column. Writer
treats as: Bad data, if you configured the mapping target to reject overflow or truncated data.
N - Null. The column contains a null value. Writer treats as: Good data. The writer passes it
to the target, which rejects it if the target database does not accept null values.
T - Truncated. String data exceeded a specified precision for the column, so the Informatica
Server truncated it. Writer treats as: Bad data, if you configured the mapping target to reject
overflow or truncated data.
After you correct the target data in each of the reject files, append “.in” to each reject file
you want to load into the target database. For example, after you correct the reject file,
t_AvgSales_1.bad, you can rename it t_AvgSales_1.bad.in.
After you correct the reject file and rename it to reject_file.in, you can use the reject loader to
send those files through the writer to the target database.
Use the reject loader utility from the command line to load rejected files into target tables.
The syntax for reject loading differs on UNIX and Windows NT/2000 platforms.
Use the following syntax for UNIX:
pmrejldr pmserver.cfg [folder_name:]session_name
Use the following syntax for Windows NT/2000:
pmrejldr [folder_name:]session_name
Recovering Sessions
If you stop a session or if an error causes a session to stop, refer to the session and error logs
to determine the cause of failure. Correct the errors, and then complete the session. The
method you use to complete the session depends on the properties of the mapping, session,
and Informatica Server configuration.
Use one of the following methods to complete the session:
 Run the session again if the Informatica Server has not issued a commit.
 Truncate the target tables and run the session again if the session is not recoverable.
 Consider performing recovery if the Informatica Server has issued at least one
commit.
When the Informatica Server starts a recovery session, it reads the
OPB_SRVR_RECOVERY table and notes the row ID of the last row committed to the
target database. The Informatica Server then reads all sources again and starts processing
from the next row ID. For example, if the Informatica Server commits 10,000 rows before
the session fails, when you run recovery, the Informatica Server bypasses the rows up to
10,000 and starts loading with row 10,001. The commit point may be different for source- and target-based commits.
By default, Perform Recovery is disabled in the Informatica Server setup. You must enable
Recovery in the Informatica Server setup before you run a session so the Informatica Server
can create and/or write entries in the OPB_SRVR_RECOVERY table.
Causes for Session Failure
 Reader errors. Errors encountered by the Informatica Server while reading the
source database or source files. Reader threshold errors can include alignment errors
while running a session in Unicode mode.
 Writer errors. Errors encountered by the Informatica Server while writing to the
target database or target files. Writer threshold errors can include key constraint
violations, loading nulls into a not null field, and database trigger responses.
 Transformation errors. Errors encountered by the Informatica Server while
transforming data. Transformation threshold errors can include conversion errors,
and any condition set up as an ERROR, such as null input.
Fatal Error
A fatal error occurs when the Informatica Server cannot access the source, target, or
repository. This can include loss of connection or target database errors, such as lack of
database space to load data. If the session uses a Normalizer or Sequence Generator
transformation, the Informatica Server cannot update the sequence values in the repository,
and a fatal error occurs.