Understanding Code Pages and Character Conversion
© 2008-2009 Informatica Corporation
Abstract
Code page character conversion can occur when data passes between databases, database clients, and PowerCenter
connection objects that do not use the same code pages. Character conversion can result in data corruption, data
truncation, or data overflow. To avoid unexpected character conversion, it is important to understand how code pages
affect the characters you process. This article describes how to configure code pages for database clients and
PowerCenter connection objects. It also explains where character conversion occurs and the impact of character
conversion.
Table of Contents
Overview
Configuring Code Page Settings
Configuring the Connection Object Code Page
  Configuring the Reader Connection Object
  Configuring the Writer Connection Object
Configuring the Database and Database Client Code Page
  Writing Data to a Database
  Reading Data from a Database
Data Truncation
  Example of Data Truncation
Troubleshooting
  How to Check the Database Client Settings
  How to Manage Incorrect Data Encoding Within a Database
Overview
Code pages specify the encoding of characters by mapping each character to a hexadecimal value. Code page
character conversion can occur when data passes between databases, database clients, and PowerCenter connection
objects that do not use the same code page. When the code pages are not the same, characters can convert to
incorrect or undefined values. The result of character conversion can be data corruption, data truncation, or data
overflow.
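As a small illustration of this mapping, the following Python sketch prints the hexadecimal byte values of the same character under different code pages. It is not part of PowerCenter. The codec names are Python's, and `cp932` is assumed here to stand in for the MS932 code page; `hex_bytes` is a hypothetical helper.

```python
# Illustrative only: Python codec names stand in for the code pages in this article.
# "cp932" is Python's Windows Japanese codec, assumed here to correspond to MS932.

def hex_bytes(text: str, code_page: str) -> str:
    """Return the encoded bytes of text as an uppercase hexadecimal string."""
    return text.encode(code_page).hex().upper()

# The same character maps to different byte values in different code pages.
latin_a_umlaut = {
    "iso-8859-1": hex_bytes("ä", "iso-8859-1"),  # one byte
    "utf-8": hex_bytes("ä", "utf-8"),            # two bytes
}
hiragana_a = {
    "cp932": hex_bytes("あ", "cp932"),  # two bytes
    "utf-8": hex_bytes("あ", "utf-8"),  # three bytes
}

print(latin_a_umlaut)
print(hiragana_a)
```

Because the byte values differ, bytes written under one code page and read under another no longer identify the same character.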
Character conversion can happen during a PowerCenter session. When the PowerCenter sessions read and write data
in Unicode data movement mode, the following steps occur:
1. The Integration Service spawns a reader thread to read the data. The reader thread uses a connection object to
connect to the database client.
2. The database client reads data from the database. If the database client and the database are not compatible,
unexpected character conversion can occur.
3. The database client sends the data to the reader connection object.
4. The reader thread converts the data from the code page of the reader connection object to the UCS-2 code page.
If the reader connection object and the source data use different code pages, unexpected character conversion
can occur.
5. The Integration Service processes the data and sends the target data to the writer thread.
6. The writer thread converts data from UCS-2 code page to the code page of the writer connection object. The
writer connection object sends target data to the database client.
7. The database client writes data to the database. If the database client code page is not compatible with the
database code page, unexpected character conversion can occur.
To avoid unexpected character conversion, use the same code page for the PowerCenter connection object, the
source/target data, the database client, and the database.
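Steps 4 through 6 above can be modeled with a short Python sketch. This is only an approximation under stated assumptions: `run_session` is a hypothetical function, not a PowerCenter API, and UTF-16-LE stands in for the internal UCS-2 representation.

```python
def run_session(source_bytes: bytes, reader_code_page: str, writer_code_page: str) -> bytes:
    """Hypothetical model of steps 4 to 6: decode, hold internally, re-encode."""
    # Step 4: the reader interprets the bytes using the reader connection code page.
    text = source_bytes.decode(reader_code_page)
    # Step 5: internal form. UTF-16-LE stands in for UCS-2 here; the two produce
    # identical byte values for characters in the Basic Multilingual Plane.
    internal = text.encode("utf-16-le")
    # Step 6: the writer converts the internal form to the writer connection code page.
    return internal.decode("utf-16-le").encode(writer_code_page)

source = "äbc".encode("iso-8859-1")

# Same code page on both sides: the bytes come out unchanged.
same = run_session(source, "iso-8859-1", "iso-8859-1")

# Different writer code page: the characters survive, but the bytes change.
converted = run_session(source, "iso-8859-1", "utf-8")

print(source, same, converted)
```

The sketch leaves out the database client conversions in steps 2 and 7, which happen outside the Integration Service.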
Configuring Code Page Settings
You can set code page values for the following objects:
PowerCenter connection object. The Integration Service uses the connection object code page to read or write data.
Configure the connection object code page in the session connection object.
Source/target database character set. The database code page specifies the code page of the character data
within the database. Configure the code page of the database when you create it.
Database client. The database client driver uses the code page of the database client to read data from and write
data to the database. Configure the code page of the database client on each machine that runs the Integration
Service.
When you configure the connection objects, database, and database client to use the same code page as the data,
unexpected character conversion does not occur. In Figure 1, each object in the session uses the same code page:
Figure 1. No Unexpected Character Conversion Occurs
Since the Integration Service processes data internally using the UCS-2 code page, the following expected data
conversion occurs:
The reader thread converts source data to the UCS-2 encoding.
The writer thread converts target data from UCS-2 to the code page of the writer connection object.
Configuring the Connection Object Code Page
When the Integration Service runs in Unicode data movement mode, the Integration Service processes data internally
using the UCS-2 code page. The reader and writer threads convert the data to and from UCS-2 encoding, based on
the code page specified in the connection object.
Character conversion can occur when the Integration Service runs in Unicode data movement mode and the data does
not use the same code page as the connection object. Configure the connection object code page to
be the same as the code page of the data that it reads or writes. For example, if the database client converts the ISO-8859-1 data from an ISO-8859-1 database to UTF-8, set the reader connection object code page to UTF-8. If you read
ISO-8859-1 encoded data from a UTF-8 database and the database client uses the UTF-8 code page, set the reader
connection object code page to ISO-8859-1.
Configuring the Reader Connection Object
The reader connection object code page should be the same as the code page of the data that it reads. When the code
page specified in the reader connection object does not match the code page that the source data is encoded with, the
UCS-2 data that the Integration Service processes can be corrupted.
In Figure 2, data corruption occurs because the reader connection object uses the MS932 code page but the data is
encoded as ISO-8859-1:
Figure 2. Reader Converts Data Incorrectly
In Figure 2, the reader converts the source data from ISO-8859-1 encoding to UCS-2 encoding using the MS932 code
page. Since the code page is not associated with the encoding of the data that the reader receives, the Integration
Service may process data that contain incorrect or undefined values.
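The effect described for Figure 2 can be reproduced outside PowerCenter with a minimal Python sketch. It assumes that Python's `cp932` codec is an acceptable stand-in for MS932. The `errors="replace"` argument substitutes U+FFFD for any byte sequence that the code page does not define, so the decode never raises.

```python
# Source data: the string "äbc" stored as ISO-8859-1 bytes.
source_bytes = "äbc".encode("iso-8859-1")

# Reader code page matches the data: the characters are recovered.
correct = source_bytes.decode("iso-8859-1")

# Reader code page is MS932 (cp932 in Python), which does not match the data.
# The byte 0xE4 is a lead byte of a two-byte character in this code page,
# so it is combined with the following byte instead of being read as "ä".
wrong = source_bytes.decode("cp932", errors="replace")

print(repr(correct))
print(repr(wrong))
```

The second result contains different characters than the source string, which is the kind of corruption that the mismatched reader connection object introduces.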
Configuring the Writer Connection Object
Configure the writer connection object code page to the same code page as the data that it writes. To write target data
in the same encoding as the source data, configure the writer connection object code page to the same code page as
the reader connection object. To write target data in a different encoding than the source data, configure the writer
connection object code page to the code page of the data that you want to write.
In Figure 3, character conversion occurs because the Integration Service reads data encoded as ISO-8859-1 but writes the
data encoded as UTF-8:
Figure 3. Writer Converts Data to a Different Encoding
In Figure 3, the reader thread converts the data from ISO-8859-1 encoding to UCS-2 encoding and the writer thread
converts the data from UCS-2 encoding to UTF-8 encoding. The Integration Service reader and writer threads convert
the character values without data corruption. However, the byte size of each character can increase when the writer
thread converts data, since UTF-8 is a multibyte code page. Increased byte size can cause data truncation or overflow.
Note: When you convert data to a different encoding, verify that the characters from the original code page also exist in
the new code page.
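One way to perform the check in the note, sketched in Python under the assumption that a sample of the data is available as text: `unmappable_characters` is a hypothetical helper that reports which characters the target code page cannot represent.

```python
from typing import List

def unmappable_characters(text: str, target_code_page: str) -> List[str]:
    """Return the distinct characters in text that the target code page cannot encode."""
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(target_code_page)
        except UnicodeEncodeError:
            # The character has no mapping in the target code page.
            if ch not in missing:
                missing.append(ch)
    return missing

# The euro sign is not defined in ISO-8859-1, so it is reported.
print(unmappable_characters("10€", "iso-8859-1"))

# UTF-8 can encode every Unicode character, so nothing is reported.
print(unmappable_characters("äbc", "utf-8"))
```

An empty result for a representative sample suggests, but does not prove, that the full data set converts without loss.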
Configuring the Database and Database Client Code Page
Configure the database client code page to the same code page as the database. Character conversion can occur
when the database client code page and the database are not compatible.
The Integration Service uses the database client to read and write data to the database. You configure the database
client code page on the machine where the Integration Service runs. If you have multiple databases of the same type
that require different database client code pages, create an Integration Service for each database client.
In a session, sources and targets that are of the same database type share the same database client. Therefore, the
source database client and the target database client use the same code page.
Writing Data to a Database
When the database client and the database use the same code page, the database client writes data to the database
in the code page of the writer connection object.
When the database client and the database do not use the same code page, the database client converts the data to
the code page of the database.
In Figure 4, character conversion occurs because the database client and the database do not
use the same code page:
Figure 4. Database Client Converts Data to the Encoding of the Database
The database client converts the target data from UTF-8 encoding to ISO-8859-1 encoding before it writes the target
data to the database. As a result, the database client writes the target data encoded as ISO-8859-1 instead of the code
page of the data that it receives from the writer connection object.
Reading Data from a Database
When the database and the database client use the same code page, the database client does not convert data that it
reads to the code page of the database client. Therefore, if you want to read data from a database that uses a different
code page than the data, verify that the database client and the database use the same code page.
When the database and the database client do not use the same code page, the database client converts the data to
the code page of the database client before it sends the data to the reader connection object.
Data Truncation
Data truncation can occur as a result of character conversion. When character conversion occurs, the number of bytes
required to store the same data may change. For example, UTF-8 characters use one to four bytes and ISO-8859-1
characters use one byte.
Data truncation can occur in the following situations:
Reading data. Data truncation occurs when the Integration Service reader receives more bytes from a source than
what is configured in the source qualifier.
Writing data. Data truncation or overflow occurs when the database client attempts to write more bytes to the
target than what is allocated in the precision of the target database table.
Use the following guidelines to avoid data truncation or overflow:
Verify that the source qualifier uses the same precision as the source data that it processes.
Verify that the target definition uses the same precision as the target database table.
When you migrate single byte data to a multibyte database, set the precision in the target database to three times
the size of the single byte data.
When you convert data to a different encoding, verify that the target database tables allocate the required space.
For example, MS932 characters can be 2 bytes. When you convert MS932 characters to UTF-8, each character can
require 3 bytes. Oracle CHAR columns allow a maximum of 2,000 bytes. Data truncation or overflow
occurs when you write 1,000 MS932 characters to a UTF-8 Oracle database because 1,000 MS932 characters can
require 3,000 bytes when the characters are encoded as UTF-8.
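The arithmetic in the last guideline can be checked with a short Python sketch. It assumes that Python's `cp932` codec stands in for MS932 and uses the hiragana character あ as a sample two-byte MS932 character; `required_bytes` and `COLUMN_LIMIT_BYTES` are illustrative names.

```python
def required_bytes(text: str, code_page: str) -> int:
    """Return the number of bytes needed to store text in the given code page."""
    return len(text.encode(code_page))

# 1,000 characters that each take 2 bytes in MS932.
sample = "あ" * 1000

ms932_size = required_bytes(sample, "cp932")
utf8_size = required_bytes(sample, "utf-8")

# Byte limit taken from the example above.
COLUMN_LIMIT_BYTES = 2000
fits = utf8_size <= COLUMN_LIMIT_BYTES

print(ms932_size, utf8_size, fits)
```

Running the same measurement against a representative sample of real data gives a more reliable precision estimate than a fixed multiplier, because the growth depends on which characters the data contains.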
Example of Data Truncation
When you import a source or target definition, the port precision is set to the precision of the data in the database. In
the following figure, the source qualifier and the database have a precision or column length of 3. The database code
page is single-byte and the database client code page is multibyte. Therefore, the database client converts the source
data to multibyte before sending it to the connection object. As a result, data truncation occurs because the source
qualifier has a precision of 3, but it receives four bytes.
Figure 5. Data Truncation at the Source Qualifier
The following table lists each object, the code page of the object, and a description of the data that the object
processes:
| Object | Code Page | Description |
|---|---|---|
| Source: Data | ISO-8859-1 | The data is encoded as ISO-8859-1. The character string äbc is 3 bytes when it is encoded as ISO-8859-1. |
| Source: Oracle Database | ISO-8859-1 | The database uses the ISO-8859-1 code page. |
| Oracle Database Client | UTF-8 | Character conversion occurs. The database client driver reads data using the UTF-8 code page. After character conversion, the data is encoded as UTF-8. ä requires 2 bytes in UTF-8. Therefore, character string äbc is 4 bytes when it is encoded as UTF-8. |
| Reader Connection Object | UTF-8 | The reader connection object reads data using the UTF-8 code page. The reader converts data from UTF-8 to the UCS-2 encoding. |
| Source Qualifier | UCS-2 | The Integration Service truncates the data to fit the Source Qualifier configuration. |
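The byte counts in this example can be reproduced with a minimal Python sketch. The truncation itself is modeled as a simple cut at the byte precision, which is an assumption made for illustration; the article does not specify exactly how the Integration Service truncates. `truncate_to_precision` is a hypothetical helper.

```python
def truncate_to_precision(data: bytes, precision: int) -> bytes:
    """Keep at most `precision` bytes (assumed byte-based truncation)."""
    return data[:precision]

# The string as stored in the ISO-8859-1 database: 3 bytes.
stored = "äbc".encode("iso-8859-1")

# The database client converts the data to UTF-8: ä now takes 2 bytes, 4 in total.
received = stored.decode("iso-8859-1").encode("utf-8")

# The source qualifier precision is 3, so the fourth byte does not fit.
truncated = truncate_to_precision(received, 3)
result = truncated.decode("utf-8", errors="replace")

print(len(stored), len(received), repr(result))
```

In this model the trailing character c is lost even though the string still has only three characters.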
Troubleshooting
When code page settings are incorrect, unexpected character conversion occurs. To troubleshoot where the character
conversion occurs, review the code page settings of the database, database client, and connection objects.
When a database contains data of a different encoding than the database code page, you can migrate the data to a
different database or you can use the Integration Service to convert the data to a different encoding.
This section includes the following topics:
How to Check the Database Client Settings
How to Manage Incorrect Data Encoding Within a Database
How to Check the Database Client Settings
If unexpected character conversion occurs, verify that the database client settings match the database code page.
Oracle
For the Oracle client, you set the database client code page value with the NLS_LANG environment variable on the
machine where the Integration Service runs.
You can use one of the following commands to determine the NLS_LANG setting on the machine where the Integration
Service runs:
On UNIX: SQL> HOST ECHO $NLS_LANG
On Windows: SQL> HOST ECHO %NLS_LANG%
You can use the following query to determine the database character set, and then verify that NLS_LANG is set to match it:
SELECT * FROM V$NLS_PARAMETERS WHERE PARAMETER LIKE 'NLS_CHARACTERSET%'
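To compare the two values in a script, the character set portion of NLS_LANG can be extracted. The sketch below is in Python and relies on the documented NLS_LANG format LANGUAGE_TERRITORY.CHARACTERSET; `nls_lang_charset` is a hypothetical helper, and AL32UTF8 is used only as a sample value.

```python
import os
from typing import Optional

def nls_lang_charset(nls_lang: str) -> Optional[str]:
    """Return the character set part of an NLS_LANG value, or None if it has none."""
    # NLS_LANG has the form LANGUAGE_TERRITORY.CHARACTERSET; the part after
    # the first period is the client character set.
    _, dot, charset = nls_lang.partition(".")
    return charset if dot and charset else None

# Read the setting from the environment of the current process.
current = os.environ.get("NLS_LANG", "")
print(nls_lang_charset(current))

# Sample value for illustration.
print(nls_lang_charset("AMERICAN_AMERICA.AL32UTF8"))
```

The extracted value can then be compared with the NLS_CHARACTERSET value that the query returns.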
DB2
For the DB2 client, you set the database client code page value with the DB2CODEPAGE environment variable on the
machine where the Integration Service runs.
If you are an instance owner, you can use the following command to determine the code page of the database:
db2 get db cfg for <dbname>
Sybase
For Sybase, you can set the database and database client code page with the LANG setting on the machine or the
setting in the locales.dat file. The locales.dat file is located in the $SYBASEHOME/locales directory.
You can read the locales.dat file to determine the database client code page setting.
Microsoft SQL Server
You can use sp_helpsort to determine the code page of the database. For more information:
http://msdn2.microsoft.com/en-us/library/aa933426(SQL.80).aspx
How to Manage Incorrect Data Encoding Within a Database
If a database contains data of incorrect encoding, you can migrate the data to a database that uses the same code
page or you can change the encoding of the data.
How to Convert Data to a Different Encoding
When the data in a database is not in the correct encoding you can use the Integration Service to convert the data to
the correct encoding.
In this example, the session reads data from a database that uses a code page that is not the same as the code page
of the data stored in the database. Character conversion does not occur when the database client reads the data from
the database because the database client and the database use the same code page. The writer converts the data to
UTF-8 encoding before sending the data to the database client because the writer connection object uses the UTF-8
code page.
Note: Before you convert data to a different encoding, verify that the characters from the original code page also exist
in the new code page.
Figure 6. Converting ISO-8859-1 Data to UTF-8 Encoding
The following table lists each object, the code page of the object, and a description of the data that the object
processes:
| Object | Code Page | Description |
|---|---|---|
| Source: Data | ISO-8859-1 | The data is encoded as ISO-8859-1. |
| Source: Oracle Database | UTF-8 | The database code page is UTF-8. |
| Source: Oracle Database Client | UTF-8 | The database client driver reads and writes data using the UTF-8 code page. Since the database client and the database use the same code page, no character conversion occurs. |
| Reader Connection Object | ISO-8859-1 | The reader connection object reads data using the ISO-8859-1 code page. The reader converts data from ISO-8859-1 to the UCS-2 encoding. |
| Writer Connection Object | UTF-8 | Character conversion occurs. The writer converts the data from UCS-2 encoding to the UTF-8 encoding. The writer connection object sends the target data to the database client. |
| Target: Oracle Database Client | UTF-8 | The database client driver writes data using the UTF-8 code page. |
| Target: Oracle Database | UTF-8 | The database code page is UTF-8. The target table has allocated space for the multibyte data it receives. |
| Target: Data | UTF-8 | The data is encoded as UTF-8. |
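The situation in this example, ISO-8859-1 bytes stored in a UTF-8 database, can be illustrated with a Python sketch. It shows one way to detect such data and what the conversion does to the bytes; it is not a description of how the Integration Service works internally, and `is_valid` is a hypothetical helper.

```python
def is_valid(data: bytes, code_page: str) -> bool:
    """Return True if data decodes without error in the given code page."""
    try:
        data.decode(code_page)
        return True
    except UnicodeDecodeError:
        return False

# ISO-8859-1 bytes stored in a database whose code page is UTF-8.
stored = "äbc".encode("iso-8859-1")

# The byte 0xE4 starts a multibyte sequence in UTF-8, but the next byte is not
# a continuation byte, so the data is not valid UTF-8.
valid_as_utf8 = is_valid(stored, "utf-8")

# Reading with the real encoding (as the reader connection object does) and
# writing as UTF-8 (as the writer connection object does) repairs the data.
repaired = stored.decode("iso-8859-1").encode("utf-8")

print(valid_as_utf8, repaired, is_valid(repaired, "utf-8"))
```

A failed decode shows that the data is not in the expected encoding. A successful decode is weaker evidence, because some byte sequences are valid in more than one code page.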
How to Migrate Data Between Databases of Different Encoding
When data in a database is not in the same encoding as the database code page, you can use the Integration Service
to migrate the data to a database that uses the same encoding.
In this example, the session reads MS932 data from a database that uses a UTF-8 code page. Since the database
client and the source data use the same code page, no character conversion occurs when the database client retrieves
the data from the database.
Figure 7. Migrating MS932 Data from a UTF-8 Database to an MS932 Database
The following table lists each object, the code page of the object, and a description of the data that the object
processes:
| Object | Code Page | Description |
|---|---|---|
| Source: Data | MS932 | The data is encoded as MS932. |
| Source: DB2 Database | UTF-8 | The database uses the UTF-8 code page. |
| Source: DB2 Database Client | MS932 | The database client driver reads the data using the MS932 code page. No character conversion occurs because the data is encoded as MS932. |
| Reader Connection Object | MS932 | The reader connection object receives source data from the database client. The reader thread converts the data from MS932 to UCS-2 encoding. |
| Writer Connection Object | MS932 | The writer thread converts the data from UCS-2 to the MS932 encoding. The writer connection object sends the target data to the database client. |
| Target: DB2 Database Client | MS932 | The database client writes data using the MS932 code page. |
| Target: DB2 Database | MS932 | The database uses the MS932 code page. |
| Target: Data | MS932 | The data is encoded as MS932. |
Author
Padma Heid
Technical Writer
Acknowledgements
The author would like to thank Wenxin He and Venu Gangu for their contributions to this article.