
Understanding Code Pages and Character Conversion

© 2008-2009 Informatica Corporation

Abstract

Code page character conversion can occur when data passes between databases, database clients, and PowerCenter connection objects that do not use the same code pages. Character conversion can result in data corruption, data truncation, or data overflow. To avoid unexpected character conversion, it is important to understand how code pages affect the characters you process. This article describes how to configure code pages for database clients and PowerCenter connection objects. It also explains where character conversion occurs and the impact of character conversion.

Table of Contents

Overview
Configuring Code Page Settings
Configuring the Connection Object Code Page
    Configuring the Reader Connection Object
    Configuring the Writer Connection Object
Configuring the Database and Database Client Code Page
    Writing Data to a Database
    Reading Data from a Database
Data Truncation
    Example of Data Truncation
Troubleshooting
    How to Check the Database Client Settings
    How to Manage Incorrect Data Encoding Within a Database

Overview

Code pages specify the encoding of characters by mapping each character to a hexadecimal value. Code page character conversion can occur when data passes between databases, database clients, and PowerCenter connection objects that do not use the same code page. When the code pages are not the same, characters can convert to incorrect or undefined values. The result of character conversion can be data corruption, data truncation, or data overflow.

Character conversion can happen during a PowerCenter session. When a PowerCenter session reads and writes data in Unicode data movement mode, the following steps occur:

1. The Integration Service spawns a reader thread to read the data. The reader thread uses a connection object to connect to the database client.
2. The database client reads data from the database. If the database client and the database are not compatible, unexpected character conversion can occur.
3. The database client sends the data to the reader connection object.
4. The reader thread converts the data from the code page of the reader connection object to the UCS-2 code page. If the reader connection object and the source data use different code pages, unexpected character conversion can occur.
5. The Integration Service processes the data and sends the target data to the writer thread.
6. The writer thread converts data from the UCS-2 code page to the code page of the writer connection object. The writer connection object sends target data to the database client.
7. The database client writes data to the database. If the database client code page is not compatible with the database code page, unexpected character conversion can occur.

To avoid unexpected character conversion, use the same code page for the PowerCenter connection object, the source/target data, the database client, and the database.

Configuring Code Page Settings

You can set code page values for the following objects:

- PowerCenter connection object. The Integration Service uses the connection object code page to read or write data. Configure the connection object code page in the session connection object.
- Source/target database character set. The database code page specifies the code page of the character data within the database. Configure the code page of the database when you create it.
- Database client. The database client driver uses the code page of the database client to read data from and write data to the database. Configure the code page of the database client on each machine that runs the Integration Service.

When you configure the connection objects, database, and database client to use the same code page as the data, unexpected character conversion does not occur. In Figure 1, each object in the session uses the same code page:

Figure 1. No Unexpected Character Conversion Occurs

Since the Integration Service processes data internally using the UCS-2 code page, the following expected data conversion occurs:

- The reader thread converts source data to the UCS-2 encoding.
- The writer thread converts target data from UCS-2 to the code page of the writer connection object.
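The reader and writer conversions described above can be illustrated outside PowerCenter. The following is a minimal Python sketch, not PowerCenter code: a Python string stands in for the internal UCS-2 representation, the function names are hypothetical, and cp932 is Python's name for the MS932 code page. It shows that decoding with the code page the data was actually encoded in preserves the characters, while decoding the same bytes with a different code page does not.

```python
# Minimal sketch (assumption: a Python str stands in for the internal UCS-2 data;
# the function names below are hypothetical and are not PowerCenter APIs).

def reader_to_internal(raw: bytes, reader_code_page: str) -> str:
    """Decode source bytes using the reader connection object's code page."""
    return raw.decode(reader_code_page, errors="replace")


def internal_to_writer(text: str, writer_code_page: str) -> bytes:
    """Encode internal text using the writer connection object's code page."""
    return text.encode(writer_code_page, errors="replace")


# The string "\u00e4bc" is a-umlaut followed by "bc", stored as ISO-8859-1 bytes.
source_bytes = "\u00e4bc".encode("iso-8859-1")

# Reader code page matches the data: the characters survive.
matched = reader_to_internal(source_bytes, "iso-8859-1")

# Reader code page (MS932, "cp932" in Python) does not match the data:
# the same bytes are interpreted as different or undefined characters.
mismatched = reader_to_internal(source_bytes, "cp932")

print(matched == "\u00e4bc")     # True
print(mismatched == "\u00e4bc")  # False

# Writing with the same code page as the source reproduces the original bytes.
print(internal_to_writer(matched, "iso-8859-1") == source_bytes)  # True
```

The sketch uses errors="replace" so that undefined byte sequences show up as substitute characters instead of raising an exception, which mirrors the "incorrect or undefined values" outcome only loosely.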
Configuring the Connection Object Code Page

When the Integration Service runs in Unicode data movement mode, the Integration Service processes data internally using the UCS-2 code page. The reader and writer threads convert the data to and from UCS-2 encoding, based on the code page specified in the connection object.

Character conversion can occur when the Integration Service runs in Unicode data movement mode and the data does not use the same code page as the connection object. Configure the connection object code page to be the same as the code page of the data that it reads or writes.

For example, if the database client converts the ISO-8859-1 data from an ISO-8859-1 database to UTF-8, set the reader connection object code page to UTF-8. If you read ISO-8859-1 encoded data from a UTF-8 database and the database client uses the UTF-8 code page, set the reader connection object code page to ISO-8859-1.

Configuring the Reader Connection Object

The reader connection object code page should be the same as the code page of the data that it reads. When the code page specified in the reader connection object does not match the code page that the source data is encoded with, the UCS-2 data that the Integration Service processes can be corrupted. In Figure 2, data corruption occurs because the reader connection object uses the MS932 code page but the data is encoded as ISO-8859-1:

Figure 2. Reader Converts Data Incorrectly

In Figure 2, the reader interprets the ISO-8859-1 source data as MS932 when it converts the data to UCS-2 encoding. Because the MS932 code page does not match the encoding of the data that the reader receives, the Integration Service may process data that contains incorrect or undefined values.

Configuring the Writer Connection Object

Configure the writer connection object code page to the same code page as the data that it writes. To write target data in the same encoding as the source data, configure the writer connection object code page to the same code page as the reader connection object. To write target data in a different encoding than the source data, configure the writer connection object code page to the code page of the data that you want to write. In Figure 3, character conversion occurs because the Integration Service reads data encoded as ISO-8859-1 but writes the data encoded as UTF-8:

Figure 3. Writer Converts Data to a Different Encoding

In Figure 3, the reader thread converts the data from ISO-8859-1 encoding to UCS-2 encoding and the writer thread converts the data from UCS-2 encoding to UTF-8 encoding. The Integration Service reader and writer threads convert the character values without data corruption. However, the byte size of each character can increase when the writer thread converts data, since UTF-8 is a multibyte code page. Increased byte size can cause data truncation or overflow.

Note: When you convert data to a different encoding, verify that the characters from the original code page also exist in the new code page.

Configuring the Database and Database Client Code Page

Configure the database client code page to the same code page as the database. Character conversion can occur when the database client code page and the database code page are not compatible.

The Integration Service uses the database client to read and write data to the database. You configure the database client code page on the machine where the Integration Service runs. If you have multiple databases of the same type that require different database client code pages, create an Integration Service for each database client.

In a session, sources and targets that are of the same database type share the same database client. Therefore, the source and target database clients use the same code page.
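To make the byte-size growth and the note about missing characters from the writer conversion discussion concrete, the following is a minimal Python sketch. The helper names are hypothetical and the sample strings are illustrative. It only measures encoded sizes and reports characters that have no mapping in a target code page; it does not model PowerCenter itself.

```python
# Minimal sketch (hypothetical helper names; sample strings are illustrative).

def encoded_size(text: str, code_page: str) -> int:
    """Return the number of bytes the text occupies in the given code page."""
    return len(text.encode(code_page))


def unmappable_characters(text: str, code_page: str) -> list:
    """Return the characters in text that do not exist in the given code page."""
    missing = []
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


sample = "\u00e4bc"  # a-umlaut followed by "bc"

print(encoded_size(sample, "iso-8859-1"))  # 3
print(encoded_size(sample, "utf-8"))       # 4

# A column sized for 3 bytes no longer holds the UTF-8 form of the same string.
print(encoded_size(sample, "utf-8") > 3)   # True

# U+65E5 (a kanji) exists in UTF-8 but not in ISO-8859-1.
print(unmappable_characters("\u00e4\u65e5", "iso-8859-1"))  # ['日']
print(unmappable_characters("\u00e4\u65e5", "utf-8"))       # []
```

Running a check of this kind on a sample of the source data before changing the writer connection object code page gives an early indication of both risks: growth in byte size and characters that the new code page cannot represent.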
Writing Data to a Database

When the database client and the database use the same code page, the database client writes data to the database in the code page of the writer connection object. When the database client and the database do not use the same code page, the database client converts the data to the code page of the database. In Figure 4, character conversion occurs because the database client and the database do not use the same code page:

Figure 4. Database Client Converts Data to the Encoding of the Database

The database client converts the target data from UTF-8 encoding to ISO-8859-1 encoding before it writes the target data to the database. As a result, the database client writes the target data encoded as ISO-8859-1 instead of the code page of the data that it receives from the writer connection object.

Reading Data from a Database

When the database and the database client use the same code page, the database client does not convert the data that it reads. Therefore, if you want to read data from a database that uses a different code page than the data, verify that the database client and the database use the same code page.

When the database and the database client do not use the same code page, the database client converts the data to the code page of the database client before it sends the data to the reader connection object.

Data Truncation

Data truncation can occur as a result of character conversion. When character conversion occurs, the number of bytes required to store the same data may change. For example, UTF-8 characters use one to four bytes and ISO-8859-1 characters use one byte.

Data truncation can occur in the following situations:

- Reading data. Data truncation occurs when the Integration Service reader receives more bytes from a source than what is configured in the source qualifier.
- Writing data. Data truncation or overflow occurs when the database client attempts to write more bytes to the target than what is allocated in the precision of the target database table.

Use the following guidelines to avoid data truncation or overflow:

- Verify that the source qualifier uses the same precision as the source data that it processes.
- Verify that the target definition uses the same precision as the target database table.
- When you migrate single-byte data to a multibyte database, set the precision in the target database to three times the size of the single-byte data.
- When you convert data to a different encoding, verify that the target database tables allocate the required space. For example, MS932 characters can be 2 bytes. When you convert double-byte MS932 characters to UTF-8, each character typically requires 3 bytes. Oracle limits a CHAR column to 2000 bytes. Data truncation or overflow occurs when you write 1,000 MS932 characters to such a column in a UTF-8 Oracle database, because 1,000 MS932 characters can require 3000 bytes when the characters are encoded as UTF-8.

Example of Data Truncation

When you import a source or target definition, the port precision is set to the precision of the data in the database. In the following figure, the source qualifier and the database have a precision or column length of 3. The database code page is single-byte and the database client code page is multibyte. Therefore, the database client converts the source data to multibyte before sending it to the connection object. As a result, data truncation occurs because the source qualifier has a precision of 3, but it receives four bytes.

Figure 5. Data Truncation at the Source Qualifier

The following table lists each object, the code page of the object, and a description of the data that the object processes:

Object | Code Page | Description
Source: Data | ISO-8859-1 | The data is encoded as ISO-8859-1. The character string äbc is 3 bytes when it is encoded as ISO-8859-1.
Source: Oracle Database | ISO-8859-1 | The database uses the ISO-8859-1 code page.
Oracle Database Client | UTF-8 | Character conversion occurs. The database client driver reads data using the UTF-8 code page. After character conversion, the data is encoded as UTF-8. ä requires 2 bytes in UTF-8. Therefore, the character string äbc is 4 bytes when it is encoded as UTF-8.
Reader Connection Object | UTF-8 | The reader connection object reads data using the UTF-8 code page. The reader converts data from UTF-8 to the UCS-2 encoding.
Source Qualifier | UCS-2 | The Integration Service truncates the data to fit the Source Qualifier configuration.

Troubleshooting

When code page settings are incorrect, unexpected character conversion occurs. To troubleshoot where the character conversion occurs, review the code page settings of the database, database client, and connection objects. When a database contains data of a different encoding than the database code page, you can migrate the data to a different database or you can use the Integration Service to convert the data to a different encoding.

This section includes the following topics:

- How to Check the Database Client Settings
- How to Manage Incorrect Data Encoding Within a Database

How to Check the Database Client Settings

If unexpected character conversion occurs, verify that the database client settings match the database code page.

Oracle

For the Oracle client, you set the database client code page value with the NLS_LANG environment variable on the machine where the Integration Service runs.
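As a rough illustration of what the NLS_LANG value carries, the following minimal Python sketch extracts the character set portion, assuming the usual LANGUAGE_TERRITORY.CHARSET layout of the variable. The helper name is hypothetical, and the value AMERICAN_AMERICA.AL32UTF8 is only an illustrative example, not a recommended setting.

```python
# Minimal sketch (hypothetical helper; assumes the LANGUAGE_TERRITORY.CHARSET
# layout of NLS_LANG; the sample value is illustrative only).
import os


def nls_lang_charset(value):
    """Return the character set portion of an NLS_LANG value, or None if absent."""
    if not value or "." not in value:
        return None
    charset = value.split(".", 1)[1].strip()
    return charset or None


# Value from the current environment (None when NLS_LANG is not set).
print(nls_lang_charset(os.environ.get("NLS_LANG")))

# Illustrative value.
print(nls_lang_charset("AMERICAN_AMERICA.AL32UTF8"))  # AL32UTF8
```

The character set extracted this way is the database client code page that should be compared with the database character set.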
You can use one of the following commands to determine the NLS_LANG setting on the machine where the Integration Service runs:

On UNIX:

    SQL> HOST ECHO $NLS_LANG

On Windows:

    SQL> HOST ECHO %NLS_LANG%

You can use the following query to determine the character set of the database, and then verify that NLS_LANG is set to match it:

    SELECT * FROM V$NLS_PARAMETERS WHERE PARAMETER LIKE 'NLS_CHARACTERSET%'

DB2

For the DB2 client, you set the database client code page value with the DB2CODEPAGE environment variable on the machine where the Integration Service runs. If you are an instance owner, you can use the following command to determine the code page of the database:

    db2 get db cfg for <dbname>

Sybase

For Sybase, you can set the database and database client code page with the LANG setting on the machine or the setting in the locales.dat file. The locales.dat file is located in the $SYBASEHOME/locales directory. You can read the locales.dat file to determine the database client code page setting.

Microsoft SQL Server

You can use sp_helpsort to determine the code page of the database. For more information:

http://msdn2.microsoft.com/en-us/library/aa933426(SQL.80).aspx

How to Manage Incorrect Data Encoding Within a Database

If a database contains data of incorrect encoding, you can migrate the data to a database that uses the same code page or you can change the encoding of the data.

How to Convert Data to a Different Encoding

When the data in a database is not in the correct encoding, you can use the Integration Service to convert the data to the correct encoding.

In this example, the session reads data from a database that uses a code page that is not the same as the code page of the data stored in the database. Character conversion does not occur when the database client reads the data from the database because the database client and the database use the same code page. The writer converts the data to UTF-8 encoding before sending the data to the database client because the writer connection object uses the UTF-8 code page.

Note: Before you convert data to a different encoding, verify that the characters from the original code page also exist in the new code page.

Figure 6. Converting ISO-8859-1 Data to UTF-8 Encoding

The following table lists each object, the code page of the object, and a description of the data that the object processes:

Object | Code Page | Description
Source: Data | ISO-8859-1 | The data is encoded as ISO-8859-1.
Source: Oracle Database | UTF-8 | The database code page is UTF-8.
Source: Oracle Database Client | UTF-8 | The database client driver reads and writes data using the UTF-8 code page. Since the database client and the database use the same code page, no character conversion occurs.
Reader Connection Object | ISO-8859-1 | The reader connection object reads data using the ISO-8859-1 code page. The reader converts data from ISO-8859-1 to the UCS-2 encoding.
Writer Connection Object | UTF-8 | Character conversion occurs. The writer converts the data from UCS-2 encoding to the UTF-8 encoding. The writer connection object sends the target data to the database client.
Target: Oracle Database Client | UTF-8 | The database client driver writes data using the UTF-8 code page.
Target: Oracle Database | UTF-8 | The database code page is UTF-8. The target table has allocated space for the multibyte data it receives.
Target: Data | UTF-8 | The data is encoded as UTF-8.

How to Migrate Data Between Databases of Different Encoding

When data in a database is not in the same encoding as the database code page, you can use the Integration Service to migrate the data to a database that uses the same encoding.

In this example, the session reads MS932 data from a database that uses a UTF-8 code page. Since the database client and the source data use the same code page, no character conversion occurs when the database client retrieves the data from the database.

Figure 7. Migrating MS932 Data from a UTF-8 Database to an MS932 Database

The following table lists each object, the code page of the object, and a description of the data that the object processes:

Object | Code Page | Description
Source: Data | MS932 | The data is encoded as MS932.
Source: DB2 Database | UTF-8 | The database uses the UTF-8 code page.
Source: DB2 Database Client | MS932 | The database client driver reads the data using the MS932 code page. No character conversion occurs because the data is encoded as MS932.
Reader Connection Object | MS932 | The reader connection object receives source data from the database client. The reader thread converts the data from MS932 to UCS-2 encoding.
Writer Connection Object | MS932 | The writer thread converts the data from UCS-2 to the MS932 encoding. The writer connection object sends the target data to the database client.
Target: DB2 Database Client | MS932 | The database client writes data using the MS932 code page.
Target: DB2 Database | MS932 | The database uses the MS932 code page.
Target: Data | MS932 | The data is encoded as MS932.

Author

Padma Heid
Technical Writer

Acknowledgements

The author would like to thank Wenxin He and Venu Gangu for their contributions to this article.
