FAQ: How CDC refresh works

Q: Does CDC need to stop mirroring when performing a REFRESH operation?
A: Before refreshing a set of tables from source to target, the subscription must be stopped if it is currently mirroring. Once the subscription is stopped, a number of tables can be selected for a REFRESH operation.

Q: What happens when mirroring is restarted for the subscription which initiated the REFRESH operation?
A: Once CDC has completed the REFRESH operation successfully, mirroring can be restarted. CDC will then process the backlog of changes made to the database since mirroring was stopped (which was just before the REFRESH operation began).

Q: Does CDC perform REFRESH by transferring multiple tables in parallel?
A: CDC can REFRESH a set of selected tables as one operation, but will only process a REFRESH for one table at a time within a single subscription. To perform a parallel refresh, multiple subscriptions can be used.

Q: Can a subset of tables be part of REFRESH, or do all tables and/or subscriptions need to be part of a REFRESH operation?
A: CDC will REFRESH a set of tables selected within a single subscription as one operation, which runs to completion. Each table is refreshed individually until all selected tables have finished REFRESH. Since the operation applies within a single subscription, other subscriptions are not affected; they may continue mirroring data for different tables, or refreshing different tables, as required.

Q: What source database configuration or table specification would prevent the use of REFRESH?
A: CDC can support REFRESH on any tables which are supported for mirroring.

Q: How does CDC retrieve data for the REFRESH operation?
A: CDC queries the source database table for a REFRESH operation, which causes a table scan to provide the rows.

Q: Is there such a thing as tables "too large" to REFRESH?
A: Practically, yes.
The CDC source engine REFRESH query may cause the source database to perform significant read I/O during the REFRESH. The database may take many hours to perform a "table scan" operation to provide the rows. Scanning large tables can cause significant disk I/O contention for databases which have changing data, or maintenance operations which are active, during the REFRESH operation.

Q: Are there alternatives for REFRESH?
A: Customers use database extract / import, backup / restore, or "insert into ... select * from" (a remote database linked table) operations instead of REFRESH when these options are more suitable.

Q: Does CDC perform an ORDER BY sort when retrieving data for REFRESH?
A: During a standard REFRESH, no ORDER BY is used, so the database is free to return data in whatever order it chooses. During "Differential" REFRESH operations, CDC must query using an ORDER BY on the table keys (as mapped in CDC) to sort the source and target tables in order to determine the differences between the two tables.

Q: Does CDC use bulk extract utilities to obtain rows for REFRESH?
A: CDC uses a SQL query to obtain rows during REFRESH. CDC does not use bulk extract.

Q: Can CDC query a separate backup/copy/standby database rather than the source database during a REFRESH to avoid read I/O on the source database?
A: CDC queries the source database tables directly, and does not have an option to redirect queries to a backup or standby database, or to tables stored in a different database. The source table mapped for a subscription is the one queried.

Q: Does CDC require source database tables to be quiescent during the REFRESH?
A: CDC has "REFRESH while active" logic that allows REFRESH during periods where the source database is processing changes (Insert, Update, Delete) to tables involved in the REFRESH operation.

Q: What overhead does the REFRESH place on CPU, disk I/O, etc.?
A: CDC uses very little CPU on the source system during standard REFRESH operations.
The majority of the processing is disk read I/O on the tables involved in the REFRESH.

Q: How does CDC obtain a transactionally complete 'snapshot' for tables that have active changes being performed on them?
A: Oracle specific: CDC opens a transactional read-only snapshot query on the table being refreshed, which causes undo space in the source database to be used for the duration of the query. The DBA should review this, as very large tables (>50M rows) which are changed frequently (>100 changes per second) can require larger amounts of undo space.
A: In general, CDC supports a capability known as "REFRESH while active". The source database log position at the time of the source database REFRESH query is stored as the "start point" in the CDC source metadata. When the REFRESH completes for this table, the log position of the source database is stored as the "end point" in the CDC source metadata. When mirroring is restarted for this table, changes that were made between the start and end points are sent to the target with an "in-doubt" flag. The CDC target applies these changes as normal, but if an error occurs it checks for the "in-doubt" flag and ignores the error, because the row was already replicated during the refresh and the change therefore caused a duplicate INSERT, a DELETE of a row not found, or another violation.

Q: When does the target table get truncated for the refresh?
A: The CDC source sends a "START_REFRESH" message before selecting rows from the table. The CDC target truncates the table when it receives the "START_REFRESH" message. The CDC source then sends data records for the target table, which are applied until completion or error.

Q: Are there any recommendations for maintenance procedures or other operations to be commenced before a REFRESH?
A: CDC will query the source database, leading to increased disk read I/O for those tables which will be part of the REFRESH.
CDC then applies the refreshed rows to the target database tables. If you have database maintenance procedures such as backup, re-index, or other disk-intensive operations scheduled, these may cause some level of disk I/O contention. The DBA team should review the opportunity to schedule the REFRESH of large tables when it best suits the source and target databases.

Q: How is data loaded into a DB2 UDB LUW database by the CDC DB2 target engine REFRESH?
A: CDC DB2 uses the DB2 bulk load utility to INSERT the refreshed rows into the target database. This behavior can be changed to use a JDBC SQL based INSERT operation if the customer decides not to use bulk load. Note however that bulk load is by far the fastest method of loading refresh data into a target database in most cases.

Q: How is data loaded into an Oracle database by the CDC Oracle target replication engine during a REFRESH?
A: CDC Oracle uses the Oracle OCI DirectPathLoad bulk loader API to INSERT the refreshed rows into the target database. The OCI DirectPathLoad API avoids staging bulk load files on disk by utilizing in-memory loading. This behavior can be changed to use a JDBC SQL based INSERT operation if the customer decides not to use bulk load. Note however that bulk load is by far the fastest method of loading refresh data into a target database in most cases.

Q: How is data loaded into a Teradata database by the CDC Teradata target replication engine during a REFRESH?
A: CDC Teradata uses the Teradata FASTLOAD bulk load utility to INSERT the refreshed rows into the target database. This behavior can be changed to use a JDBC SQL based INSERT operation if the customer decides not to use bulk load. Note however that bulk load is by far the fastest method of loading refresh data into a target database in most cases.

Q: How is data loaded into DB2/z databases by the CDC target replication engine during a REFRESH?
A: CDC uses the DB2-CLI API with SQL based batch INSERT operations to load the target table.
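
Illustration only: the general idea of a "SQL based batch INSERT" load can be sketched as below. This is not CDC code and does not use DB2-CLI; Python's built-in sqlite3 module stands in for the target database, and the function name batched_insert, the table target_t, and the batch size are all hypothetical choices made for the example.

```python
import sqlite3
from itertools import islice


def batched_insert(conn, table, columns, rows, batch_size=1000):
    """Insert rows in fixed-size batches, committing once per batch.

    `table` and `columns` are trusted identifiers in this sketch; only the
    row values are passed as bound parameters.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))  # take the next batch of rows
        if not batch:
            break
        conn.executemany(sql, batch)  # one prepared statement, many rows
        conn.commit()
        total += len(batch)
    return total


def demo():
    # In-memory SQLite database used purely as a stand-in target.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target_t (id INTEGER PRIMARY KEY, name TEXT)")
    rows = ((i, f"row-{i}") for i in range(2500))
    inserted = batched_insert(conn, "target_t", ["id", "name"], rows)
    count = conn.execute("SELECT COUNT(*) FROM target_t").fetchone()[0]
    return inserted, count
```

Batching amortizes statement preparation and commit overhead across many rows. The rows are still inserted through ordinary SQL, unlike a bulk load utility.
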

Q: How is data loaded into DB2/400 (iSeries, IBM i) databases by the CDC target replication engine during a REFRESH?
A: CDC DB2/400 populates the target table file directly using native DB2/400 I/O operations which avoid the SQL libraries. This method has the highest performance for loading data into the database.

Q: How is data loaded into other databases by the CDC target replication engine during a REFRESH?
A: CDC uses the native database bulk load utility to INSERT the refreshed rows into the target database. This behavior can be changed to use a JDBC SQL based INSERT operation if the customer decides not to use bulk load. Note however that bulk load is by far the fastest method of loading refresh data into a target database in most cases.

Q: What order is used when performing a REFRESH with multiple tables?
A: The order in which each individual table is refreshed is based on the group order. Group order is set via Management Console. If all tables have the same group order, they are refreshed in the order in which they are stored in the CDC metadata.

Q: How does CDC support REFRESH for tables with referential integrity?
A: Use the Table Group order facility via Management Console to arrange the order in which tables are refreshed so that it stays within the constraints imposed on the tables. Note: at least as of 6.3, refresh of tables with RI is believed not to be supported with the default configuration; the user would have to set some system parameters. See JIRA JUDB-1275 and JORA-1174.

Q: Does archive log only replication, read-only source database or other CDC features affect the REFRESH operation?
A: There are no features of CDC which prevent restart of mirroring following a REFRESH.

Q: What method is faster, non-logged bulk load or SQL INSERT based using JDBC / CLI?
A: CDC will attempt to use the native bulk load interface supported by a particular database platform and release. Bulk load operations are typically not logged, and many databases have implemented shortcuts that load data into tables faster than SQL based INSERT operations.

Q: When does CDC switch from bulk load to the SQL method?
A: The answer depends on platform, configuration and settings:
- The user can disable bulk loading through a system parameter.
- When the table mapping is set for Live Audit, CDC does not bulk load, as the audit table needs to be appended to and not re-loaded.
- The presence of LOB columns will cause the CDC target to use the JDBC loader on some platforms, such as SQL Server, where the bulk load interface does not support LOB. Other platforms, such as Oracle, support LOB during OCI DirectPathLoad.
- The JDBC apply will be used if user exits are configured, as CDC cannot know whether the user exit references data in the target table, which would require rows to be inserted in transactions and therefore be immediately visible to the user exit code.
- The JDBC apply will be selected if target columns have non-ASCII character names on platforms, such as SQL Server, where the bulk load interface has such limitations.
- CDC for Informix does not use a bulk loader.

Q: What is the preferred REFRESH loader method on the DB2/z platform?
A: Typically the SQL based INSERT method of REFRESH is preferred because it is easier to configure. The bulk load interface on DB2/z requires configuration in order to utilize the LOAD utility. The DB2/z LOAD utility requires that CDC stage the entire table to a disk file before starting LOAD. Table sizes may affect load speed depending on the method chosen: For small tables, the SQL method is typically more optimal, given the overhead of staging bulk load files and calling the loader. For medium tables which comfortably fit on DASD, the LOAD method is typically more optimal than using SQL based INSERTs.
For very large tables, the significant disk resource requirement of staging the entire LOAD file may not match existing resource availability, and LOAD may not perform significantly better than the default SQL based REFRESH.

Q: Does CDC drop indexes on target tables before loading rows?
A: The answer depends on the platform:
- CDC for DB2/z and CDC for DB2/400 do not drop indexes prior to load.
- CDC for DB2 UDB by default does not drop indexes, but can be configured by system parameter to optionally drop and re-create indexes when bulk loading.
- CDC for Oracle will drop indexes prior to load and recreate them afterwards.
- CDC for SQL Server and CDC for Sybase will drop indexes prior to load and recreate them afterwards when using the bulk loader.

Q: How can REFRESH time be reduced for tables with many indexes?
A: CDC may drop indexes on the target table prior to loading rows, depending on the load method available. When REFRESH completes, CDC will recreate any indexes it had previously dropped, one at a time, until all indexes are recreated. CDC then moves on to the next table to REFRESH. To optimize the loading of multiple tables, and especially those with many indexes, manually drop all indexes except the primary key on the target tables prior to REFRESH. When CDC has reported that the table has finished its REFRESH operation, manually re-create the indexes outside of the CDC product. While some databases provide for parallel index recreation, this may result in CPU and I/O bottlenecks on the target database.

Q: How does recursion prevention affect REFRESH in a bi-directional replication scenario?
A: The CDC target engine writes an entry to the DM_BOOKMARK table when changing records in the target database. When the CDC recursion prevention feature is enabled, the CDC log scraper detects when a transaction contains a change to the DM_BOOKMARK table, and discards these transactions, which originated from CDC.
The only time CDC does not write an entry to the DM_BOOKMARK table during replication is during REFRESH bulk load operations, which are not logged and therefore would not be replicated by CDC.

Q: What is "Differential Refresh" and how does it work?
A: The standard REFRESH method truncates the target table before bulk loading rows. The Differential method, available in CDC 6.2 and later, keeps the target table online during the REFRESH operation.
- Mirroring is stopped during refresh (same as standard refresh)
- Available for “standard replication” table mappings
- Target table remains online during refresh (for reporting / other uses)
- Requires same table structure for source and target
- Rows are compared by sort order of “key” columns
- Entire table is processed (no subset)
- All source table rows are sent to target
Repair differences between source and target:
- Rows that are on source but not on target
- Rows that are on target but not on source
- Rows that differ in contents between source and target
- Differences can be logged to audit log table
- By default, CDC apply merges rows into the target table (it can optionally be configured to only audit the changes)
- Uses SQL apply (not bulk load)
User interface:
- Management Console option on refresh
- Command line
- Table by table
- User initiated

Q: What additional REFRESH capabilities may become available in future CDC product releases?
A: CDC source and target (in a future release) may automatically detect and support the case where, during a REFRESH operation on a table, primary keys are modified by an UPDATE statement on the table. An UPDATE that changes a primary key value is rare and has only been reported by customers a few times. The workaround is to refresh such tables when the affected database tables are idle, meaning that no UPDATE on keys is being performed.
A: CDC source and target (in a future release) may support refreshing a subset of the table via a WHERE clause that can be specified to reduce the amount of data replicated during REFRESH. This is useful for tables with many partitions, where only the newest data needs to be refreshed because of an operational issue affecting data which was newly replicated / in scope.
A: CDC target (in a future release) may contain multiple parallel threads to improve refresh performance for large tables.
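
Illustration only: the key-ordered comparison described under "Differential Refresh" can be pictured as a merge over two sorted row streams. The sketch below is not CDC's implementation. It assumes both inputs are already sorted on the same key (as an ORDER BY on the key columns would provide), represents rows as plain tuples, and uses a hypothetical function name, differential_refresh, with hypothetical action labels.

```python
def differential_refresh(source_rows, target_rows, key):
    """Walk two key-ordered row iterables and list the repairs needed.

    Returns a list of (action, row) pairs, where action is "INSERT"
    (row on source only), "DELETE" (row on target only) or "UPDATE"
    (same key, different contents).
    """
    src, tgt = iter(source_rows), iter(target_rows)
    s, t = next(src, None), next(tgt, None)
    actions = []
    while s is not None or t is not None:
        if t is None or (s is not None and key(s) < key(t)):
            # Source row has no counterpart on the target.
            actions.append(("INSERT", s))
            s = next(src, None)
        elif s is None or key(t) < key(s):
            # Target row has no counterpart on the source.
            actions.append(("DELETE", t))
            t = next(tgt, None)
        else:
            # Same key on both sides: repair only if contents differ.
            if s != t:
                actions.append(("UPDATE", s))
            s, t = next(src, None), next(tgt, None)
    return actions


source = [(1, "a"), (2, "b"), (4, "d")]
target = [(2, "B"), (3, "c"), (4, "d")]
repairs = differential_refresh(source, target, key=lambda row: row[0])
```

With these sample rows, the result is an INSERT for key 1, an UPDATE for key 2, and a DELETE for key 3; key 4 matches and needs no repair. Because each side is read once in key order, neither table has to be held in memory. This is one reason a sorted comparison suits large tables.
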
