GoldenGate Performance Tuning
Tips & Techniques
Gavin Soorma
Agenda
• What is Lag and what can contribute to lag in a GoldenGate replication environment
• Compare Classic Extracts and Replicats with Integrated Extracts and Replicats
• New performance tuning challenges introduced by the Log Mining Server component
• What tools do we have available in OGG 12.2 to monitor performance
• Using those tools to examine and investigate a real-life performance problem and how the problem was resolved
Oracle GoldenGate Architecture
Where is the problem?
Is the problem because of a GoldenGate component?
• Extract reading the archive log and writing the data to a trail (or remote host)
• Datapump reading the extract trail and writing to a remote host
• Network
• Collector (server.exe) on the target receiving network data and writing it to a local trail
• Replicat reading the local trail and writing to the database
• Log Mining Server issues – on both the source and the target
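A quick way to narrow this down is to check lag per process from GGSCI: INFO ALL shows lag at checkpoint and time since checkpoint for every group, while LAG and SEND ... GETLAG query an individual process. A minimal sketch (group names are illustrative):
GGSCI> INFO ALL
GGSCI> LAG EXTRACT ext1
GGSCI> LAG EXTRACT pmp1
GGSCI> LAG REPLICAT rep1
GGSCI> SEND REPLICAT rep1, GETLAG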
Measuring OGG Performance
• Typically a GoldenGate performance problem is centered around Lag
• LAG is the elapsed time between when a transaction is committed and written to a storage medium such as an archive log or redo log on the source and the time when Replicat writes the same transaction to the target database
Classic Extract
Integrated Extract
Logmining Server
• Reader: Reads the logfile and splits it into regions
• Preparer: Scans regions of logfiles and prefilters based on extract parameters
• Builder: Merges prepared records in SCN order
• Capture: Formats Logical Change Records (LCRs) and passes them to GoldenGate Extract
Extract
• Requests LCRs from the logmining server
• Performs Mapping and Transformations
• Writes Trail File
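For reference, a minimal sketch of creating and registering an Integrated Extract from GGSCI (group name, credential alias and trail path are illustrative):
GGSCI> DBLOGIN USERIDALIAS gg_admin
GGSCI> REGISTER EXTRACT exti DATABASE
GGSCI> ADD EXTRACT exti, INTEGRATED TRANLOG, BEGIN NOW
GGSCI> ADD EXTTRAIL ./dirdat/lt, EXTRACT exti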
Classic Replicat
Integrated Replicat
Replicat
• Reads the trail file
• Constructs logical change records (LCRs)
• Transmits LCRs to Oracle Database via the Lightweight Streaming API
Inbound Server (Database Apply Process)
• Receiver: Reads LCRs
• Preparer: Computes the dependencies between transactions (primary key, unique indexes, foreign key), grouping transactions and sorting them in dependency order
• Coordinator: Coordinates transactions and maintains the order between applier processes
• Applier: Performs changes for assigned transactions, including conflict detection and error handling
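Similarly, a minimal sketch of creating an Integrated Replicat (group name, credential alias and trail path are illustrative):
GGSCI> DBLOGIN USERIDALIAS gg_admin
GGSCI> ADD REPLICAT repi, INTEGRATED, EXTTRAIL ./dirdat/rt
GGSCI> START REPLICAT repi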
Do we still use Classic Extracts and Replicats?
• Is there any reason why we are not using BOTH Integrated Extracts and Integrated Replicats?
• Do we have source/target Oracle databases on versions less than 11.2.0.3 or 11.2.0.4?
• Consider Downstream Capture if Integrated Extract is not allowed on the source because it is ‘invasive’
• Do we use RAC, ASM, TDE?
• Do we want RMAN integration with Oracle GoldenGate?
A case for Integrated Replicat
• Integrated Replicat offers automatic parallelism which automatically increases or decreases the number of apply processes based on the current workload and database performance
• Co-ordinated Replicat provides multiple threads, but dependent objects have to be handled by the same Replicat thread – otherwise the Replicat will abend
• Integrated Replicat ensures referential integrity, and DDL/DML operations are automatically applied in the correct order
• Management and tuning of Replicat performance is simplified since you do not have to manually configure multiple Replicat processes to distribute the tables between them
• Tests have shown that a single Integrated Replicat can out-perform multiple Classic Replicats as well as a multi-threaded Co-ordinated Replicat
Tune the database before tuning GoldenGate!
• Does the target database already have I/O issues?
• Are the redo logs properly configured – size and location?
• Data replication is I/O intensive, so fast disks are important, particularly for the online redo logs
• Redo logs are constantly being written to by the database as well as being read by GoldenGate Extract processes
• Do we have any significant ‘Log File Sync’ wait events?
• Also consider the effect of adding supplemental logging, which will increase redo generation
Key Points
• Identify and isolate tables with significantly high DML activity
• Separate Extract and Replicat process groups for such tables
• Dedicated Extract and Replicat process groups for tables with LOB columns
• Possibly dedicated process groups for tables with long running transactions
• Run the Oracle GoldenGate database Schema Profile check script to identify tables with missing PKs/UKs/Deferred Constraints/NOLOGGING/Compression
• Start with a single Replicat process (as well as Extract process)
• Add Replicat processes until latency is acceptable (Classic) – for example by splitting tables across groups with @RANGE, as sketched below
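A minimal sketch of splitting the rows of a busy table across three Classic Replicat groups with the @RANGE filter (schema and table names are illustrative); each group gets the same MAP statement with a different range number:
-- Replicat group 1 of 3
MAP app.orders, TARGET app.orders, FILTER (@RANGE (1, 3));
-- Replicat group 2 of 3
MAP app.orders, TARGET app.orders, FILTER (@RANGE (2, 3));
-- Replicat group 3 of 3
MAP app.orders, TARGET app.orders, FILTER (@RANGE (3, 3));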
Key Points
• In its classic mode, the Replicat process can be a source of performance bottlenecks because it is a single-threaded process that applies operations one at a time by using regular SQL
• Consider BATCHSQL to increase performance of Replicat, particularly in OLTP type environments characterized by smaller row changes in terms of data
• BATCHSQL causes Replicat to organize similar SQL statements into arrays which leads to faster processing as opposed to serial apply of SQL statements
• If tables can be separated based on PK/FK relationships consider Co-ordinated Replicats with multiple threads
• For Integrated Replicats check the parameters PARALLELISM, MAX_PARALLELISM, COMMIT_SERIALIZATION, EAGER_SIZE (see the parameter file sketch below)
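A sketch of how these options might appear in an Integrated Replicat parameter file (group name, credential alias, mappings and values are illustrative, not recommendations):
REPLICAT repi
USERIDALIAS gg_admin
BATCHSQL
DBOPTIONS INTEGRATEDPARAMS (PARALLELISM 4, MAX_PARALLELISM 8, COMMIT_SERIALIZATION DEPENDENT_TRANSACTIONS, EAGER_SIZE 15100)
MAP app.*, TARGET app.*;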
Tune the Network for OGG
• The network is an important component in GoldenGate replication
• The two RMTHOST parameters, TCPBUFSIZE and TCPFLUSHBYTES, are very useful for increasing the buffer sizes and network packets sent by Data Pump over the network from the source to the target system
• This is especially beneficial for high latency networks
• Use Data Pump compression if network bandwidth is constrained and CPU headroom is available (see the example below)
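For example, in the Data Pump parameter file (host name, trail path and buffer values are illustrative – the buffer sizes should be derived from the bandwidth-delay product of your network):
EXTRACT pmp1
RMTHOST target-host, MGRPORT 7809, TCPBUFSIZE 10000000, TCPFLUSHBYTES 10000000, COMPRESS
RMTTRAIL ./dirdat/rt
PASSTHRU
TABLE app.*;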
Tuning the Network - Before
GGSCI (ti-p1-bscs-db-01) 1> send pbsprd2 gettcpstats
Sending GETTCPSTATS request to EXTRACT PBSPRD2 ...
RMTTRAIL ./dirdat/rt000113, RBA 38351713
Buffer Size         2266875
Flush Size          2266875
SND Size            2097152
Streaming           Yes
Inbound Msgs        2710        Bytes 54259,        3 bytes/second
Outbound Msgs       20541       Bytes 13539482811,  795925 bytes/second
Recvs               5420
Sends               20541
Avg bytes per recv  10,         per msg 20
Avg bytes per send  659144,     per msg 659144
Recv Wait Time      1558113382, per msg 574949,     per recv 287474
Send Wait Time      7514461569, per msg 365827,     per send 365827
Tuning the Network - After
GGSCI (pl-p1-bscs-db-01) 12> send pbsprd1 gettcpstats
Sending GETTCPSTATS request to EXTRACT PBSPRD1 ...
RMTTRAIL ./dirdat/rt000000, RBA 98558417
Buffer Size         200000000
Flush Size          200000000
SND Size            134217728
Streaming           Yes
Inbound Msgs        258         Bytes 4746,         1 bytes/second
Outbound Msgs       2402        Bytes 98675058,     37893 bytes/second
Recvs               516
Sends               2402
Avg bytes per recv  9,          per msg 18
Avg bytes per send  41080,      per msg 41080
Recv Wait Time      63143512,   per msg 244742,     per recv 122371
Send Wait Time      486941,     per msg 202,        per send 202
Compare it with the earlier figures
Recv Wait Time      1558113382, per msg 574949,     per recv 287474
Send Wait Time      7514461569, per msg 365827,     per send 365827
Allocate memory for the Log Mining Server
• Set the STREAMS_POOL_SIZE initialization parameter for the database
• Set the MAX_SGA_SIZE parameter for both Integrated Extracts and Integrated
Replicats
• Controls amount of memory used by logmining server – default is 1 GB
• STREAMS_POOL_SIZE = (MAX_SGA_SIZE * PARALLELISM) + 25% headroom
For example, using the default values for the MAX_SGA_SIZE and PARALLELISM parameters:
(1GB * 2) * 1.25 = 2.5GB
STREAMS_POOL_SIZE = 2560M
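A sketch of applying this sizing (sizes and parallelism are illustrative; MAX_SGA_SIZE and PARALLELISM are passed to the logmining server through the Extract parameter file):
-- Database (SQL*Plus):
SQL> alter system set streams_pool_size=2560M scope=both sid='*';
-- Integrated Extract parameter file:
TRANLOGOPTIONS INTEGRATEDPARAMS (MAX_SGA_SIZE 1024, PARALLELISM 2)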
Allocate memory for the Log Mining Server
• Log mining Server is running on both source as well as target
• STREAMS_POOL_SIZE needs to be properly sized on IE as well as IR end
SQL> SELECT state FROM GV$GG_APPLY_RECEIVER;

STATE
--------------------------------------------------
Waiting for memory

SQL> show parameter streams

NAME                TYPE        VALUE
------------------- ----------- ------------------
streams_pool_size   big integer 2G

SQL> alter system set streams_pool_size=24G sid='bsprd1' scope=both;

System altered.

SQL> SELECT state FROM GV$GG_APPLY_RECEIVER;

STATE
--------------------------------------------------
Enqueueing LCRS
• Typically a GoldenGate performance problem is centered around Lag – tools available to measure it:
• Automatic Heartbeat Tables
• GGSCI LAG, REPORT RATE
• AWR reports now have a section for GoldenGate
• Use ASH and ASH Analytics to diagnose an OGG performance problem
Automatic Heartbeat Table (NEW in OGG 12.2)
• Heartbeat Tables were recommended but involved a fair bit of work to set up and configure
• Single 12.2 command – ADD HEARTBEATTABLE
• Record End-to-End Replication Lag in Tables
• Creates database level tables, views and jobs
• GG_LAG view – INCOMING_LAG, OUTGOING_LAG for bi-directional replication
• GG_LAG_HISTORY – retains historical lag information until purged
Automatic Heartbeat Table
• GG_LAG
• GG_LAG_HISTORY
• GG_HEARTBEAT
• GG_HEARTBEAT_HISTORY
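A minimal sketch of setting this up and querying the lag views (the credential alias and GoldenGate schema name are illustrative; the INCOMING_LAG and OUTGOING_LAG columns are described above):
GGSCI> DBLOGIN USERIDALIAS gg_admin
GGSCI> ADD HEARTBEATTABLE
SQL> SELECT * FROM ggadmin.gg_lag;
SQL> SELECT * FROM ggadmin.gg_lag_history;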
How much is the lag?
Which process is responsible for the lag?
OGG 12.2
https://java.net/projects/oracledi/downloads/download/GoldenGate/OGGPTK.jar
Fine-grained performance monitoring window which can be accessed through the RESTful Web Services
Integrated Extract/Replicat Health Check
• GoldenGate Integrated Capture and Integrated Replicat Healthcheck Script (Doc ID 1448324.1)
• Available for both Oracle 12c as well as 11g (> 11.2.0.3)
• Script generated in HTML format
• Unlike an AWR report, the report does not cover a period of time but is an as-is snapshot – so run it when performance is at its worst!
SQL> spool /tmp/ogg_perf.html
SQL> @icrhc_11204.sql
-- Output will appear
SQL> exit
Integrated Extract/Replicat Health Check
• Comprehensive point-in-time snapshot of the Database as well as individual components of Integrated Extract and Integrated Replicat.
• Database Configuration – Key init.ora parameters like STREAMS_POOL_SIZE
• Wait Event Analysis – Identify root cause of slow extracts/replicats
• Extract and Replicat Configuration – Parameters used
• Extract and Replicat Statistics – identify tables with most DML activity
Streams Performance Advisor Package
• Has been around since Oracle Streams days
• Also known as SPADV
• Install the UTL_SPADV package
• The UTL_SPADV PL/SQL package provides subprograms to collect and analyze statistics for the LogMiner server processes.
• The statistics help identify any current areas of contention such as CPU or I/O.
@$ORACLE_HOME/rdbms/admin/utlspadv.sql
SPADV
• Gather statistics for a 30-60 minute time period during which you are troubleshooting performance.
• Also gather statistics during a 30-60 minute time period where performance is good, to serve as a baseline for comparison.
• To gather statistics every 60 seconds, run the following SQL*Plus command as the Oracle GoldenGate administrator:
SQL> exec UTL_SPADV.START_MONITORING(interval=>60);
• To stop statistics gathering, run the following command:
SQL> exec UTL_SPADV.STOP_MONITORING;
• To view SPADV statistics:
SQL> set serveroutput on size 50000
SQL> exec utl_spadv.show_stats;
Interpreting SPADV Output
• PARALLELISM changed from the EE default value of 2 to 1
• LMP is the Log Miner Preparer process
• CPU utilization has gone down from 100% to 70% (140%/2)
• Extract throughput has gone up from 129851 messages processed to 169361
Performance Tuning Real-life Example
• Active-Active Bi-Directional Replication
• 20 GB redo generation per hour
• 18 million Logical Change Records per hour
• Batch job on source loading 100000 customer records took ~10 minutes
• Replication on the target took over 30 minutes
• SLA: < 5 minutes lag
Initial Investigation Conclusions
• Integrated Replicat issues
• Not constrained by CPU
• Not constrained by Trail File I/O
• Disabled FKs and tested with Co-ordinated Replicat
• Performance was good – so that ruled out the network or the Extract side of things
• Possibly due to the Integrated Apply processes
• Apply Reader
• Apply Coordinator
• Apply Server(s)
ASH Analytics (screenshots)
Let’s look at some SPADV output
PATH 4 RUN_ID 78 RUN_TIME 2015-SEP-25 00:13:14 CCA Y
|<R> RBSPRD2 3737 1371119 0 1.7% 93.3% 3.3% "" |<Q> "OGGSUSER"."OGGQ$RBSPRD2" 3737 0.01 4494 |<A> OGG$RBSPRD2 3734 484 -1 APR 1.7% 95% 3.3% "" APC 98.3% 0% 1.7% "" APS (6) 198.3% 0% 191.7% "REPL Apply: dependency" |<B> OGG$RBSPRD2 APS 6209 7869 53.3% "REPL Apply: dependency"

PATH 4 RUN_ID 79 RUN_TIME 2015-SEP-25 00:14:14 CCA Y
|<R> RBSPRD2 4141 1517685 0 1.7% 90% 6.7% "" |<Q> "OGGSUSER"."OGGQ$RBSPRD2" 4141 0.01 5001 |<A> OGG$RBSPRD2 4161 570 -1 APR 1.7% 93.3% 5% "" APC 96.7% 0% 3.3% "" APS (6) 190% 0% 195% "REPL Apply: dependency" |<B> OGG$RBSPRD2 APS 22142 10596 38.3% "REPL Apply: dependency"

PATH 4 RUN_ID 80 RUN_TIME 2015-SEP-25 00:15:14 CCA Y
|<R> RBSPRD2 4234 1569723 0 3.3% 88.3% 8.3% "" |<Q> "OGGSUSER"."OGGQ$RBSPRD2" 4244 0.01 5001 |<A> OGG$RBSPRD2 4233 549 -1 APR 3.3% 90% 6.7% "" APC 95% 0% 5% "" APS (6) 198.3% 0% 210% "REPL Apply: dependency" |<B> OGG$RBSPRD2 APS 19183 24681 55.% "REPL Apply: dependency"
View the Integrated Health Check Report
We have a problem …
    APPLY#  SERVER_ID STATE                TOTAL_MESSAGES_APPLIED
---------- ---------- -------------------- ----------------------
         5          9 WAIT DEPENDENCY                      261519
         5         10 WAIT DEPENDENCY                      139849
         5          1 WAIT DEPENDENCY                      281381
         5          2 WAIT DEPENDENCY                      203907
         5          3 WAIT DEPENDENCY                      278303
         5          4 WAIT DEPENDENCY                      296481
         5          5 EXECUTE TRANSACTION                  222312
         5          6 WAIT DEPENDENCY                      292009
         5          7 INACTIVE                             202222
         5          8 INACTIVE                             111042
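The listing above is the per-apply-server view of the Integrated Replicat; a sketch of the kind of query that produces it, assuming the GV$GG_APPLY_SERVER view and the column names shown in the headings:
SQL> SELECT apply#, server_id, state, total_messages_applied FROM gv$gg_apply_server ORDER BY server_id;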
• At any given time we see only one Apply Server executing transactions
• The rest are all in the WAIT DEPENDENCY state
• When the Apply Server currently executing a transaction completes, one of the waiting servers starts executing transactions
• This relates to the ASH Analytics investigation which showed the main wait event as REPL Apply: Dependency
Get additional information from AWR Report
Do we have a ‘big’ transaction?
Large transactions and EAGER_SIZE
• GoldenGate considers a transaction to be large if it changes more than 15100 rows in a table (this threshold changed in version 12.2; it was 9500 in earlier versions)
• An important parameter, EAGER_SIZE, controls how GoldenGate applies these “large” transactions
• It sets a threshold for the size of a transaction (in number of LCRs) after which Oracle GoldenGate starts applying data before the commit record is received
• In essence, when Oracle GoldenGate sees a large number of LCRs in a transaction, it either starts applying them straight away (which is presumably where the “eager” part of the parameter name comes from) or waits for the entire transaction to be committed before applying changes
• This waiting serializes the apply process and adds to the apply lag on the target in a big way
View the Integrated Health Check Report
Note the Transaction ID of the transaction being executed by the only apply server in the EXECUTE TRANSACTION state
AS05: 83.19.44854
Transaction 8.17.18382 is waiting on 95.3.40904 to complete
Transactions 29.25.246732, 89.2.45500 and 95.3.40904 are waiting on 109.24.24253
Transaction 109.24.24253 is waiting on 46.13.28116
Transaction 46.13.28116 is waiting on 105.27.24651
Transaction 105.27.24651 is waiting on 83.19.44854 which is the only transaction currently executing
Now that’s better!
    APPLY#  SERVER_ID STATE                TOTAL_MESSAGES_APPLIED
---------- ---------- -------------------- ----------------------
         5          9 EXECUTE TRANSACTION                  272374
         5         10 EXECUTE TRANSACTION                  150630
         5          1 EXECUTE TRANSACTION                  292175
         5          2 EXECUTE TRANSACTION                  225412
         5          3 EXECUTE TRANSACTION                  289161
         5          4 EXECUTE TRANSACTION                  317736
         5          5 EXECUTE TRANSACTION                  240507
         5          6 EXECUTE TRANSACTION                  302893
         5          7 INACTIVE                             202222
         5          8 INACTIVE                             111042
The fix – raise EAGER_SIZE in the Replicat parameter file:
DBOPTIONS INTEGRATEDPARAMS (eager_size 25000)
To Wrap Up …..
• Replication of ‘batch’ type transactions needs special considerations as opposed to replication of ‘oltp’ type transactions
• A GoldenGate performance problem is not always related to GoldenGate
• Tune the database, operating system and network first
• Using Integrated Extracts and Replicats adds an additional log mining server component which presents its own separate tuning challenges
• Consider all the performance tuning tools and options available
Thanks for attending!
http://gavinsoorma.com
[email protected]