Emil Pilecki
Credit: Luca Canali, Marcin Blaszczyk, Steffen Pade

Agenda
• About CERN
• Oracle and Data Guard at CERN
• DG perks and benefits
• Zero data loss over long distances (Far Sync)
• Far Sync testing results

About CERN
• European Organization for Nuclear Research, founded in 1954
• 21 member states, 2 candidates, 6 observers + UNESCO and the EU
• 60 non-member states collaborate with CERN
• 2500 staff members and 10 000 scientists

LHC and Experiments
• Large Hadron Collider (LHC): a particle accelerator that collides beams at very high energy
• 27 km long circular tunnel, located ~100 m underground
• Protons travel at 99.9999972% of the speed of light
• Collisions are analysed with special detectors and software in the experiments dedicated to the LHC
• New particle discovered, consistent with the Higgs boson; announced on July 4th, 2012

Oracle at CERN
• In use since 1982, starting with version 2.3
• Oracle DBs play a key role in the LHC production chains
  • Accelerator logging and monitoring systems
  • Online acquisition, offline data (re)processing, data distribution, analysis
• Grid infrastructure and operation services
  • Monitoring, dashboards, etc.
• Data management services
  • File catalogues, file transfers, etc.
  • Metadata and transaction processing for the tape storage system
• Administrative services

CERN's Databases
• Over 100 Oracle databases, mostly RAC
• NAS storage, plus some SAN with ASM
• ~400 TB of data files for production DBs
• Examples of CERN's critical DBs:
  • LHC logging database: ~170 TB, expected growth of up to 70 TB / year
  • 13 production experiments' databases: ~140 TB in total
• 15 production systems protected with Data Guard
• Active Data Guard in use since 11g

Our Data Guard Architecture
[Figure: two maximum-performance topologies with redo transport from the primary database: 1. low-load ADG, one Active Data Guard standby serving both read-only workloads and disaster recovery; 2. busy & critical ADG, separate Active Data Guard standbys for read-only workloads and for disaster recovery]

LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL ASYNC NOAFFIRM
  VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>'

(Active) Data Guard Benefits
Features and functionalities we profit from:
• Data protection for disaster recovery
• Replication and offloading of read-only workloads
• Database backups from the standby
• Safeguard against logical data corruptions with Flashback
• Snapshot standby for testing
• Fast upgrades and hardware migrations
• Detection of lost writes
• Automatic block media recovery

Disaster Recovery
• In use for several years now
• Switchover/failover is our first line of defence; it has already saved the day for production services
• Current disaster recovery site is 10 km away from our main data centre
• Remote site in Hungary to be used soon
  • Over 1000 km away; network latency of 25 ms is a challenge
  • Plan to move most of the standby databases there within 1 year

Offloading Production Databases
• Efficient replication of the whole database
• Workload distribution
  • Transactional workload runs on the primary
  • Read-only workload can be moved to ADG
  • Read-mostly workload: DMLs can be redirected to the primary over a database link
• Database backups from the standby significantly reduce load on the primary
  • Remove the sequential I/O of full backups from the primary
  • ADG allows block change tracking for fast incremental backups

Flashback and Snapshot Standby
• Flashback enabled on the standby only
  • Recover from human errors and data corruptions
  • Avoid impacting the primary database with flashback log generation
• Snapshot standby
  • Test changes before implementing them on the primary
  • Safe: redo is still being received by the standby
  • Very easy to use:

SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
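The Data Guard based migration described in the following slides hinges on a switchover to a standby built on the new system. A minimal sketch of that step, assuming the new cluster initially runs the same 11g RDBMS as the old primary and its standby is fully synchronised (the real procedure also covers services, listeners and Clusterware resources):

SQL> -- on the old primary: convert it to a physical standby
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
SQL> -- on the standby running on the new cluster: take over the primary role
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
SQL> ALTER DATABASE OPEN;

After the switchover, the RDBMS home on the new cluster can be upgraded in place, while the old system remains available for fallback.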
Fast Upgrades and Migrations
[Figure: six-step migration flow, redo transport from the old Clusterware 11g + RDBMS 11g system to the new Clusterware 12c + RDBMS 12c system; database downtime only during the RDBMS upgrade step]
• Risk mitigation
  • Fresh installation of the new Clusterware
  • The old system stays untouched
  • Allows a full upgrade test
  • Allows stress testing of the new system
• Downtime reduction: ~1 h for the RDBMS upgrade
• Additional hardware is required, unless migration to new hardware is expected anyway

Lost Write Detection and ABMR
Alert log excerpt, recovery slave exiting with an ORA-752 exception:

Errors in file /ORA/dbs0a/PDBR_RAC50/diag/rdbms/pdbr_rac50/PDBR1/trace/PDBR1_pr0l_92600.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 67, block# 57976209, file offset is 2494701568 bytes)
ORA-10564: tablespace STRMMON
ORA-01110: data file 67: '/ORA/dbs03/PDBR_RAC50/datafile/STRMMON_67.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 435213427
Mon Apr 14 06:52:02 2014
Recovery Slave PR0L previously exited with exception 752

• Redo apply stops when a lost write is detected
• The previous consistent block version is still available on the standby
  • Helps to diagnose and repair the error
• Automatic Block Media Recovery (ABMR) with ADG
  • Fixes physical block corruptions
  • Works both ways: primary and ADG can repair each other

Zero Data Loss Replication
• Use the synchronous redo transport method:

LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL SYNC AFFIRM
  VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>'

• DML statements are slowed down, because each commit waits for an acknowledgement from the standby
• Network latency matters!

Far Sync Concepts
• Long distances = high network latency = slow commit acknowledgement with SYNC redo transport
• A Far Sync instance (control file and redo logs only, no data files), placed close to the primary, receives redo synchronously and forwards it to the distant standby, so commits are acknowledged at local latency without giving up zero data loss
[Figure: primary database, redo transport over a 25 ms link, standby]

Far Sync Testing at CERN
[Figure: primary, synchronous redo to a nearby Far Sync instance, then redo transport over the 25 ms link to the standby]
• Functional tests: does it work? Are there any bugs?
• Performance tests
  • Simulated heavy DML workload, with and without Far Sync
  • Oracle Real Application Testing: workload captured from production databases

Far Sync Testing Results
• Functional tests: it works well, but…
  • SR 3-7013523981: FRA is not cleaned up automatically on the Far Sync instance
  • SR 3-7023772221: failover to an alternate destination does not work with Far Sync
  • Both bugs are still present in the 12.1.0.1 production release
  • Some configuration issues with Data Guard Broker
• Performance tests with a simulated heavy DML workload
[Chart: runtime comparison, primary runtime and apply runtime in minutes, for five configurations: ASYNC without Far Sync, SYNC without Far Sync, SYNC with Far Sync, SYNC without Far Sync at 25 ms latency, SYNC with Far Sync at 25 ms latency]
  • Workload: 256 parallel sessions inserting data in 500-row batches, 50 batches per session
  • The target table is partitioned and indexed: 4 local B-tree indexes, 6 local bitmap indexes, and a global primary key index with reversed keys
  • Each session inserts data into its own partition
• Performance tests with the Oracle Real Application Testing framework
  • Real production workload, captured per schema
  • Workload replayed with and without Far Sync at 25 ms network latency
  • Replay parameters: connect_time_scale=0, think_time_scale=0
  • CMSR: DML-mostly workload; LCGR: read-only workload
[Chart: production workload replay durations in seconds: CMSR 262.7 without Far Sync vs 123.7 with Far Sync; LCGR 88.3 without vs 91.0 with]

Far Sync Summary
• Very promising for long-distance replication when data loss is not acceptable
• Up to 60% performance gain (DML-only workloads) with 25 ms network latency
• Lightweight and easy to deploy (can run on a virtual machine)
• If latency is below 5 ms, you most likely do not need Far Sync
• There are still bugs that need fixing

Discussion
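As a concrete illustration of the tested topology, a minimal 12c Far Sync configuration might look as follows. This is a sketch, not the exact setup used in the tests; the names fs1 and stby are illustrative, and the Data Guard Broker equivalent is not shown:

SQL> -- on the primary: generate a control file for the far sync instance
SQL> ALTER DATABASE CREATE FAR SYNC INSTANCE CONTROLFILE AS '/tmp/fs1.ctl';

-- on the primary: ship redo synchronously to the nearby far sync instance
LOG_ARCHIVE_DEST_2='SERVICE=fs1 SYNC AFFIRM
  VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=fs1'

-- on the far sync instance: forward redo asynchronously over the 25 ms link
LOG_ARCHIVE_DEST_2='SERVICE=stby ASYNC
  VALID_FOR=(STANDBY_LOGFILES,STANDBY_ROLE) DB_UNIQUE_NAME=stby'

With this chain the primary receives its commit acknowledgement at local latency, while the far sync instance ensures that acknowledged redo survives the loss of the primary site.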