Active Data Guard at CERN

Emil Pilecki
Credit: Luca Canali, Marcin Blaszczyk, Steffen Pade
Agenda
• About CERN
• Oracle and Data Guard at CERN
• DG perks and benefits
• Zero data loss over long distances (far sync)
• Far sync testing results
About CERN
• European Organization for Nuclear Research founded in 1954
• 21 member states, 2 candidates, 6 observers + UNESCO and EU
• 60 Non-member States collaborate with CERN
• 2500 staff members and 10 000 scientists
LHC and Experiments
• Large Hadron Collider (LHC) – particle accelerator that collides beams at very high energy
• 27 km long circular tunnel
• Located ~100 m underground
• Protons travel at 99.9999972% of the speed of light
• Collisions are analysed with special detectors and software in the experiments dedicated to the LHC
• New particle discovered!
• Consistent with the Higgs Boson
• Announced on July 4th 2012
Oracle at CERN
• Since 1982, version 2.3
• Oracle DBs play a key role in the LHC production chains
• Accelerator logging and monitoring systems
• Online acquisition, offline data (re)processing, data distribution, analysis
• Grid infrastructure and operation services
• Monitoring, dashboards, etc.
• Data management services
• File catalogues, file transfers, etc.
• Metadata and transaction processing for tape storage system
• Administrative services
CERN’s Databases
• Over 100 Oracle databases, mostly RAC
• NAS storage plus some SAN with ASM
• ~400 TB of data files for production DBs
• Examples of CERN’s critical DBs:
• LHC logging database ~170 TB, expected growth up to 70 TB / year
• 13 production experiments’ databases, ~140 TB in total
• 15 production systems protected with Data Guard
• Active Data Guard since 11g
Our Data Guard architecture
[Diagram: two Data Guard configurations, both in Maximum performance mode, with Redo Transport from the Primary Database]
1. Low load ADG: one Active Data Guard standby for read only workloads and disaster recovery
2. Busy & critical ADG: one Active Data Guard standby for read only workloads, plus a second Active Data Guard standby for disaster recovery
LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL
ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
DB_UNIQUE_NAME=<standby_db_unique_name>'
(Active) Data Guard benefits
Features and functionalities we profit from:
• Data protection for disaster recovery
• Replication and offloading of read-only workloads
• Database backups from standby
• Safeguard against logical data corruption with flashback
• Snapshot standby for testing
• Fast upgrades and hardware migrations
• Detection of lost writes
• Automatic block media recovery
Disaster recovery
• We have been using it for a few years
• Switchover/failover is our first line of defence
• Saved the day already for production services
• Current disaster recovery site is 10 km away from our main datacentre
• Remote site in Hungary to be used soon
• Over 1000 km distance
• Network latency of 25 ms is a challenge
• Plan to move most of the standby databases there within 1 year
Offloading production databases
• Efficient replication of the whole database
• Workload distribution
• Transactional workload runs on primary
• Read-only workload can be moved to ADG
• Read-mostly workload: DMLs can be redirected to primary with a dblink
• Database backups from standby
• Significantly reduces load on primary
• Removes sequential I/O of full backup from primary
• ADG allows usage of block change tracking for fast incremental backups
Flashback and snapshot standby
• Flashback enabled on standby only
• Recover from human errors and data corruptions
• Avoids impacting the primary database with flashback log generation
• Snapshot standby
• Testing changes before implementing them on primary
• Safe – redo is still sent to standby
• Very easy to use
SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
Fast upgrades and migrations
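[Diagram: six numbered steps of an upgrade through a standby. The existing system runs Clusterware 11g + RDBMS 11g and sends redo (Redo Transport) to a new system on Clusterware 12c + RDBMS 11g. After the DATABASE downtime for the RDBMS upgrade, the new system runs Clusterware 12c + RDBMS 12c. Upgrade complete!]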
Fast upgrades and migrations
• Risk mitigation
• Fresh installation of the new clusterware
• Old system stays untouched
• Allows full upgrade test
• Allows stress testing of new system
• Downtime reduction
• ~1h for RDBMS upgrade
• Additional hardware is required unless a migration to new hardware is planned anyway
Lost write detection and ABMR
Slave exiting with ORA-752 exception
Errors in file /ORA/dbs0a/PDBR_RAC50/diag/rdbms/pdbr_rac50/PDBR1/trace/PDBR1_pr0l_92600.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 67, block# 57976209, file offset is 2494701568 bytes)
ORA-10564: tablespace STRMMON
ORA-01110: data file 67: '/ORA/dbs03/PDBR_RAC50/datafile/STRMMON_67.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 435213427
Mon Apr 14 06:52:02 2014 Recovery Slave PR0L previously exited with exception 752
• Stops redo application when a lost write is detected
• Previous consistent block version still on standby
• Helps to diagnose and repair the error
• Automatic Block Media Recovery with ADG
• Fixes physical block corruptions
• Works both ways: Primary ↔ ADG
Zero data loss replication
• Use synchronous redo transport method
LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL
SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
DB_UNIQUE_NAME=<standby_db_unique_name>'
• DML statements are impacted because every commit waits for an acknowledgment from the standby
[Diagram: the Primary Database sends redo (Redo Transport) to the Data Guard Standby and waits for the Commit Ack. Network latency matters!!!]
Far Sync concepts
Long distances = high network latency = slow commit acknowledgment with SYNC redo transport
[Diagram: Redo Transport over a link with 25 ms latency]
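The relationship on this slide can be made concrete with a little arithmetic. The sketch below is a rough model, not a measurement: it assumes the 25 ms figure is the full network round trip, that a session commits one transaction at a time, and that a Far Sync instance sits close to the primary with a hypothetical 1 ms round trip. With Far Sync, the primary ships redo synchronously only to that nearby instance, which forwards it asynchronously to the remote standby.

```python
def commit_ack_ms(round_trip_ms: float, standby_write_ms: float = 0.0) -> float:
    """Lower bound on how long one commit waits under SYNC redo transport."""
    return round_trip_ms + standby_write_ms


def max_commits_per_second(ack_ms: float) -> float:
    """Ceiling for a single session that commits serially."""
    return 1000.0 / ack_ms


# 25 ms comes from the slides; treating it as a round trip is an assumption.
direct_ack = commit_ack_ms(25.0)
# 1 ms to a nearby Far Sync instance is a made-up illustrative value.
far_sync_ack = commit_ack_ms(1.0)

direct_rate = max_commits_per_second(direct_ack)
far_sync_rate = max_commits_per_second(far_sync_ack)

print(f"SYNC straight to remote standby: {direct_ack:.0f} ms per commit, "
      f"at most {direct_rate:.0f} commits/s per session")
print(f"SYNC to nearby Far Sync:         {far_sync_ack:.0f} ms per commit, "
      f"at most {far_sync_rate:.0f} commits/s per session")
```

Real commit times also include local redo writes and the standby's own disk write, so these numbers are ceilings set by the network alone.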
Far Sync testing at CERN
[Diagram: redo path through a Far Sync instance; Redo Transport link with 25 ms latency]
• Functional
• Does it work? Are there any bugs?
• Performance
• Simulated heavy DML workload with and without Far Sync
• Oracle Real Application Testing – workload captured from production databases
Far Sync testing results
• Functional tests
• It works well!!! but…
• 3-7013523981: FRA not cleaned up automatically on FAR SYNC instance
• 3-7023772221: Failover to alternate destination does not work with FAR SYNC
• Both bugs still present in 12.1.0.1 production
• Some configuration issues with Data Guard Broker
Far Sync testing results
• Performance tests with simulated heavy DML workload
[Figure: Runtime Comparison. Y-axis: Runtime (min). Series: Primary Runtime, Apply Runtime. Configurations: ASYNC no FAR SYNC; SYNC no FAR SYNC, 25 ms latency; SYNC with FAR SYNC, 25 ms latency.]
256 parallel sessions inserting data in 500 row batches, 50 batches per session.
The target table is partitioned and indexed: 4 local b-tree indexes, 6 local bitmap indexes, and a global primary key index with reversed keys.
Each session inserts data into its own partition.
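For a sense of scale, the figures in this description multiply out as follows. This is plain arithmetic on the numbers above; the one-index-entry-per-row-per-index count is a simplification of mine (bitmap indexes in particular do not store one entry per row), so treat it as an upper-level illustration only.

```python
SESSIONS = 256            # parallel sessions (from the slides)
BATCHES_PER_SESSION = 50  # batches per session (from the slides)
ROWS_PER_BATCH = 500      # rows per batch (from the slides)

# 4 local b-tree + 6 local bitmap + 1 global primary key index (from the slides)
INDEXES = 4 + 6 + 1

total_batches = SESSIONS * BATCHES_PER_SESSION
total_rows = total_batches * ROWS_PER_BATCH
# Each session writes to its own partition, so this is also rows per partition.
rows_per_partition = BATCHES_PER_SESSION * ROWS_PER_BATCH

# Simplifying assumption: every inserted row touches every index once.
index_updates = total_rows * INDEXES

print(f"{total_rows:,} rows in {total_batches:,} batches")
print(f"{rows_per_partition:,} rows per partition")
print(f"{index_updates:,} index updates across {INDEXES} indexes")
```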
Far Sync testing results
• Performance tests with Oracle Real Application Testing framework
• Real production workload captured per schema
• Workload replay with and without Far Sync, 25 ms latency
Replay parameters:
connect_time_scale=0
think_time_scale=0
CMSR – DML mostly workload
LCGR – read only workload
[Figure: Production Workloads. Y-axis: Run duration (sec). Series: Runtimes w/o FarSync, Runtimes w/ FarSync.]
CMSR: 262.7 s w/o FarSync, 123.7 s w/ FarSync
LCGR: 88.3 s w/o FarSync, 91.0 s w/ FarSync
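The replay figures can be turned into relative changes with a few lines of arithmetic. The sketch assumes that, for each workload, the first value read off the chart is the run without Far Sync and the second is the run with it; the extracted chart does not make that pairing explicit.

```python
def runtime_reduction_pct(without_far_sync: float, with_far_sync: float) -> float:
    """Percentage by which the run got shorter; negative means it got longer."""
    return (without_far_sync - with_far_sync) / without_far_sync * 100.0


# Run durations in seconds, read off the chart (the pairing is an assumption).
runs = {
    "CMSR": (262.7, 123.7),  # DML mostly workload
    "LCGR": (88.3, 91.0),    # read only workload
}

reductions = {name: runtime_reduction_pct(*pair) for name, pair in runs.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:+.1f}% runtime reduction with Far Sync")
```

Under that pairing the DML-heavy replay runs roughly 53% shorter with Far Sync, while the read-only replay changes by only about 3%, which is what one would expect when commit acknowledgment is the bottleneck.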
Far Sync summary
• Very promising for long distance replication if data loss is not acceptable
• Up to 60% performance gain (DML only workloads) with 25 ms network latency
• Lightweight and easy to deploy (virtual machine)
• If latency is below 5 ms, you most likely don’t need Far Sync
• There are still bugs that need fixing
Discussion