Download document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft SQL Server wikipedia , lookup

Concurrency control wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Functional Database Model wikipedia , lookup

Database wikipedia , lookup

Relational model wikipedia , lookup

ContactPoint wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Transcript
The DB2 Detective –
The Case of the Diligent DBA
Bill Minor
IBM
Sept. 2009
Highlights
No matter the extent of preparation, it seems that unexpected outages pop
up when least expected. As a result, unplanned downtime can cause
significant impact to business operations. When misfortune strikes, there are
always two predominant questions, "What happened?" and "When will the
system be available again?" Mysteries can be resolved by good DB2
Detective work - identifying and circumventing actions that can lead to
problems. When unexpected problems do occur, it is vital to be able to act
decisively and swiftly.
The goal of this session is to introduce database inspection tools and provide
insights that can be used to help monitor the integrity and maintain the
consistency of your data. I will also provide guidance for salvage, repair,
recovery, and mitigation operations that are vital when crisis strikes and time
is of the essence.
2
Agenda
•
•
•
•
•
•
Help with defining a problem
Help with collecting and understanding information/diagnostics related to a
problem
Tools to help in the event of an outage
Tools to help avoid outages in the first place
Features to help avoid, control, and reduce the impact of an unexpected
problem (Technology Preview or TP)
RAS (Reliability, Availability and Serviceability)
3
In the Beginning ….
•
‘There is an outage’ !!!!!! Where do we start?
…… Take a deep breath … (DB2 911)
•
Keep it simple to begin; fundamentals are essential
•
Order of Investigation/ ‘Case’:
 Problem Description
Problem Determination
 Problem Resolution
4
‘Tools’ in the Detective Kit
•
•
•
•
•
•
•
•
•
•
The problem description
db2diag.log
db2trc
db2pd
db2dart and inspect
db2top
db2exfmt
admin functions
db2cos, db2support, db2fodc
Resiliency (mitigation/toleration/avoidance)
5
An ‘Outage’
•
“Outage” – subjective terminology:
 Data availability:
 slow access
 no access
 Unexpected crashes
•
Notes:
• There is always a trade-off between how much time can be spent
investigating a problem and making system available again to users
• Other factors: risk mitigation; establishing cause; development of a
workaround (in absence of a ‘fix’ obviously don’t want to hit this again)
An outage can be planned (i.e. maintenance) or unplanned (unexpected
errors, terminations, slow or unavailable access to data, …).
‘Path to DBA diligence’
 Handling and avoiding unplanned outages via: investigation, problem
definition, monitoring, prevention, education, …… (hopefully something
from this presentation ) …
6
What Information Needs to be Collected?
•
•
•
•
Each problem has it own set of one or more signatures
Problems (even those with the same source – so ‘same
problem’) can manifest themselves in different ways
• Systems and environments are different: hardware,
software, different users, different data (scope and
volume), …
There is a set of fundamental or initial information/diagnostics
that needs to be collected for every problem .. No matter what
Keep in mind that additional information or diagnostics could
be required depending on the initial analysis investigation
• Goal: To minimize the amount of effort required in
collecting diagnostics
7
The Problem Description
•
VERY VERY VERY IMPORTANT (may seem obvious)
•
•
Avoid taking the wrong path
Invest in:
• Establishing “What happened?”
• Getting the correct information and diagnostics
REQUIRED FUNDAMENTALS:
•
•
•
•
•
•
What is the current state/situation?
• Describe what happened and what needs to happen
• Include recent problem or change history
• Client/Server configuration
What changed? No really, what changed? (Usually something has but just not
apparent how, why or when)
Include specific actions known to be performed at the time of the incident
Product version and release number (FP? Special build?)
SQL error codes
Scope: System Wide? Instance Wide? Database? Application(s)?
8
Performance Issues – What? and When?
•
•
Identification of such an issue requires a baseline – a before and after
comparison
Typically the issue is, ‘things are slow’ (really mean ‘slower’)
• Slow query
•
•
•
•
•
•
•
After running RUNSTATS
After applying a fixpack
After migrating to a new release of DB2
After making a database configuration change
• Performance of triggers, stored procedures, UDFs with SQL,
crashes/traps during query execution
Differentiate between overall system performance and specific actions
Date/time correlation? Other maintenance activities?
Different systems? (db2pd –osinfo)
9
Problem ‘Scope’
•
Scope and Cause are required in order develop and execute action
plan; action plan may be ‘straightforward’ i.e. restore the database
•
Manifest itself in one of the following outage scenarios:
• Instance trap – db2start required
• Database trap
• Application failure
10
Diagnostics to Collect at the Outset
Fundamental
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
DB2 Build Level
Operating System type and level
DB2 registry variable settings
MPP? SMP? Single-processor? Database partitioned? Which partition?
Syntax of command/statement associated with issue
DB2 error received
db2diag.log file which sufficient history
Notification log
DB2 Trap files (Call stacks), DB2 dump files
History File
db2 trace
Messages file
Core file, system error logs (Unix:errpt Windows: event logs)
Log files and log file control structures
…
11
db2diag.log file
•
•
•
Where is it? $HOME/sqllib/db2dump/
What is it? ‘Supplementary’ diagnostic logging facility
How to read? Each entry:
• Time ordered entries
• Database specific; Database node/partition (i.e. DPF specific)
• Process/thread ID; Application specific identifiers/handles
• Function name where entry was generated
• Messaging and perhaps some data
2009-03-03-10.26.14.523824-300 I37414837E502
LEVEL: Error
PID
: 8083
TID : 54
PROC : db2sysc 20
INSTANCE: svtdbm3
NODE : 020
DB
: MASUN1
APPHDL : 0-138
APPID: *N0.svtdbm3.090303150827
AUTHID : SVTDBM3
EDUID
: 54
EDUNAME: db2agntp (MASUN1) 20
FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800
MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
DIA8312C Disk was full.
•
Can be parsed with diag tool and/or admin function (examples in Backup Slides)
12
db2trc – Trace facility
•
•
•
Useful for problems/issues that can be recreated
Generates a flow and format of the code paths taken by DB2 while the trace
is active
• Traces all DB2 activity in an instance on a physical node (database
partition). If many logical db2 nodes, all activity is traced
• Trace to memory buffer or file (slower)
• Trace to file for trace wraps
Always include corresponding db2diag.log file with a db2 trace
Examples:
db2trc on –l 8M
… trace to memory
db2trc on –f mytrace.trc
… trace to file
db2trc inf
…. Information returned when trace is active)
db2trc flw mytrace.trc mytrace.flw
……
db2trc fmt mystrace.trc mytrace.fmt
…… Readable versions of trace data
13
db2dart (Database Analysis and Reporting Tool)
•
Primarily to examine databases for architectural correctness
• Check validity of meta-data structures, page headers, row headers, …
• Runs against data as it exists on disk (primarily an offline tool)
• Critical for investigating problems associated with data ‘corruption’
• Granularity: database /db, tablespace /ts, table /t
EXAMPLES:
db2dart SAMPLE /db
db2dart SAMPLE /ts /tsi 4
db2dart SAMPLE /t /tsi 2 /oi 5
•
•
•
Marking problematic objects as invalid; recovering or extracting data
Consistency: Index key data checking
HWM – space tablespace space management
14
Inaccessible Table – Salvaging Data
•
Example: db2dart SAMPLE /ddel
…
Table ID or name, tablespace ID, first page, num of pages:
(suffic page number with 'p' for pool relative)
13 2 0 100
7 of 7 columns in the table will be dumped.
Column numbers and datatypes of the columns dumped:
0 SMALLINT
1 VARCHAR() -VARIABLE LENGTH CHARACTER STRING
2 SMALLINT
3 CHAR() -FIXED LENGTH CHARACTER STRING
4 SMALLINT
5 DECIMAL
6 DECIMAL
Default filename for output data file is TS2T13.DEL,
do you wish to change filename used? y/n
n
Filename used for output data file is TS2T13.DEL.
will be appended to it.
If existing file, data
15
Marking Indexes As Invalid – db2dart /MI
Example:
db2dart testdb /mi /tsi 2 /oi 4
•
Time of index rebuild dictated by setting of INDEXREC database cfg
variable
• Usually one of RESTART (default) or ACCESS
• RESTART – database restarted
• ACCESS – when index is first accessed
•
Messages written to db2diag.log file indicating index rebuild
16
Checking Database Backup Images
•
•
Database backup images represent one of your ultimate recovery options
Hence, checking their integrity is prudent
•
db2ckbkp
• Check the integrity of a database backup image
• Can extract pages from the backup image to a file (-e option)
• This may be part of a problem causal analysis or repair procedure
 What ‘state’ is the data in the backup image?
•
Tivoli Storage Manager (TSM): db2adutil -verify
17
History File
•
Records information specific to the following database actions:
• BACKUP, ROLLFORWARD, DROPPED TABLE, LOAD, CREATE
TABLESPACE, RENAME TABLESPACE, REORG, ALTER TABLESPACE,
ARCHIVE LOG
•
Dropped table recovery
• CREATE or ALTER TABLESPACE option DROPPED TABLE RECOVERY
• Allows recovery of dropped table data using table space-level restore and
rollforward operations
• Faster than database recovery and database available to users
18
LIST HISTORY – Dropped Table Recovery
1 Identify the table which was dropped
Op Obj Timestamp+Sequence Type Dev Earliest Log Current Log Backup ID
-- --- ------------------ ---- --- ------------ ------------ -------------D T 20090223125627
000000000000693f00020013
---------------------------------------------------------------------------"BMINOR "."STAFF3" resides in 1 tablespace(s):
00001 USERSPACE1
---------------------------------------------------------------------------Comment: DROP TABLE
Start Time: 20090223125627
End Time: 20090223125627
Status: A
---------------------------------------------------------------------------EID: 9
DDL: CREATE TABLE "BMINOR "."STAFF3" ( "ID" SMALLINT NOT NULL , "NAME" VARCHAR(9) ,
"DEPT" SMALLINT , "JOB" CHAR(5) , "YEARS" SMALLINT , "SALARY" DECIMAL(7,2) , "COMM"
DECIMAL(7,2) ) IN "USERSPACE1" ;
----------------------------------------------------------------------------
19
Dropped Table Recovery
(continued)
2 Restore a database- or table space-level backup image taken
before the table was dropped
3 Create an export directory to which files containing the table data
are to be written
4 Roll forward to a point in time after the table was dropped, using
the RECOVER DROPPED TABLE option on the
ROLLFORWARD DATABASE command (alternatively, roll
forward to the end of the logs, so that updates to other tables in
the table space or database are not lost)
5 Re-create the table using the CREATE TABLE statement from
the recovery history file.
6 Import the table data that was exported during the rollforward
operation into the table
20
Performance – Explain Plans
•
•
•
Optimizer access plans change … could be legitimate, maybe not
Compare plans, stats, configuration: before vs after
Monitor
Example:
db2 –tvf $HOME/sqllib/misc/EXPLAIN.DDL
db2 explain all for “select * from bminor.staff where id=10”
db2exfmt -d sample -n % -e % -s % -w -l -# 0 -o explainPlan.out -g TIC
21
db2exfmt Output - Example
Objects Used in Access Plan:
--------------------------Access Plan:
----------Total Cost:
Query Degree:
Rows
RETURN
(
1)
Cost
I/O
|
6.28
FETCH
(
2)
11.5946
1
/---+---\
6.28
157
IXSCAN TABLE: BMINOR
(
3)
STAFF
3.2496
Q1
0
|
157
INDEX: BMINOR
I1_STAFF
Q1
11.5946
1
Schema: BMINOR
Name: I1_STAFF
Type: Index
Time of creation:
2009-03-02-15.52.41.836445
Last statistics update:
Number of columns:
1
Number of rows:
157
Width of rows:
-1
Number of buffer pool pages: 1
Distinct row values:
No
Tablespace name:
USERSPACE1
Tablespace overhead:
7.500000
Tablespace transfer rate:
0.060000
Source for statistics:
Single Node
Prefetch page count:
32
Container extent page count: 32
…
Schema: BMINOR
Name: STAFF
Type: Table
Time of creation:
2009-03-02-15.50.29.727941
Last statistics update:
Number of columns:
7
Number of rows:
157
Width of rows:
40
Number of buffer pool pages: 1
Number of data partitions:
1
Distinct row values:
No
Tablespace name:
USERSPACE1
Tablespace overhead:
7.500000
Tablespace transfer rate:
0.060000
22
db2pd – Monitor and Troubleshoot DB2
• Indispensable for monitoring and troubleshooting
• Non-intrusively capable of obtaining various statistics, snapshot
information, internal meta-data, … from an instance
• Relevant options (some of my favorites):
• -osinfo, -mempools, -memblocks, -edus (more to come), -stack,
-applications, -transactions, -bufferpools, -logs, -pages,
-tablespaces, -tcbstats index, -apinfo, -fvp, …
23
Achieving Understanding via SQL Routines
•
•
DB2 has a very rich and diverse set of SQL routines that can be used probe
and monitor one’s system
For example:
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
PD_GET_DISG_HIST
PD_GET_LOG_MSGS
PDLOGMSGS_LAST24HOURS
DB_HISTORY
LOG_UTILIZATION
LONG_RUNNING_SQL
LOCKS_HELD
LOCKWAITS
CONTAINER UTILIZATION
TBSP_UTILIZATION
ADMIN_GET_TAB_INFO, ADMIN_GET_INDEX_INFO
APPL_PERFORMANCE
BP_HITRATIO, BP_READ_IO, BP_WRITE_IO
TOP_DYNAMIC_SQL
MON_GET_BUFFERPOOL, MON_GET_TABLE, MON_GET_TABLESPACE, ….
24
db2top
•
What is it?
• DB2 dynamic monitoring tool. But so are snapshots?
• Snapshot data is typically cumulative (point in time)
• db2top provides delta values in real time
• db2top integrates multiple types of snapshot data together
• Diagnose performance and resource problems, optimize
• Provides a unified, single-system view of a multi-partition database or
single-partition database for the purposes of providing a dynamic realtime view of a running DB2 system
• Availability: V8 FP17, V9.1 FP6, V9.5 FP2
• developerworks whitepaper:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/
25
When Unexpected Problems Occur –
db2cos :DB2 Call Out Script
•
Invoked automatically by Database invoked by default when the database manager
cannot continue processing due to a panic, trap, segmentation violation or exception
•
Each default db2cos script will invoke db2pd commands to collect information in an
unlatched manner
•
Order of events when trap/panic occurs:
• Trap file is created
• Signal handler is called
• db2cos script is called
• An entry is logged in the administration notification log
• An entry is logged in the db2diag log file
• Information is generated by db2pd (operating system, the Version and Service
Level installed, the database manager and database configuration, as well as
information about the state of the agents, memory pools, memory sets, memory
blocks, applications, utilities, transactions, …)
26
db2support – Diagnostic Data Collection
• Collects environment data about either a client or server machine
and places the files containing system data into a compressed file
archive
• Collects db and dbm cfg info, reg var settings, db2level, db2diag.log,
… all into one file called db2support.zip
• Collects up diagnostics generated by db2cos* scripts
• db2supp_opt.zip within db2support.zip
 This file contains a host of optimizer related info: query text,
exfmt output (explain of query), db2look and db2batch output,
catalog table info, …
• To assist with Optimizer (plan) problems
db2support . –db <dbname> -st “select * from bminor.staff”
27
db2pdcfg
• Basically a tool which can set or change flags which influence the
database system behavior for problem determination purposes
• Control behaviors of cos and fodc (more to come on this one)
• Neat option: db2pdcfg –flushbp –db <dbname>
• Allow faster failover from primary to secondary in HADR setups
• Monitor and tune bufferpools for number of dirty pages (use
inconjunction with –pages and –bufferpools)
• Sync in-memory tablespace meta-data structures with on-disk
versions
• Execute before using db2dart /DHWM against active database
28
First Occurrence Data Capture (FODC)
•
A facility (V9.5) to capture diagnostic information when a problem first occurs
• Automatic mode: invoked by DB2 engine when outage or error condition
is detected
• Traps, panics, corruptions
• Manual mode: user initiated – db2fodc
• Hangs, performance issues
•
Current setting: db2pdcfg –fodc status
• db2pdcfg can be used to change defaults
•
Information collected and placed in subdirectory under DIAGPATH
• FODC_<outagetype>_<timestamp>
•
Run db2support after db2fodc to collect all the fodc information and package
into one zip file that can be sent to Support
29
db2fodc Example
Example: db2fodc –hang –db SAMPLE
2009-03-03-11.33.19.657457-300 E1683E654
LEVEL: Warning
PID
: 619
TID : 47350926685104PROC : db2fodc
INSTANCE: bminor
NODE : 000
FUNCTION: DB2 UDB, RAS/PD component, pdDb2FODCMain, probe:30
MESSAGE : ADM14003W FODC has been invoked by the user from db2fodc tool for
symptom "hang" and diagnostic information has been recorded in
directory
"/home/bminor/sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318".
Please look in this directory for detailed evidence about what
happened and contact IBM support if necessary to diagnose the
problem.
File Listing for ../sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318
6116.1.000.stack.txt
6116.11.000.stack.txt
6116.12.000.stack.txt
6116.13.000.stack.txt
6116.14.000.stack.txt
6116.15.000.dump.bin
6116.15.000.stack.txt
6116.16.000.apm.bin
6116.16.000.dump.bin
6116.16.000.stack.txt
db2eventlog.000
6116.17.000.stack.txt
6116.18.000.stack.txt
6116.19.000.stack.txt
6116.20.000.stack.txt
6116.21.000.stack.txt
6116.22.000.stack.txt
6116.23.000.stack.txt
6116.24.000.stack.txt
6116.25.000.stack.txt
6116.26.000.stack.txt
db2fodc_hang.log
6116.27.000.stack.txt
6116.28.000.stack.txt
6116.29.000.apm.bin
6116.29.000.dump.bin
6116.29.000.stack.txt
6116.30.000.apm.bin
6116.30.000.dump.bin
6116.30.000.stack.txt
6116.31.000.apm.bin
6116.31.000.dump.bin
6116.31.000.stack.txt
6116.32.000.apm.bin
6116.32.000.dump.bin
6116.32.000.stack.txt
6116.33.000.apm.bin
6116.33.000.dump.bin
6116.33.000.stack.txt
6116.34.000.apm.bin
6116.34.000.dump.bin
6116.34.000.stack.txt
DB2CONFIG
DB2PD
DB2SNAPS
DB2TRACE
OSCONFIG
OSSNAPS
OSTRACE
30
Resiliency
• Avoiding and dealing with ‘errors’ that can result in
unplanned outages
31
Online Data Consistency Checking (V9.5)
•
Index to Data consistency problems can be categorized as:
 Orphaned index entry:
An index entry points to a deleted or non-existent record slot
 Invalid index entry:
An index entry points to a valid slot on a data page, but index key data
does not match record data
 Missing key:
A data row is not indexed in at least one of the indexes on the table
 Duplicate Record IDs (RIDs):
A data row is indexed more than once
These are classified as logical corruptions.
32
Key-Data Checking via DB2 INSPECT (V9.5)
•
New INDEXDATA cross-object checking clause to the INSPECT utility
Example:
db2 inspect check table name STAFF indexdata results keep insp.out
NOTE:
The INDEXDATA option only examines the logical inconsistency between
index and data. Therefore, it is recommended that you first run INDEX and
DATA checking separately, to rule out any physical corruption, before
running INDEXDATA checking.
33
Threaded Engine Model for Unix in V9.5
•
•
•
•
•
DB2 has always had a threaded engine model on Windows platforms
In V9.5, DB2 Unix engine has now evolved from process-based to threadbased model
All threads within a process can share the memory allocated by, or within,
that process
Several advantages: basically performance and memory management
From a DBA perspective:
• Better administration and tuning of memory through administration
• Resiliency improvements (upcoming slides)
34
DB2 Processes vs. Threads: ps -ef
•
V9.1:
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
bminor
•
6107356
6148108
6180952
6193316
6660150
6717616
884882
1237024
1458238
1556708
1757202
1908750
2195682
4780132
6717616
1458238
1458238
6107356
6107356
1458238
1458238
6107356
6717616
1458238
1458238
1
0
0
0
0
0
0
0
0
0
0
0
0
0
13:00:53
13:01:03
13:08:22
13:08:22
13:00:53
13:00:53
13:08:22
13:08:22
13:00:53
13:01:07
13:08:22
13:08:22
13:08:21
-
0:00
0:00
0:00
0:00
0:00
0:00
0:00
0:00
0:00
0:14
0:00
0:00
0:00
db2sysc
db2agent (SAMPLE)
db2pclnr
db2pclnr
db2licc
db2ipccm
db2loggw (SAMPLE)
db2lfr (SAMPLE)
db2gds
db2stmm (SAMPLE)
db2dlock (SAMPLE)
db2pclnr
/home4/bminor/sqllib/bin/db2bp
V9.5:
bminor
bminor
bminor
10364 10362
10377 10362
10385
1
1 13:15 pts/19
1 13:15 pts/19
0 13:15 pts/19
00:00:00 db2sysc
00:00:00 db2acd,0,0,0,1,0,0,1,0,8a18
00:00:00/home/bminor/sqllib/bin/db2bp
35
Running Threads: db2pd -edus
Database Partition 0 -- Active -- Up 2 days 17:59:14
List of all EDUs for database partition 0
db2sysc PID: 1187
db2wdog PID: 1185
db2acd PID: 1201
EDU ID
TID
Kernel TID
EDU Name
USR
SYS
=========================================================================================================
65
47726691608896 20490
db2agntdp (SAMPLE )
0.000000
0.000000
64
47726695803200 999
db2agent (SAMPLE)
0.240000
0.110000
63
47726708386112 998
db2agent (SAMPLE)
0.720000
0.160000
61
47726712580416 308
db2evmli (DB2DETAILDEADLOCK)
0.000000
0.000000
60
47726716774720 307
db2wlmd (SAMPLE)
0.000000
0.000000
59
47726720969024 306
db2pfchr (SAMPLE)
0.000000
0.000000
57
47726737746240 304
db2pfchr (SAMPLE)
0.000000
0.000000
55
47726733551936 302
db2pclnr (SAMPLE)
0.000000
0.000000
54
47726729357632 301
db2pclnr (SAMPLE)
0.000000
0.000000
53
47726746134848 300
db2dlock (SAMPLE)
0.000000
0.000000
52
47726762912064 32767
db2lfr (SAMPLE)
0.000000
0.000000
51
47726741940544 32766
db2loggw (SAMPLE)
0.020000
0.030000
50
47726758717760 32765
db2loggr (SAMPLE)
0.000000
0.030000
49
47726699997504 1554
db2taskd (SAMPLE)
0.000000
0.000000
25
47726754523456 1365
db2stmm (SAMPLE)
8.020000
0.460000
16
47726767106368 1265
db2agent (SAMPLE)
1.420000
0.810000
15
47726771300672 1199
db2resync
0.000000
0.000000
14
47726775494976 1196
db2ipccm
0.080000
0.030000
13
47726779689280 1195
db2licc
0.000000
0.000000
12
47726783883584 1194
db2thcln
0.000000
0.000000
11
47726788077888 1192
db2alarm
0.020000
0.020000
1
47726528031040 1191
db2sysc
5.420000
5.620000
36
Memory Protection via Storage Keys (V9.5)
•
DB2 is leveraging technology which is specific to the POWER6 platform and
AIX (TL5300-06 or greater)
• Hardware feature called Storage Keys
• Allows for memory protection functionality
• DB2 V9.5 has the capability to detect erroneous access to critical
regions of it’s own memory, primarily bufferpool memory
• This is pro-active protection:
• When such access is detected it is stopped before a corruption can
result
• Offending UDF or Stored Procedure receives an error code
indicating that an incorrect memory access was performed
• Cannot protect against all types of bugs but it does provide an added
layer of protection
37
Enablement of Memory Protection via
Storage Keys (V9.5)
•
Registry variable:
DB2_MEMORY_PROTECT = [ NO | YES ]
•
Error DBI1301E on attempted set if:
• Hardware, OS does not support storage keys
• Large page support (i.e. DB2_LARGE_PAGE_MEM) is enabled
38
Higher Availability with Storage Key Support
•
Thread Suspension: An additional level of protection that will increase
availability of systems when abnormal termination of a DB2 Kernel Thread
occurs
DB2_THREAD_SUSPENSION = [ OFF | ON ]
•
What is it?
 It allows you to control whether a DB2 instance sustains a trap by
suspending a faulty DB2 kernel thread (a thread that has tried illegally to
access memory protected with storage protection keys).
39
Error Toleration (TP)
• Goal: Reduce unplanned outages
• Error toleration via:
• Trap resilience
• Physical read error toleration
• Logical read error toleration
• Registry variable:
• DB2RESILIENCE=ON/TRUE or not defined (ie default behavior)
activates error toleration support for the above
• DB2RESILIENCE=OFF/FALSE disables these capabilities.
40
Extended Trap Resilience (TP)
•
Registry Variable:
DB2RESILIENCE= [ ON | OFF ]
•
When a DB2 server detects a trap the following actions will be taken by
DB2:
 First Occurrence Data Capture (FODC) will be activated
 Execution stack is dumped
 If Storage Key support is active then process thread suspension
decisions
 If Storage Key suspension decisions allow for thread suspension,
suspend thread
 If Storage Key support is not active or fails decision points to suspend
thread, then proceed with non Storage Key survival decisions (ie. Utilize
this new capability)
 DB2 will determine if the current environment is safe to allow recovery
41
Messaging:
2008-11-17-09.56.27.542572-300 I75439E563
LEVEL: Severe
PID
: 28040
TID : 47643040409920PROC : db2sysc
INSTANCE: marvin
NODE : 000
DB
: TESTDB
APPHDL : 0-21
APPID: *LOCAL.marvin.081117145537
AUTHID : MARVIN
EDUID
: 29
EDUNAME: db2agent (TESTDB)
FUNCTION: DB2 UDB, RAS/PD component, pdResilienceIsSafeToRollback,probe:800
DATA #1 : String, 37 bytes
Trap Sustainability Criteria Checking
DATA #2 : Hex integer, 8 bytes
0x00000000000A0400
DATA #3 : Boolean, 1 bytes
true
DATA #1 A generic description of the data following
DATA #2 The hex dump of the trap sustainment criteria
DATA #3 The decision on whether to attempt to sustain the trap (true/false)
42
Physical Read Error Toleration (TP)
• Physical read errors are detected when retrieving DB2 pages from the
storage medium (for disk drive).
• This error could be caused by issues with the contents of the data on
disk or by problems accessing the physical device on which the data
resides. Currently, when DB2 detects either of these errors the DB2
instance is shutdown.
• This solution will provide enhanced toleration of physical read errors
such that if such an error is encountered, rather than force the DB2
instance offline, the SQL statement or operation being executed will
fail, a new SQLCODE will be returned to the application when
appropriate, logging of appropriate diagnostics will occur, and the
DB2 instance will remain active.
43
Toleration of Physical Read Errors
•
•
•
•
•
•
•
•
•
When a DB2 server detects a physical read error there will be a maximum of two
more attempts to read the page from disk and validate the DB2 page header. If
at the end of the multiple attempts an error is still determined to exist then:
• If DB2RESILIENCE=ON
Sqlcode -1655E will be returned to the application
The SQL statement or DB2 operation fails
Administration message ADM6006E will be placed in the DB2 Administration log
at notify level URGENT
Administration message ADM6006E will be placed in the DB2 diagnostic log at
level URGENT
First Occurrence Data Capture (FODC) will be activated when the error is
detected and will be turned off when the sqlcode is returned to the application.
This will allow collection of diagnostic data generated due to this failure detection
for further analysis by DB2 Support.
The database instance remains available
Application connection remains available for continued operations
• If DB2RESILIENCE=OFF then panic the instance (that is, revert to old
behavior)
This algorithm will aid in reducing application errors which are a result of
transient file system I/O errors.
44
Logical Read Error Toleration (TP)
•
•
•
Inconsistency detected by DB2 when processing data contained within a DB2
page
Instead of hard failure (database or instance shutdown), isolate the issue and
allow overall database use to continue
• Problematic page is removed from the bufferpool, relevant diagnostics
are generated, notification is returned (messages to admin and
db2diag.log and new sqlcode to application)
Leveraged within the engine where it makes sense
45
Messaging
SQL1656E
An error occurred while processing data. The operation could not be completed, but the database
remains accessible. Contact IBM Software Support.
Explanation:
The failure to complete the operation is due to a detected DB2 page inconsistency. The SQL
statement has failed or the operation has been aborted. The database remains accessible.
User response:
Contact IBM Software Support for instructions on what diagnostic data to collect to assist in
resolving this issue. The application can retry the operation, although it may continue to fail.
sqlcode: -1656
sqlstate: 58004
ADM6007C
DB2 detected an error while processing page page-number from table space tbspace-id for
object object-id of object type object-type.
Explanation
DB2 was not able to complete the operation, but the database remains accessible. This may
indicate the presence of a serious problem that could result in a future outage.
User response
Please contact IBM Software Support and they will guide you through the proper corrective
actions.
46
Resources To Help You
•
•
•
Developerworks (Best Practices Papers, …)
IBM Database Magazine
IDUG Web Site and Solutions Journal
• DB2 LUW Chat with the Lab
http://www-01.ibm.com/software/data/db2/9/labchats.html
• IBM DB2 Support home:
www.ibm.com/software/data/db2/support/db2_9
• DB2 Fixes:
http://www-01.ibm.com/software/data/db2/support/db2_9/download.html
•
Conferences, RDUGs, Customer Lab Advocate, Early Adoption Programs, …
47
The DB2 Detective – The Case of the Diligent DBA
Bill Minor
IBM
[email protected]
48