OpenVMS Distributed Lock Manager Performance
Session ES-09-U
Keith Parris, HPQ

Background
- VMS system managers have traditionally looked at performance in 3 areas: CPU, Memory, I/O
- But in VMS clusters, what may appear to be an I/O bottleneck can actually be a lock-related issue

Overview
- VMS keeps some lock activity data that no existing performance management tools look at
- Locking statistics and lock-related symptoms can provide valuable clues in detecting disk, adapter, or interconnect saturation problems

Overview
- The VMS Lock Manager does an excellent job under a wide variety of conditions to optimize locking activity and minimize overhead, but:
  - In clusters with identical nodes running the same applications, remastering can sometimes happen too often
  - In extremely large clusters, nodes can "gang up" on lock master nodes and overload them
- Locking activity can contribute to:
  - CPU 0 saturation in Interrupt State
  - Spinlock contention (Multi-Processor Synchronization time)
- We'll look at methods of detection of, and solutions to, these types of problems

Topics
- Available monitoring tools for the Lock Manager
- How to map VMS symbolic lock resource names to real physical entities
- Lock request latencies
- How to measure lock rates

Topics
- Lock mastership, and why one might care about it
- Dynamic lock remastering
- How to detect and prevent lock mastership thrashing
- How to find the lock master node for a given resource tree
- How to force lock mastership of a given resource tree to a specific node

Topics
- Lock queues, their causes, and how to detect them
- Examples of problem locking scenarios
- How to measure pent-up remastering demand

Monitoring tools
- MONITOR utility:
  - MONITOR LOCK
  - MONITOR DLOCK
  - MONITOR RLOCK (in VMS 7.3 and above; not 7.2-2)
  - MONITOR CLUSTER
  - MONITOR SCS
- SHOW CLUSTER /CONTINUOUS
- DECamds / Availability Manager
- DECps (Computer Associates' Unicenter Performance Management for OpenVMS, earlier Advise/IT)

Monitoring tools
- ANALYZE/SYSTEM
- New SHOW LOCK qualifiers for VMS 7.2 and above:
  - /WAITING     Displays only the waiting lock requests (those blocked by other locks)
  - /SUMMARY     Displays summary data and performance counters
- New SHOW RESOURCE qualifier for VMS 7.2 and above:
  - /CONTENTION  Displays resources which are under contention

Monitoring tools
- ANALYZE/SYSTEM: new SDA extension LCK for lock tracing in VMS 7.2-2 and above
    SDA> LCK                !Shows help text with command summary
- Can display various additional lock manager statistics:
    SDA> LCK STATISTIC      !Shows lock manager statistics
- Can show busiest resource trees by lock activity rate:
    SDA> LCK SHOW ACTIVE    !Shows lock activity
- Can trace lock requests:
    SDA> LCK LOAD           !Load the debug execlet
    SDA> LCK START TRACE    !Start tracing lock requests
    SDA> LCK STOP TRACE     !Stop tracing
    SDA> LCK SHOW TRACE     !Display contents of trace buffer
- Can even trigger remaster operations:
    SDA> LCK REMASTER       !Trigger a remaster operation

Mapping symbolic lock resource names to real entities
- Techniques for mapping resource names to lock types
- Common prefixes:
  - SYS$ for the VMS executive
  - F11B$ for the XQP (file system)
  - RMS$ for Record Management Services
- See Appendix H in the Alpha V1.5 IDSM, or Appendix A in the Alpha V7.0 version

Resource names
- Example: XQP File Serialization Lock
  - Resource name format is "F11B$s" {Lock Basis}
  - Parent lock is the Volume Allocation Lock "F11B$v" {Lock Volume Name}
- Calculate the File ID from the Lock Basis
  - The Lock Basis is the RVN and File Number from the File ID (ignoring the Sequence Number), packed into 1 longword (a decoding sketch follows below)
- Identify the disk volume from the parent resource name
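As an illustration of that Lock Basis packing, here is a minimal C sketch of the decoding step. The exact bit layout assumed here (file number in the low 24 bits, RVN in the high byte) is consistent with the hex dumps shown later in this presentation, but it is an assumption for illustration; verify it against the IDSM appendix cited above before relying on it.

#include <stdio.h>

/* Hypothetical decoding of the XQP "F11B$s" Lock Basis longword into the
 * pieces of a File ID.  Assumed layout (illustration only): file number in
 * the low 24 bits, Relative Volume Number (RVN) in the high 8 bits.  The
 * Sequence Number is not part of the Lock Basis, so it cannot be recovered
 * here; it must be obtained elsewhere (e.g. from the file header). */
static void decode_lock_basis(unsigned int lock_basis)
{
    unsigned int file_number = lock_basis & 0x00FFFFFF;    /* assumed low 24 bits */
    unsigned int rvn         = (lock_basis >> 24) & 0xFF;  /* assumed high byte   */

    printf("Lock Basis %08X -> File Number %u, RVN %u (sequence number unknown)\n",
           lock_basis, file_number, rvn);
}

int main(void)
{
    /* Example value only; a real Lock Basis comes from the trailing longword
     * of an "F11B$s" resource name. */
    decode_lock_basis(0x00000148);
    return 0;
}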
Resource names
- Identifying the file from the File ID:
  - Look at file headers in the Index File to get the filespec
  - Can use the DUMP utility to display a file header (from the Index File):
    $ DUMP /HEADER /IDENTIFIER=(file_id) /BLOCK=COUNT=0 disk:[000000]INDEXF.SYS
  - Follow directory backlinks to determine the directory path
  - See example procedure FILE_ID_TO_NAME.COM (or use the LIB$FID_TO_NAME routine to do all this, if the sequence number can be obtained)

Resource names
- Example: RMS lock tree for an RMS indexed file
  - Resource name format is "RMS$" {File ID} {Flags byte} {Lock Volume Name}
  - Identify the filespec using the File ID
  - Flags byte indicates shared or private disk mount
  - Pick up the disk volume name (this is the label as of the time the disk was mounted)
  - (A decoding sketch for this name format appears just before the lock-activity examples below)
- Sub-locks are used for buckets and records within the file

[Figure: Internal structure of an RMS indexed file -- the Root Index Bucket points to Level 1 Index Buckets, which point to Level 2 Index Buckets, which in turn point to the Data Buckets.]

[Figure: RMS data bucket contents -- each Data Bucket holds multiple Data Records.]

RMS Indexed File Bucket and Record Locks
- Sub-locks of the RMS File Lock
- Bucket lock:
  - Have to look at the Parent lock to identify the file
  - 4 bytes: VBN of the first block of the bucket
- Record lock:
  - 8 bytes (6 on VAX): Record File Address (RFA) of the record

Locks and File I/O
- Lock requests and data transfers for a typical RMS indexed file I/O (prior to 7.2-1H1):
  1) Lock & get root index bucket
  2) Lock & get index buckets for any additional index levels
  3) Lock & get data bucket containing the record
  4) Lock the record
  5) For writes: write the data bucket containing the record
- Note: Most data reads may be avoided thanks to the RMS global buffer cache

Locks and File I/O
- Since all indexed I/Os access the Root Index Bucket, contention on the lock for the Root Index Bucket of a hot file can be a bottleneck
- Lookup by Record File Address (RFA) avoids the index lookup on 2nd and subsequent accesses to a record

Lock Request Latencies
- Latency depends on several things:
  - Directory lookup needed or not
  - Local or remote directory node
  - $ENQ or $DEQ operation
  - Local or remote lock master
  - If remote, type of interconnect

Directory Lookups
- This is how VMS finds out which node is the lock master
- Only needed for the 1st lock request on a particular resource tree on a given node
  - The Resource Block (RSB) remembers the master node's CSID
- Basic conceptual algorithm: hash the resource name and index into the lock directory vector, which has been created based on LOCKDIRWT values

Lock Request Latencies
- Local requests are fastest
- Remote requests are significantly slower:
  - Code path ~20 times longer
  - Interconnect also contributes latency
  - Total latency up to 2 orders of magnitude higher than for local requests

Lock Request Latency
- Client process on same node: 4-6 microseconds
[Figure: client and lock master on the same node]

Lock Request Latency
- Client across CI star coupler: 440 microseconds
[Figure: client node and lock master node connected to storage through a CI star coupler]

Lock Request Latencies
[Chart: measured lock request latencies in microseconds, ranging from 4 (local node) up to 440 (CI); the interconnects compared were Galaxy SMCI, MC2, Gigabit Ethernet, FDDI GS-FDDI-GS, FDDI GS-ATM-GS, DSSI, and CI, with intermediate values of 94, 120, 230, 270, 285, and 333.]

How to measure lock rates
- VMS keeps counters of lock activity for each resource tree
- So you can see the lock rate for an RMS indexed file, for example
  - but not for each of the sub-resources
  - but not for individual buckets or records within that file
- The SDA extension LCK can trace all lock requests if needed
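Before looking at the lock-activity examples that follow, here is a small C sketch that decodes an "RMS$" root resource name (the format described a few slides back) into its File ID fields and volume label. The offsets follow the byte order visible in the hex dumps shown below, but the exact field widths are assumptions for illustration; check the IDSM appendix before depending on them.

#include <stdio.h>
#include <string.h>

/* Decode an RMS root resource name of the form:
 *   "RMS$" {File ID} {Flags byte} {Lock Volume Name}
 * Assumed offsets (2-byte file number, 2-byte sequence number, 2-byte
 * RVN/NMX, 1 flags byte, 12-byte blank-padded volume label) are inferred
 * from the hex dumps in this presentation; verify against the IDSM. */
static void decode_rms_resource_name(const unsigned char *name, size_t len)
{
    if (len < 23 || memcmp(name, "RMS$", 4) != 0) {
        printf("Not an RMS$ root resource name\n");
        return;
    }

    unsigned int file_num = name[4] | (name[5] << 8);   /* assumed offset */
    unsigned int file_seq = name[6] | (name[7] << 8);   /* assumed offset */
    unsigned int rvn_nmx  = name[8] | (name[9] << 8);   /* assumed offset */
    unsigned int flags    = name[10];                   /* shared/private mount */

    char volume[13];
    memcpy(volume, &name[11], 12);                      /* assumed 12-byte label */
    volume[12] = '\0';

    printf("File ID (%u,%u,%u), flags %02X, volume '%s'\n",
           file_num, file_seq, rvn_nmx, flags, volume);
}

int main(void)
{
    /* Bytes constructed by hand to match the SYSUAF.DAT example shown below:
     * file [12,211,0] on volume SYSFILE2. */
    unsigned char name[] = {
        'R','M','S','$', 0x0C,0x00, 0xD3,0x00, 0x00,0x00, 0x02,
        'S','Y','S','F','I','L','E','2',' ',' ',' ',' '
    };
    decode_rms_resource_name(name, sizeof name);
    return 0;
}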
Identifying busiest lock trees in the cluster with a program
- Measure lock rates based on RSB data:
  - Follow the chain of root RSBs from the LCK$GQ_RRSFL listhead via the RSB$Q_RRSFL links
- Root RSBs contain counters:
  - RSB$W_OACT: Old activity field (average lock rate per 8-second interval)
    - Divide by 8 to get the per-second average
  - RSB$W_NACT: New activity (locks so far within the current 8-second interval)
    - Transient value, so not as useful
- (A conceptual sketch of this traversal appears after the examples below)

Identifying busiest lock trees in the cluster with a program
- Look for non-zero OACT values:
  - Gather the resource name, master node CSID, and old-activity field
- Do this on each node
- Summarize the data across the cluster
- See example procedure LOCK_ACTV.COM and program LCKACT.MAR
- Or, for VMS 7.2-2 and above:
    SDA> LCK SHOW ACTIVE
  - Note: Per-node data, not a cluster-wide summary

Lock Activity Program Example
  0000002020202020202020203153530200004C71004624534D52  "RMS$F.qL...SS1"
  RMS lock tree for file [70,19569,0] on volume SS1
  File specification: DISK$SS1:[DATA8]PDATA.IDX;1
  Total: 11523
   *XYZB12  6455
    XYZB11   746
    XYZB14   611
    XYZB15   602
    XYZB23   564
    XYZB13   540
    XYZB19   532
    XYZB16   523
    XYZB20   415
    XYZB22   284
    XYZB18   127
    XYZB21   125
  (* = Lock Master Node for the resource)
{This is a fairly hot file. Here the lock master node is optimal.}

Lock Activity Program Example
  0000002020202032454C494653595302000000D3000C24534D52  "RMS$.......SYSFILE2"
  RMS lock tree for file [12,211,0] on volume SYSFILE2
  File specification: DISK$SYSFILE2:[SYSFILE2]SYSUAF.DAT;5
  Total: 184
    XYZB16  75
    XYZB20  48
    XYZB23  41
    XYZB21  16
    XYZB19   2
   *XYZB15   1
    XYZB13   1
    XYZB14   0
    XYZB12   0
{This reflects user logins, process creations, password changes, and such. Note the poor lock master node selection here (XYZB16 would be optimal).}
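The LCKACT-style program described above walks kernel data structures, so the following is only a conceptual C sketch of the per-tree rate calculation. The structure definitions here are simplified stand-ins; real code must run in kernel mode holding the LCKMGR spinlock and use the actual RSB layout, as the example programs on the Freeware CD do.

#include <stdio.h>

/* Simplified stand-ins for the real VMS structures (illustration only). */
struct rsb {
    struct rsb    *rrsfl;        /* forward link in root-RSB list (RSB$Q_RRSFL)  */
    unsigned short oact;         /* RSB$W_OACT: locks per 8-second interval      */
    unsigned short nact;         /* RSB$W_NACT: locks so far in current interval */
    unsigned int   master_csid;  /* CSID of the current lock master node         */
    char           resnam[32];   /* resource name (truncated for this sketch)    */
};

/* Walk the chain of root RSBs (conceptually, from the LCK$GQ_RRSFL listhead)
 * and report any tree with a non-zero old-activity count.  OACT is an
 * 8-second total, so divide by 8 for an approximate per-second rate. */
static void report_busy_trees(const struct rsb *listhead)
{
    for (const struct rsb *r = listhead; r != NULL; r = r->rrsfl) {
        if (r->oact != 0)
            printf("%-32.32s  master CSID %08X  ~%u locks/sec\n",
                   r->resnam, r->master_csid, (unsigned)(r->oact / 8));
    }
}

int main(void)
{
    /* Fabricated sample data, standing in for what a kernel-mode walk would see. */
    struct rsb b = {    0,  1472, 0, 0x00010030, "RMS$... SYSUAF.DAT tree" };
    struct rsb a = {   &b, 51640, 0, 0x0001002C, "RMS$... PDATA.IDX tree"  };
    report_busy_trees(&a);
    return 0;
}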
Example: Application (re)opens file frequently
- Symptom: High lock rate on the File Access Arbitration Lock for an application data file
- Cause: BASIC program re-executing the OPEN command for a file; BASIC dutifully closes and then re-opens the file
- Fix: Modify the BASIC program to execute the OPEN statement only once, at image startup time

Lock Activity Program Example
  00000016202020202020202031505041612442313146  "F11B$aAPP1 ...."
  Files-11 File Access Arbitration lock for file [22,*,0] on volume APP1
  File specification: DISK$APP1:[DATA]XDATA.IDX;1
  Total: 50
   *XYZB15  8
    XYZB21  7
    XYZB16  7
    XYZB19  6
    XYZB20  6
    XYZB23  6
    XYZB18  5
    XYZB13  3
    XYZB12  1
    XYZB22  1
    XYZB14  1
{This shows where the application is apparently opening (or re-opening) this particular file 50 times per second.}

Lock Mastership (Resource Mastership) concept
- One lock master node is selected by VMS for a given resource tree at a given time
- Different resource trees may have different lock master nodes

Lock Mastership (Resource Mastership) concept
- The lock master remembers all locks on a given resource tree for the entire cluster
- Each node holding locks also remembers the locks it is holding on resources, to allow recovery if the lock master node dies

Lock Mastership
- The lock mastership node may change for various reasons:
  - Lock master node goes down -- a new master must be elected
  - VMS may move lock mastership to a "better" node for performance reasons:
    - LOCKDIRWT imbalance found, or
    - Activity-based Dynamic Lock Remastering
  - Lock master node no longer has interest

Lock Remastering
- Circumstances under which remastering occurs, and does not:
  - LOCKDIRWT values: VMS tends to remaster to a node with a higher LOCKDIRWT value, never to a node with a lower LOCKDIRWT
  - Shifting is initiated based on activity counters in the root RSB
  - The PE1 parameter being non-zero can prevent movement, or place a threshold on lock tree size
  - Shift if the existing lock master loses interest

Lock Remastering
- VMS rules for the dynamic remastering decision based on activity levels (assuming equal LOCKDIRWT values):
  1) Must meet a general threshold of 80 lock requests so far (LCK$GL_SYS_THRSH)
  2) The new potential master node must have at least 10 more requests per second than the current master (LCK$GL_ACT_THRSH)

Lock Remastering
- VMS rules for dynamic remastering:
  3) The estimated cost to move (based on the size of the lock tree) must be less than the estimated savings (based on the lock rate)
     - except that if the new master meets criterion (2) for 3 consecutive 8-second intervals, the cost is ignored
  4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)

Lock Remastering
- VMS rules for dynamic remastering:
  5) If PE1 on the current master has a negative value, remastering trees off the node is disabled
  6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
- (A conceptual sketch of rules 1-6 appears after the implications slides below)

Lock Remastering
- Implications of the dynamic remastering rules:
  - LOCKDIRWT must be equal for lock activity levels to control the choice of lock master node
  - PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node
  - The RSB stores the lock activity counts, so even high activity counts can be lost if the last lock is dequeued on a given node and the RSB thus gets deallocated

Lock Remastering
- Implications of the dynamic remastering rules:
  - With two or more large CPUs of equal size running the same application, lock mastership "thrashing" is not uncommon:
    - 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second
    - Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload
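For reference, the activity-based decision rules above can be summarized in a short conceptual routine. This is not VMS source; the thresholds and data-cell names are taken only from the description above, and the cost/savings estimate is a hypothetical placeholder for whatever VMS actually computes.

/* Conceptual restatement of the dynamic remastering rules described above
 * (assuming equal LOCKDIRWT values).  Illustration only. */
struct tree_stats {
    unsigned total_requests;    /* lock requests seen so far on this tree        */
    unsigned master_rate;       /* current master's lock rate (per second)       */
    unsigned candidate_rate;    /* candidate node's lock rate (per second)       */
    unsigned same_count;        /* consecutive 8-second intervals candidate won  */
    unsigned tree_size;         /* number of locks in the tree                   */
    unsigned active_remasters;  /* remaster operations in progress on the master */
    int      pe1;               /* PE1 SYSGEN parameter on the current master    */
};

static int should_remaster(const struct tree_stats *t)
{
    if (t->total_requests < 80)                   return 0;  /* rule 1: LCK$GL_SYS_THRSH */
    if (t->candidate_rate < t->master_rate + 10)  return 0;  /* rule 2: LCK$GL_ACT_THRSH */

    /* rule 3: cost (proportional to tree size) vs. savings (proportional to the
     * rate difference), ignored after 3 consecutive winning intervals.
     * These estimates are illustrative placeholders. */
    unsigned est_cost    = t->tree_size;
    unsigned est_savings = t->candidate_rate - t->master_rate;
    if (t->same_count < 3 && est_cost >= est_savings)   return 0;

    if (t->active_remasters >= 5)                 return 0;  /* rule 4: LCK$GL_RM_QUOTA  */
    if (t->pe1 < 0)                               return 0;  /* rule 5: remastering off disabled */
    if (t->pe1 > 0 && t->tree_size >= (unsigned)t->pe1) return 0;  /* rule 6: size threshold */

    return 1;  /* remaster the tree to the candidate node */
}

int main(void)
{
    /* Fabricated example: a 2000-lock tree where a remote node has been doing
     * 150 requests/sec against our 30/sec for 3 consecutive intervals. */
    struct tree_stats t = { 5000, 30, 150, 3, 2000, 0, 0 };
    return should_remaster(&t) ? 0 : 1;
}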
Lock Remastering
- Lock mastership thrashing results in user-visible delays
- Lock operations on a tree are stalled during a remaster operation
- Locks and resources were sent over at one per SCS message, so remastering large lock trees could take a long time
  - e.g. 10 to 50 seconds for a 15K-lock tree, prior to 7.2-2
- An improvement in VMS version 7.2-2 and above gives a very significant performance gain by using 64-Kbyte block data transfers instead of sending 1 SCS message per RSB or LKB

How to Detect Lock Mastership Thrashing
- Detection of remastering activity:
  - MONITOR RLOCK in 7.3 and above (not 7.2-2)
  - SDA> SHOW LOCK/SUMMARY in 7.2 and above
  - Change of mastership node for a given resource
- Check message counters under SDA:
    SDA> EXAMINE PMS$GL_RM_RBLD_SENT
    SDA> EXAMINE PMS$GL_RM_RBLD_RCVD
  - Counts which increase suddenly by a large amount indicate remastering of large tree(s)
    - SENT: off of this node
    - RCVD: onto this node
  - See example procedures WATCH_RBLD.COM and RBLD.COM

How to Prevent Lock Mastership Thrashing
- Unbalanced node power
- Unequal workloads
- Unequal values of LOCKDIRWT
- Non-zero values of PE1

How to find the lock master node for a given resource tree
1) Take out a Null lock on the root resource using $ENQ
   - VMS does the directory lookup and finds out the master node
2) Use $GETLKI to identify the current lock master node's CSID and the lock count
   - If the local node is the lock master, and the lock count is 1 (i.e. only our NL lock), there's no interest in the resource now

How to find the lock master node for a given resource tree
3) $DEQ to release the lock
4) Use $GETSYI to translate the CSID to an SCS nodename
- See example procedure FINDMASTER_FILE.COM and program FINDMASTER.MAR, which can find the lock master node for RMS file resource trees
- (A hedged sketch of this $ENQ/$GETLKI/$GETSYI sequence appears at the end of this group of slides)

Controlling Lock Mastership
- Lock remastering is a good thing
  - It maximizes the number of lock requests which are local (and thus fastest) by trying to move lock mastership of a tree to the node with the most activity on that tree
- So why would you want to wrest control of lock mastership away from VMS?
  - To spread the lock mastership workload more evenly across nodes, to help avoid saturation of any single lock master node
  - To provide the best performance for a specific job by guaranteeing local locking for its files

How to force lock mastership of a resource tree to a specific node
- 3 ways to induce VMS to move a lock tree:
  1) Generate a lot of I/Os
     - For example, run several copies of a program that rapidly accesses the file
  2) Generate a lot of lock requests without the associated I/O operations
  3) Generate the effect of a lot of lock requests without actually doing them, by modifying VMS's data structures

How to force lock mastership of a resource tree to a specific node
- We'll examine:
  1) A method using documented features, and thus fully supported
  2) A method modifying VMS data structures

Controlling Lock Mastership Using Supported Methods
- To move a lock tree to a particular node (non-invasive method):
  - Assume PE1 is non-zero on all nodes to start with
  1) Set PE1 to 0 on the existing lock master node, to allow dynamic lock remastering of the tree off that node
  2) Set PE1 to a negative value (or a small positive value) on the target node, to prevent the lock tree from moving off of it afterward

Controlling Lock Mastership Using Supported Methods
  3) On the target node, take out a Null lock on the root resource
  4) Take out a sub-lock of the parent Null lock, and then repeatedly convert it between Null and some other mode
     - Check periodically to see if the tree has moved yet (using $GETLKI)
  5) Once the tree has moved, free the locks
  6) Set PE1 back to its original value on the former master node

Controlling Lock Mastership Using Supported Methods
- Pros:
  - Uses only supported interfaces to VMS
- Cons:
  - Generates significant load on the existing lock master, from which you may have been trying to off-load work
    - In some cases the node may thus be saturated and unable to initiate lock remastering
  - Programs running locally on the existing lock master can generate so many requests that the tree won't move, because you can't generate nearly as many lock requests remotely
- See example program LOTSALOX.MAR
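Here is a minimal C sketch (not the FINDMASTER.MAR program itself) of the $ENQ/$GETLKI/$DEQ/$GETSYI sequence described above: take out an NL lock on a root resource, ask $GETLKI for the CSID of the mastering node and the lock count, release the lock, and translate the CSID to a node name. Only documented system services are used, but the item codes chosen (LKI$_CSID, LKI$_LCKCOUNT, SYI$_NODENAME) are assumptions to be checked against the system services reference for the VMS version in use, and error handling is minimal.

#include <stdio.h>
#include <descrip.h>
#include <lckdef.h>
#include <lkidef.h>
#include <syidef.h>
#include <ssdef.h>
#include <starlet.h>

struct lksb { unsigned short status, reserved; unsigned int lock_id;
              unsigned char valblk[16]; };

struct item { unsigned short buflen, code; void *buffer; unsigned short *retlen; };

int main(void)
{
    /* Placeholder root resource name -- a real caller would build the actual
     * resource name here (e.g. an "RMS$..." name as decoded earlier).
     * Note: LCK$M_SYSTEM requires SYSLCK privilege, and resources are also
     * qualified by access mode, so matching a resource owned by RMS (which
     * locks in executive mode) may require running at elevated access mode. */
    $DESCRIPTOR(resnam, "EXAMPLE_ROOT_RESOURCE");

    struct lksb lksb;
    unsigned int status = sys$enqw(0, LCK$K_NLMODE, &lksb, LCK$M_SYSTEM,
                                   &resnam, 0, 0, 0, 0, 0, 0, 0);
    if (!(status & 1) || !(lksb.status & 1)) return status;

    unsigned int csid = 0, lockcount = 0;
    struct item lki_items[] = {
        { sizeof csid,      LKI$_CSID,     &csid,      0 },   /* assumed item code */
        { sizeof lockcount, LKI$_LCKCOUNT, &lockcount, 0 },   /* assumed item code */
        { 0, 0, 0, 0 }
    };
    status = sys$getlkiw(0, &lksb.lock_id, lki_items, 0, 0, 0, 0);
    sys$deq(lksb.lock_id, 0, 0, 0);                 /* step 3: release the NL lock */
    if (!(status & 1)) return status;

    char nodename[16] = "";
    unsigned short nodelen = 0;
    struct item syi_items[] = {
        { sizeof nodename - 1, SYI$_NODENAME, nodename, &nodelen },
        { 0, 0, 0, 0 }
    };
    status = sys$getsyiw(0, &csid, 0, syi_items, 0, 0, 0);   /* step 4: CSID -> name */
    if (!(status & 1)) return status;
    nodename[nodelen] = '\0';

    printf("Resource is mastered on %s (CSID %08X); cluster-wide lock count %u\n",
           nodename, csid, lockcount);
    return SS$_NORMAL;
}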
Controlling Lock Mastership By Modifying VMS Data Structures
- Goal: Reproduce the effect of lots of lock requests without the overhead of the lock requests actually occurring
- General method: Modify activity-related counts and remastering-related fields and flags in the root RSB to persuade VMS to remaster the resource tree

Controlling Lock Mastership By Modifying VMS Data Structures
1) Run the program on the node which is presently the lock master
2) Use $GETSYI to get the CSID of the desired target node, given its nodename
3) Lock down code and data
4) $CMKRNL, raise IPL, grab the LCKMGR spinlock

Controlling Lock Mastership By Modifying VMS Data Structures
5) Starting at the LCK$GQ_RRSFL listhead, follow the chain of root RSBs via the RSB$Q_RRSFL links
6) Search for the root RSB with matching resource name, access mode, and group (0 = System)

Controlling Lock Mastership By Modifying VMS Data Structures
7) Set up to trigger a remaster operation:
   - Set RSB$L_RM_CSID to the target node's CSID
   - Set RSB$B_LSTCSID_IDX to the low byte of the target node's CSID
   - Set RSB$B_SAME_CNT to 3 or more so remastering occurs regardless of cost

Controlling Lock Mastership By Modifying VMS Data Structures
   - Zero our activity counts RSB$W_OACT and RSB$W_NACT so the local lock rate seems low
   - Set the new-master activity count RSB$W_NMACT to the maximum possible (hex FFFF) to simulate tons of locking activity
   - Set the RSB$M_RM_PEND flag in the RSB$L_STATUS field to indicate a remaster operation is now pending
8) Release the LCKMGR spinlock, lower IPL, and let VMS do its job
   (A conceptual fragment for steps 7-8 follows after these slides)

Controlling Lock Mastership By Modifying VMS Data Structures
- Problem (for all methods): Once PE1 is set to zero to allow the desired lock tree to migrate, other lock trees may also migrate, unwanted
- Solution: To prevent this, in all other resource trees mastered on this node:
  - Clear the RM_PEND flag in L_STATUS if set, and
  - Set W_OACT and W_NACT to the maximum (hex FFFF)
  - Zero W_NMACT, L_RM_CSID, B_LSTCSID_IDX, and B_SAME_CNT

Controlling Lock Mastership By Modifying VMS Data Structures
- Pros:
  - Does the job reliably
  - Can avoid other resource trees "escaping"
- Cons:
  - High-IPL code presents some level of risk of crashing a system
- See example program REMASTER.MAR
- One might instead use (in 7.2-2 & above):
    SDA> LCK REMASTER
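Purely as an illustration of what steps 7 and 8 above do to the root RSB, here is a conceptual C fragment. The structure here is a simplified stand-in containing only the fields named above, with a made-up flag value; real code (like REMASTER.MAR) runs in kernel mode at elevated IPL holding the LCKMGR spinlock and uses the actual structure definitions.

/* Simplified stand-in for the root Resource Block fields named above
 * (illustration only; not the real RSB layout). */
struct rsb_fields {
    unsigned int   rm_csid;      /* RSB$L_RM_CSID      */
    unsigned char  lstcsid_idx;  /* RSB$B_LSTCSID_IDX  */
    unsigned char  same_cnt;     /* RSB$B_SAME_CNT     */
    unsigned short oact;         /* RSB$W_OACT         */
    unsigned short nact;         /* RSB$W_NACT         */
    unsigned short nmact;        /* RSB$W_NMACT        */
    unsigned int   status;       /* RSB$L_STATUS       */
};

#define RM_PEND 0x1              /* stand-in for the RSB$M_RM_PEND bit */

/* Step 7: make the target node look overwhelmingly more active than we are,
 * and mark a remaster operation as pending.  In real code this is done while
 * holding the LCKMGR spinlock in kernel mode (step 4), which is then released
 * so VMS can act on it (step 8). */
static void request_remaster(struct rsb_fields *rsb, unsigned int target_csid)
{
    rsb->rm_csid     = target_csid;          /* where the tree should go        */
    rsb->lstcsid_idx = target_csid & 0xFF;   /* low byte of the target CSID     */
    rsb->same_cnt    = 3;                    /* so the cost estimate is ignored */
    rsb->oact        = 0;                    /* local activity looks low        */
    rsb->nact        = 0;
    rsb->nmact       = 0xFFFF;               /* target activity looks enormous  */
    rsb->status     |= RM_PEND;              /* remaster operation now pending  */
}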
Causes of lock queues
- Program bug (e.g. not freeing a record lock)
- I/O or interconnect saturation
- "Deadman" locks

How to detect lock queues
- Using DECamds / Availability Manager
- Using SDA
- Using other methods

Lock contention & DECamds
- DECamds can identify lock contention if a lock blocks others for 15 seconds
- The AMDS$LOCK_LOG.LOG file in AMDS$SYSTEM: contains a log of occurrences of suspected contention
- The resource name decoding techniques shown earlier can sometimes be used to identify the file involved
- Deadman locks can be filtered out

Detecting Lock Queues with ANALYZE/SYSTEM (SDA)
- New qualifier added to the SHOW RESOURCE command in SDA for 7.2 and above:
  - SHOW RESOURCE/CONTENTION shows blocking and blocked lock requests
- New qualifier added to the SHOW LOCK command in SDA for 7.2 and above:
  - SHOW LOCK/WAITING displays blocked lock requests (but then you must determine what's blocking them)

Detecting Lock Queues with a program
- Traverse the lock database starting with the LCK$GQ_RRSFL listhead and following the chain of root RSBs via the RSB$Q_RRSFL links
- Within each resource tree, follow the RSB$Q_SRSFL chain to examine all sub-resources, recursively

Detecting Lock Queues with a program
- Check the Wait Queue (RSB$Q_WTQFL and RSB$Q_WTQBL)
- Check the Convert Queue (RSB$Q_CVTQFL and RSB$Q_CVTQBL)
- If queues are found, display:
  - Queue length(s)
  - Resource name
  - Resource names for all parent locks, up to the root lock
- See example DCL procedure LCKQUE.COM and program LCKQUE.MAR (a conceptual sketch of this traversal follows below)
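Again only as a conceptual C sketch (the real LCKQUE.MAR runs in kernel mode against the actual RSB layout, and the VMS queues are circular structures rather than the simple counts used here), the recursion described above might look like this:

#include <stdio.h>

/* Simplified stand-in for the RSB fields named above (illustration only). */
struct rsb {
    struct rsb  *sub;            /* first sub-resource (RSB$Q_SRSFL chain) */
    struct rsb  *sibling;        /* next resource at the same level        */
    unsigned     wait_queue_len; /* from RSB$Q_WTQFL / RSB$Q_WTQBL         */
    unsigned     cvt_queue_len;  /* from RSB$Q_CVTQFL / RSB$Q_CVTQBL       */
    const char  *resnam;
};

/* Recursively check a resource and all of its sub-resources for non-empty
 * wait or convert queues, printing any that are found. */
static void check_tree(const struct rsb *r, int depth)
{
    for (; r != NULL; r = r->sibling) {
        if (r->wait_queue_len || r->cvt_queue_len)
            printf("%*s%s  Convert queue: %u, Wait queue: %u\n",
                   depth * 2, "", r->resnam, r->cvt_queue_len, r->wait_queue_len);
        check_tree(r->sub, depth + 1);      /* descend into sub-resources */
    }
}

int main(void)
{
    /* Fabricated sample: a file-serialization sub-lock with a long wait queue,
     * standing in for the directory example shown below. */
    struct rsb ser = {    0, 0, 95, 0, "F11B$s (file serialization)" };
    struct rsb vol = { &ser, 0,  0, 0, "F11B$v (volume allocation)"  };
    check_tree(&vol, 0);
    return 0;
}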
Example: Directory File Grows Large
- Symptom: High queue length on the file serialization lock for a .DIR file
- Cause: The directory file has grown to over 127 blocks (VMS version 7.1-2 or earlier; 7.2 and later are much less sensitive to this problem)
- Fix: Delete or rename files out of the directory

Lock Queue Program Example
Here are examples where a directory file got very large under 7.1-2:
  'F11B$vAPP2        '  202020202020202032505041762442313146
  Files-11 Volume Allocation lock for volume APP2
  'F11B$sH...'          00000148732442313146
  Files-11 File Serialization lock for file [328,*,0] on volume APP2
  File specification: DISK$APP2:[]DATA.DIR;1
  Convert queue: 0, Wait queue: 95

  'F11B$vLOGFILE     '  2020202020454C4946474F4C762442313146
  Files-11 Volume Allocation lock for volume LOGFILE
  'F11B$s....'          00000A2E732442313146
  Files-11 File Serialization lock for file [2606,*,0] on volume LOGFILE
  File specification: DISK$LOGFILE:[000000]LOGS.DIR;1
  Convert queue: 0, Wait queue: 3891

Example: Fragmented File Header
- Symptom: High queue length on the File Serialization Lock for an application data file
- Cause: CONVERTs onto a disk without sufficient contiguous space resulted in highly fragmented files, increasing the I/O load on the disk array. The file was so fragmented it had 3 extension file headers
- Fix: Defragment the disk, or do an /IMAGE Backup/Restore

Lock Queue Program Example
Here's an example of the result of reorganizing RMS indexed files with $CONVERTs over a weekend without enough contiguous free space available. This caused a lot of file fragmentation and dramatically increased the I/O load on a RAID array on the next busy day (we had to fix this with a backup/restore cycle soon after). The file shown here had gotten so fragmented as to have 3 extension file headers. The lock we're queueing on here is the file serialization lock for this RMS indexed file:
  'F11B$s....'          0000000E732442313146
  Files-11 File Serialization lock for file [14,*,0] on volume THDATA
  File specification: DISK$THDATA:[TH]OT.IDX;1
  Convert queue: 0, Wait queue: 28

Future Directions for this Investigation Work
- Concern: Locking down remastering with PE1 (to avoid lock mastership thrashing) can result in sub-optimal lock master node selections over time

Future Directions for this Investigation Work
- Possible ways of mitigating the side-effects of preventing remastering with PE1:
  - Adjust the PE1 value as high as you can without producing noticeable delays
  - Upgrade to 7.2-2 or above for more-efficient remastering
  - Set PE1 to 0 for short periods, periodically
  - Raise the fixed threshold values in the VMS data cells LCK$GL_SYS_THRSH and particularly LCK$GL_ACT_THRSH
  - More-invasive automatic monitoring and control of remastering activity
  - Enhancements to VMS itself

How to measure pent-up remastering demand
- While PE1 is set to prevent remastering, sub-optimal lock mastership may result
- VMS will "want" to move some lock trees but cannot
- See example procedure LCKRM.COM and program LCKRM.MAR, which measure pent-up remastering demand

How to measure pent-up remastering demand
- LCKRM example:
  Time: 16:19
  ----- XYZB12: -----
  'RMS$..I....SS1    ...'  000000202020202020202020315353020000084900B424534D52
  RMS lock tree for file [180,2121,0] on volume SS1
  File specification: DISK$SS1:[PDATA]PDATA.IDX;1
  Pent-up demand for remaster operation is pending to node XYZB18 (CSID 00010031)
  Last CSID Index: 34, Same-count: 0
  Average lock rates: Local 44, Remote 512
  Status bits: RM_PEND

Interrupt-state/stack saturation
- Too much lock mastership workload can saturate the primary CPU on a node
- See this with MONITOR MODES/CPU=0/ALL

Interrupt-state/stack saturation
- FAST_PATH can shift interrupt-state workload off the primary CPU in SMP systems
  - An IO_PREFER_CPUS value of an even number disables CPU 0 use
  - Consider limiting interrupts to a subset of the non-primary CPUs
  - FAST_PATH for CI since 7.0
  - FAST_PATH for MC: "never"
  - FAST_PATH for SCSI and FC is in 7.3 and above
  - FAST_PATH for LANs (e.g. FDDI & Ethernet) is slated for 7.3-1
- Even with FAST_PATH enabled, CPU 0 still receives the device interrupt, but hands it off immediately via an interprocessor interrupt
  - 7.3-1 is slated to allow FAST_PATH interrupts to bypass CPU 0 entirely and go directly to a non-primary CPU

Dedicated-CPU Lock Manager
- With 7.2-2 and above, you can choose to dedicate a CPU to do lock management work. This may help reduce MP_SYNC time.
- LCKMGR_MODE parameter:
  - 0 = Disabled
  - >1 = Enable if at least this many CPUs are running
- The LCKMGR_CPUID parameter specifies which CPU to dedicate to the LCKMGR_SERVER process

Example programs
- Programs referenced herein may be found:
  - On the VMS Freeware V5 CD, under directories [KP_LOCKTOOLS] or [KP_CLUSTERTOOLS]
  - or on the web at:
    http://www.openvms.compaq.com/freeware/freeware50/kp_clustertools/
    http://www.openvms.compaq.com/freeware/freeware50/kp_locktools/
- New additions & corrections may be found at: http://encompasserve.org/~parris/

Example programs
- Copies of this presentation (and others) may be found at: http://www.geocities.com/keithparris/

Questions?

Speaker Contact Info
- Keith Parris
- E-mail: [email protected] or [email protected] or [email protected]
- Web: http://encompasserve.org/~parris/ and http://www.geocities.com/keithparris/