Tivoli Workload Scheduler: Troubleshooting Guide

Fault-tolerant agent problems
The following problems could be encountered with fault-tolerant agents.
• “A job fails in heavy workload conditions”
• “Batchman and other processes fail on a fault-tolerant agent with the message AWSDEC002E”
• “Fault-tolerant agents unlink from mailman on a domain manager”
A job fails in heavy workload conditions
A job fails on a fault-tolerant agent where a large number of jobs are running
concurrently and one of the following messages is logged:
• “TOS error: No space left on device.”
• “TOS error: Interrupted system call.”
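Before you change any configuration, you can confirm that a failing agent matches this symptom by counting the two messages in the agent's log. The following Python sketch assumes that the log is a plain-text file that can be read line by line. The names `TOS_MESSAGES` and `count_tos_errors` are hypothetical, and choosing which log file to scan is left to you.

```python
from collections import Counter
from typing import Iterable

# Message texts quoted in this guide for the heavy-workload job failure.
TOS_MESSAGES = {
    "no_space": "TOS error: No space left on device",
    "interrupted": "TOS error: Interrupted system call",
}


def count_tos_errors(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known TOS error message."""
    counts: Counter = Counter()
    for line in lines:
        for name, text in TOS_MESSAGES.items():
            if text in line:
                counts[name] += 1
    return counts
```

For example, open a log file with `open(path, encoding="utf-8", errors="replace")` and pass the file object to `count_tos_errors`. A non-zero count for either key shows that the agent logged that message. A substring match is used instead of a regular expression because both message texts are fixed strings.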
Cause and solution:
This problem might indicate that one or more of the CCLog properties has been
inadvertently reset to the default values used in an earlier version, which could
occasionally degrade performance.
See “Tivoli Workload Scheduler logging and tracing using CCLog” on page 14 and
check that the CCLog properties file contains the default values indicated there
for the properties twsHnd.logFile.className and twsloggers.className.
If the correct default values are being used, contact IBM Software Support to
address this problem.
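You can script the property check described above. The following Python sketch assumes that the CCLog properties file uses simple `key=value` or `key: value` lines, as Java properties files commonly do, with no backslash-continued values or escape sequences. The sketch does not know the correct default values. You must fill in the `expected` dictionary from the CCLog section referenced above. The functions `parse_properties` and `find_mismatches` are hypothetical helpers, not part of the product.

```python
from typing import Dict, Iterable, List, Tuple


def parse_properties(lines: Iterable[str]) -> Dict[str, str]:
    """Minimal parser for simple 'key=value' or 'key: value' lines.

    Assumption: no multi-line (backslash-continued) values and no escapes.
    """
    props: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comment lines
        # Split on the first '=' or ':' separator, whichever comes first.
        positions = [i for i in (line.find("="), line.find(":")) if i != -1]
        if not positions:
            props[line] = ""  # a key with no value
            continue
        sep = min(positions)
        props[line[:sep].strip()] = line[sep + 1:].strip()
    return props


def find_mismatches(
    props: Dict[str, str], expected: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (key, actual, expected) for each key whose value differs.

    Fill 'expected' with the documented defaults for
    twsHnd.logFile.className and twsloggers.className.
    """
    mismatches = []
    for key, want in expected.items():
        have = props.get(key, "<missing>")
        if have != want:
            mismatches.append((key, have, want))
    return mismatches
```

To use the sketch, read the file inside a `with open(path) as f:` block, pass `f` to `parse_properties`, and then pass the result and your dictionary of documented defaults to `find_mismatches`. An empty list means that both properties match the values you supplied.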
Batchman and other processes fail on a fault-tolerant agent with the message AWSDEC002E
The batchman process fails together with all other processes that are running on
the fault-tolerant agent, typically mailman and jobman (and JOBMON on Windows
2000). The following errors are recorded in the stdlist log of the fault-tolerant
agent:
AWSBCV012E Mailman cannot read a message in a message file.
The following message gives more detail about the error:
AWSDEC002E An internal error has occurred. The following UNIX system error occurred on an events file: "9" at line = 2212
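The AWSDEC002E message reports the UNIX system error only as a number. The following Python sketch extracts that number and the line value from the message text, then looks up the symbolic errno name on the machine where the script runs. On most UNIX systems, error 9 is EBADF (bad file descriptor). However, errno numbering is platform-specific, so treat the lookup as a hint and confirm it on the agent's own platform. The regular expression is an assumption based only on the message text quoted above, and `decode_awsdec002e` is a hypothetical helper.

```python
import errno
import os
import re
from typing import Optional, Tuple

# Matches the tail of the AWSDEC002E text quoted in this guide, for example:
#   ... error occurred on an events file: "9" at line = 2212
# DOTALL lets the match span a message that is wrapped across lines.
AWSDEC002E_RE = re.compile(
    r'AWSDEC002E.*?:\s*"(\d+)"\s+at line\s*=\s*(\d+)', re.DOTALL
)


def decode_awsdec002e(message: str) -> Optional[Tuple[int, str, str, int]]:
    """Return (errno, symbolic name, description, line), or None if no match.

    The name and description come from the errno table of the machine
    running this script, which might differ from the agent's platform.
    """
    match = AWSDEC002E_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code), int(match.group(2))
```

On a Linux host, passing the message quoted above returns the code 9 with the name EBADF and the line value 2212. The sketch only decodes the number. It does not by itself show whether Mailbox.msg is corrupted.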
Cause and solution:
The cause is corruption of the Mailbox.msg file, probably because the file is not
large enough to hold all the messages that needed to be written to it.
Consider whether the problem is likely to have been caused by the file overflowing:
• If you are sure that this is the cause, you can delete the corrupted message file.
All events lost: Following this procedure means that all events in the corrupted
message file are lost.
Perform the following steps:
