Dell™ OpenManage™ IT Assistant: Understanding Events -- How to Select Events for Monitoring
Enterprise Systems Group (ESG)
Dell™ OpenManage™
Systems Management
Dell White Paper
By Manoj Gujarathi
Systems Engineer
OpenManage Development
August 2001
Contents
Introduction
IT Assistant Event Management System (EMS) Overview
Event Categories/Types and Event Source Organization
  DMI Indications
  SNMP Traps
Understanding the Events in IT Assistant
  Cluster Events
  Environmental Events
    Cooling Device Events (Fans, Blowers)
    Temperature Sensor Events
  Memory Events
    Dell Instrumentation SNMP traps
    Dell Instrumentation DMI Indications:
  Network Events
  Operating System Events
  Other Events
    Events from Adaptec CI/O Agent
    Events from Qlogic Agent
    Instrumentation Events
  Power Events
    Battery Events
    Electric Current Events
    Voltage Events
    Power Supply Events
  Processor Events
  Security Events
  Software Events
    Events from Instrumentation Agents
    Events from IT Assistant
  Storage Events
Best Practices While Selecting the Events to be Monitored
Conclusion
Table 1: Agent Applications supporting SNMP and/or DMI
Table 2: Agent Applications supporting SNMP and/or DMI
Table 3: Environmental Event Agents
Table 4: Memory Event Agents
Table 5: ECC Single Bit Error Counts
Table 6: Network Event Agents
August 2001
Page 2
Dell Enterprise Systems Group
Table 7: Operating System Event Agents
Table 8: Other Events Agents
Table 9: Power Events Agents
Table 10: Processor Events Agents
Table 11: Security Events Agents
Table 12: Software Events Agents
Table 13: Storage Events Agents
Table 14: Critical Events in the Storage Categories
Table 15: Baseline Event Types Associated with Critical Failures
Section 1
Introduction
Dell OpenManage IT Assistant is a browser-based tool that monitors and
manages Dell servers, desktops, and portables using industry standard Simple
Network Management Protocol (SNMP), Desktop Management Interface (DMI),
and Common Information Model (CIM) protocols. IT Assistant provides a broad
set of features that are designed to help system administrators carry out
important system management operations in a heterogeneous environment of
Dell systems. The IT Assistant feature set includes system discovery and status
reporting, comprehensive event management, asset and inventory reporting,
remote system configuration, and storage management. (For complete details on
the features of IT Assistant, please refer to the IT Assistant User’s Guide.)
This paper addresses the following topics in detail:
• Predefined event categories in IT Assistant
• Detailed descriptions of the critical events (SNMP traps or DMI indications) logged by Dell agents
• Best practices for configuring events, and the key critical events that Dell recommends monitoring and acting on
This paper is written to help administrators monitor and manage Dell systems
that run different Dell-supported agents. The topics presented here shed light
on the IT Assistant Event Management System and go into detail on the
pre-populated events: what they mean and how to decide which events to select
for monitoring. This paper will also help administrators using Dell OpenManage
Connections gain more insight into the events logged by Dell agents.
Section 2
IT Assistant Event Management System Overview
The events described in this paper are based on the pre-populated categories in
the IT Assistant Event Management System (EMS). The EMS in IT Assistant is a
versatile and powerful tool that allows administrators to monitor specific
systems for certain types of events occurring at selected times of day, and to
take pre-defined actions when those events occur. EMS allows administrators to
create filters to monitor events from different Dell systems, including servers
as well as clients (the managed systems that generate the events). These filters
are configurable based on severity, source node name, and time period, and
administrators can associate actions with each filter. For more details,
including how to configure EMS to create filters and actions, please refer to the
Dell OpenManage white paper Configuring and Using the Dell OpenManage IT
Assistant Event Management System by Ross Burns.
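The filter criteria described above (severity, source node name, and time period) can be illustrated with a minimal Python sketch. This is not IT Assistant code: the class names, the three severity levels, and the convention that an empty node set means "any node" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical severity ordering; IT Assistant's own levels may differ.
SEVERITY_RANK = {"normal": 0, "warning": 1, "critical": 2}


@dataclass
class Event:
    source_node: str
    severity: str
    occurred_at: time


@dataclass
class EventFilter:
    min_severity: str
    source_nodes: frozenset  # assumption: an empty set means "any node"
    start: time
    end: time

    def matches(self, event: Event) -> bool:
        # Severity check: ignore events below the configured level.
        if SEVERITY_RANK[event.severity] < SEVERITY_RANK[self.min_severity]:
            return False
        # Source node check.
        if self.source_nodes and event.source_node not in self.source_nodes:
            return False
        # Time period check; a window such as 18:00-06:00 wraps past midnight.
        if self.start <= self.end:
            return self.start <= event.occurred_at <= self.end
        return event.occurred_at >= self.start or event.occurred_at <= self.end
```

An action (for example, sending a notification) would then be run only for events where `matches` returns true.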
Section 3
Event Categories/Types and Event Source Organization
The Event Categories in IT Assistant EMS allow users to view the pre-defined
event categories and the Event Types for all the events (traps and indications)
pre-populated in IT Assistant. The category names are: Cluster,
Environmental, Memory, Network, Operating System, Other, Power, Processor,
Security, Software, and Storage. The Event Types are the actual events,
generated by one or more agents, that IT Assistant can monitor. Users can rename
an event type, or even rename the predefined categories, for simplicity and
customization. Note: After a name change, users must remove the Event Type from
the filter and add it again.
The events can consist of SNMP traps and DMI indications. They are called
‘Events’ or ‘Alerts’ interchangeably throughout the IT Assistant documents and
user interface screens. To find the source of an event, select the event type
and click the ‘Edit’ button. Different agents can generate the same event with
different types; the type, SNMP or DMI, indicates whether the event is an SNMP
trap or a DMI indication.
DMI Indications
In order for IT Assistant to receive DMI events, the managed node has to
“register” with the IT Assistant management station. This registration happens
automatically, through the Remote Procedure Call (RPC) mechanism, when IT
Assistant discovers the node.
SNMP Traps
Unlike DMI events, SNMP events require more than discovery of the managed node
by IT Assistant. Users must configure the SNMP service on the managed node to
create a community, discover that node through IT Assistant using that community
name, and create trap destinations so that the IT Assistant management system
receives the traps. Note: The SNMP service must be restarted for the change to
take effect.
Table 1 provides the list of agents sending indications/traps to IT Assistant.
Table 1: Agent Applications supporting SNMP and/or DMI
Agent Application    SNMP Traps Support    DMI Indication Support
Broadcom Agent
DMTF
Dell Array Manager Agent
Dell DRAC2 Card and Agent
Dell OpenManage Client Instrumentation
Dell OpenManage HIP
Dell OpenManage Server Agent
Dell OpenManage IT Assistant
Dell Remote Assistant Server
Fiber Channel Switch Agent
Giganet Agent
Netware
NuView ClusterX and Veritas ClusterX Agent
RAID Agent (PERC and PERC2)
SCSI Agent (CIO)
SNMP Agent Traps
Veritas ClusterX Agent
Windows
(only physical container global table)
Adaptec CI/O Agent
Dell OpenManage Client Instrumentation
Intel NIC Instrumentation
Qlogic Agent
Symbios Agent
The agents in the above table are either applications by themselves, or installed by one or more Dell
applications.
Distributed Management Task Force (DMTF) Tables include -- Cooling Device, Disk Controller,
Disks, Electrical Current Probe, Indications, Logical Memory, Mass Storage Logical Drives,
Motherboard, Physical Container, Physical Memory Area, Portable Battery, Power Supply, Power
Unit, Processor, Structure Dependency, System Cache, System Hardware Security, System Reset,
Temperature Probe, UPS Battery, and Voltage Probe.
For event monitoring, certain agents are managed only through DMI (e.g. Dell
OpenManage Client Instrumentation), only through SNMP (e.g. Dell Array Manager
Agent), or through both (e.g. Dell OpenManage Server Agent). The DRAC2 Agent
supports in-band SNMP (events originating from the system), while the DRAC2
Card supports out-of-band SNMP (events originating from the card itself).
For more details on the agent versions and IT Assistant versions supporting
those events, see the Dell OpenManage white paper Configuring and Using the
Dell OpenManage IT Assistant Event Management System by Ross Burns. Also
see the IT Assistant Database Management Utility (dcdbmng.exe), shipped with
IT Assistant, for how the events are pre-populated.
Section 4
Understanding the Events in IT Assistant
There are close to 800 events in the IT Assistant database, so it can be very
difficult to determine what each event type means and which events to monitor.
This is especially difficult in event categories where the event type names are
similar and the events differ only slightly. This section addresses this topic
and provides details organized by the pre-existing categories defined in IT
Assistant.
The following are some important points to remember about the pre-populated
event types:
• In the event descriptions, more focus is put on the events related to Dell Instrumentation Agents and, where possible, the event type (trap or indication) is mentioned.
• The currently shipping Dell Server Instrumentation is Dell OpenManage Server Agent (Version 4.3), while the earlier agent was called Dell OpenManage Hardware Instrumentation Package (HIP). Dell OpenManage Client Instrumentation (OMCI) is for client systems only.
• The Dell OpenManage Server Agent instrumentation events up to version 4.3 are supported through SNMP as well as DMI. In the upcoming version 4.4, only SNMP is supported. OMCI 5.x and 6.0 events are supported through DMI only.
• Certain event types starting with DMTF are DMI indications converted to SNMP traps by the DMI-to-SNMP mapper. There is more information on these events in the DMTF documents.
• To find the description attached to each event type and the source name, select that event type and click the ‘Edit’ button.
• Because of the large number of pre-populated events in IT Assistant, only the important/critical events for monitoring are described in this paper.
In the following sections, the important events under each category are
examined.
Cluster Events
The events in this category are generated by the agents listed in Table 2, along
with the type of event.
Table 2: Agent Applications supporting SNMP and/or DMI
Event Sources                                                          Event Types
Dell OpenManage Cluster Assistant with ClusterX Application v. 2.x     SNMP traps
(Source Names: NuView ClusterX)
Dell OpenManage Cluster Assistant with ClusterX Application v. 3.x     SNMP traps
(Source Names: Veritas ClusterX)
The events in this category are SNMP traps generated by the Dell OpenManage
Cluster Assistant with ClusterX Application. The events show the event source
as NuView ClusterX or Veritas ClusterX, depending on whether the trap is
generated by Dell OpenManage Cluster Assistant version 2.x or 3.x, respectively.
The event types starting with ‘WLBS’ (Windows Load Balancing Service) are not
generated by the basic Dell OpenManage Cluster Assistant with ClusterX
Application; they become available if you upgrade to the ClusterX application
from Veritas. The following are the details on the critical events:
• Failure of a node in an MSCS cluster / Failure of an MSCS cluster – These events are generated when a node in a Microsoft Cluster Server (MSCS) cluster fails (e.g. because of a system crash), or when the whole cluster is down (e.g. because the storage system is down), respectively.
• Failure of a resource detected – Generated when a resource such as a disk, or an application such as Exchange, fails.
• Detection of a failure of the private or public cluster interconnects – Generated when a public or private interconnect, such as the cluster heartbeat, fails.
• Detected the cluster service wrote a critical event to the NT event log – Generated when a monitored resource writes a critical event to the NT event log. Monitor this event to get critical updates from the cluster resources.
Please refer to the Dell OpenManage Cluster Assistant with ClusterX Application
documents for additional details.
Environmental Events
The events in this category are generated by the agents listed in Table 3, along
with the type of event.
Table 3: Environmental Event Agents
Event Sources                        Event Types
DMTF (through mapper)                SNMP traps
Dell OpenManage HIP/Server Agent     SNMP traps, DMI indications
DRAC                                 SNMP traps
This category consists of the system environmental events related to cooling
devices (fans, blowers) and to temperature, current, and voltage sensors/probes.
Dell Instrumentation Agents and DRAC agents generate the events. The following
events are categorized based on the failing device.
Cooling Device Events (Fans, Blowers)
• Cooling Device Failure, Warning, Normal – These events occur when the fan sensor exceeds its failure or warning threshold for one or more devices. A normal event is logged when the fan sensor for one or more devices returns to a valid range after crossing the warning or failure threshold. Note: The normal event is logged by Server Agent only; for the HIP agent, you may need to monitor ‘Fan Failure returned to Normal’ or ‘Fan Warning returned to Normal’.
• Fan Enclosure insertion/removal traps can be monitored to discover interventions in the system fan assembly. Only certain Dell servers, including the PowerEdge 4350, 6350 and 6450, support Fan Enclosure Extended Removal, and the system may shut down as a result of it.
The above are SNMP traps, and details such as device location and readings are provided
with the traps.
• Cooling Device Status Change events for fans, e.g. Cooling Device Status Change – Critical (Fan), can be monitored to detect a change in fan status when using Server instrumentation through DMI.
Temperature Sensor Events
• Temperature Failure/Warning, Temperature Failure/Warning returned to Normal – Occur when a temperature sensor on the backplane board, system board or drive carrier in the specified system exceeds its failure/warning threshold. Crossing the failure threshold could cause the system to shut down. A normal event is logged when the temperature sensor returns to a valid range after crossing such a threshold.
The above events are SNMP traps, and details such as location and the sensor readings
are provided with the trap.
• Temperature Fault – Critical, Non Critical, Non Recoverable – These are DMI indications, generated by Dell server instrumentation, that are equivalent to the traps described above.
DRAC2 temperature events are generated by the DRAC2 agent and can be
monitored if the managed node has a DRAC2 card configured.
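The threshold behavior described for these sensor events can be sketched in Python as follows. This is an illustration only, not agent code: the function names and threshold values are hypothetical, and the handling of a direct step from failure down to warning is an assumption, since the paper does not describe it.

```python
def sensor_state(reading, warning_threshold, failure_threshold):
    """Classify a reading against upper warning and failure thresholds."""
    if reading > failure_threshold:
        return "Failure"
    if reading > warning_threshold:
        return "Warning"
    return "Normal"


def temperature_events(readings, warning_threshold, failure_threshold):
    """Return the event names produced by a sequence of sensor readings.

    An event is emitted only when the state changes. Assumption: a move
    from Failure straight to Warning is reported as a plain Warning event.
    """
    events = []
    state = "Normal"
    for reading in readings:
        new_state = sensor_state(reading, warning_threshold, failure_threshold)
        if new_state == state:
            continue
        if new_state == "Normal":
            events.append(f"Temperature {state} returned to Normal")
        else:
            events.append(f"Temperature {new_state}")
        state = new_state
    return events
```

With hypothetical thresholds of 60 and 70, the readings 40, 62, 75, 45 would produce a warning event, a failure event, and then a "returned to Normal" event for the failure.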
Memory Events
The events in this category are generated by the agents listed in Table 4, along
with the type of event.
Table 4: Memory Event Agents
Event Sources                              Event Types
DMTF (through mapper)                      SNMP traps
Dell OpenManage HIP/Server Agent           SNMP traps, DMI indications
Dell OpenManage Client Instrumentation     DMI indications
As the category name suggests, the events populated here are related to the
system memory and are generated by Dell Instrumentation and DMTF tables.
Dell Server Instrumentation generates traps as well as indications. If you
monitor servers through only one of SNMP or DMI, you can monitor just the
corresponding subset of these events to avoid duplication and confusion.
Dell Instrumentation SNMP traps
Memory Device Warning/Failure/Non Recoverable – These are
memory ECC error traps. They occur when the memory device pre-failure
sensor (which monitors memory modules and detects when memory is about to
fail) exceeds the warning/critical/non-recoverable thresholds. These thresholds
are based on ECC single bit error counts and are listed in Table 5.
Table 5: ECC Single Bit Error Counts
ECC single bit error count more than:     Event
2                                         Memory Device Warning
10                                        Memory Device Failure (Critical)
20                                        Memory Device Non Recoverable
Please note that a Memory Device Non Recoverable event does not mean that
system memory has stopped functioning. It is a signal that there
is a severe problem with the memory, or the hardware or software using it.
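The mapping in Table 5 from ECC single bit error counts to events can be expressed as a short Python sketch. The function and constant names are hypothetical; only the three thresholds and event names come from the table, and "more than" is read as a strict comparison.

```python
# Thresholds from Table 5, checked from most to least severe.
ECC_THRESHOLDS = [
    (20, "Memory Device Non Recoverable"),
    (10, "Memory Device Failure (Critical)"),
    (2, "Memory Device Warning"),
]


def memory_event_for(ecc_single_bit_errors):
    """Return the event name for an error count, or None below all thresholds."""
    for limit, event_name in ECC_THRESHOLDS:
        if ecc_single_bit_errors > limit:
            return event_name
    return None
```

Under this reading, a count of exactly 10 still maps to the warning event, and a count of 11 maps to the failure event.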
Dell Instrumentation DMI Indications:
• Memory ECC Errors are mapped from the DMTF SystemChassisExtension table, while Memory Errors are mapped from the DMTF Physical Memory Array table. The Memory Errors event types are the equivalents of the ECC error SNMP events described above.
• Memory size increased or decreased – Dell OpenManage Client Instrumentation logs these events only when a change in memory size is detected.
Network Events
The events in this category are generated by the agents listed in Table 6, along
with the type of event.
Table 6: Network Event Agents
Event Sources                 Event Types
DMTF (through mapper)         SNMP traps
Intel NIC Instrumentation     DMI indications
Giganet CLAN agent            SNMP traps
Broadcom Agent                SNMP traps
SNMP Agent                    SNMP traps
The following are the important Intel NIC events that can be monitored to get
critical updates on NIC status.
• Adapter initialization failure – Failure to open a handle to the adapter miniport driver because of an initialization failure.
• Intel NIC Link Down – Generated when the network media state is disconnected.
• LAN Controller hardware Failure – Generated when the hardware status is not ready (because of a failure).
Note: Intel NIC Link Down, Line Down and Cable unplugged/No LAN
activity are the same event. Also note that the S/W error event is no longer
generated.
• The following Intel NIC event types relate to NIC teaming and need to be monitored if you have a teamed NIC configuration: The last Adapter has lost link. Network connection has been lost; Preferred Primary Adapter has been detected; The team only has one active adapter; Preferred Primary Adapter has taken over.
• NIC Failover Event – Generated by the Broadcom NIC agent. The event types starting with ‘CLAN’ are generated by the Giganet CLAN agent.
Please refer to the documents related to the Intel NIC, Broadcom, Giganet
CLAN and SNMP agents for more information.
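Because the note above says that three Intel NIC event names describe the same event, a monitoring script might fold them into one name before counting or alerting. The Python sketch below illustrates that idea; the choice of ‘Intel NIC Link Down’ as the canonical name and the helper names are assumptions.

```python
# Event names that the paper describes as the same event, mapped to one
# canonical name (the canonical choice is an assumption).
INTEL_NIC_ALIASES = {
    "Line Down": "Intel NIC Link Down",
    "Cable unplugged/No LAN activity": "Intel NIC Link Down",
}


def canonical_event_type(name):
    """Map an alias to its canonical event type; other names pass through."""
    return INTEL_NIC_ALIASES.get(name, name)


def distinct_event_types(names):
    """Return canonical event types in first-seen order, without duplicates."""
    seen = []
    for name in names:
        canonical = canonical_event_type(name)
        if canonical not in seen:
            seen.append(canonical)
    return seen
```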
Operating System Events
The events in this category are generated by the agents listed in Table 7, along
with the type of event.
Table 7: Operating System Event Agents
Event Sources     Event Types
SNMP Agent        SNMP traps
Windows OS        SNMP traps
There are only two event types in this category, SNMP Cold Start and
SNMP Warm Start. These are generic SNMP events logged by the SNMP
Agent, by Windows OS, and by Linux OS (in ITA 6.1). A Cold Start trap signifies
that the sending protocol entity is reinitializing itself such that the agent's
configuration or the protocol entity implementation may be altered; this trap is
mostly generated by a system crash or restart.
A Warm Start trap is generated when SNMP reinitializes without altering the
agent configuration, mostly because of a normal restart.
In some cases it is important to monitor the ‘cold start’ trap to detect
inadvertent re-initializations.
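In SNMPv1 (RFC 1157), cold start and warm start are two of the standard generic-trap values, coldStart(0) and warmStart(1). The Python sketch below lists the generic-trap codes and flags only cold starts for attention, following the suggestion above. The function name and the policy of flagging only cold starts are illustrative assumptions, not IT Assistant behavior.

```python
from enum import IntEnum


class GenericTrap(IntEnum):
    """SNMPv1 generic-trap values as defined in RFC 1157."""
    COLD_START = 0
    WARM_START = 1
    LINK_DOWN = 2
    LINK_UP = 3
    AUTHENTICATION_FAILURE = 4
    EGP_NEIGHBOR_LOSS = 5
    ENTERPRISE_SPECIFIC = 6


def is_unexpected_restart_candidate(generic_trap: int) -> bool:
    """Flag cold starts, which may indicate a crash or unintended restart."""
    return GenericTrap(generic_trap) is GenericTrap.COLD_START
```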
Other Events
The events in this category are generated by the agents listed in Table 8, along
with the type of event.
Table 8: Other Events Agents
Event Sources                        Event Types
DMTF (through mapper)                SNMP traps
Adaptec CI/O                         DMI indications
Qlogic                               DMI indications
Dell OpenManage HIP/Server Agent     SNMP traps, DMI indications
DRAC                                 SNMP traps
The remaining events that don’t fit into other pre-defined categories are included
in this category. The events consist of Qlogic NIC events, Adaptec CI/O events
and some Dell Instrumentation events.
Events in this category are described according to the source agent. This will
help you decide whether to monitor these events, depending on whether you are
using that agent on the managed node.
Events from Adaptec CI/O Agent
• Bus Port Error – This event occurs because of errors in a bus port, which is the attachment point for the devices connecting to the bus.
• Enclosure CI/O Event – This event occurs because of an error event regarding the entity’s enclosure devices.
• Existing Object is Gone, Existing Object Replaced -- These events are associated with Mass Storage Association, a DMTF group defining the relationship between various components of the storage system.
• Volume Set Events -- All these events are defined in the DMTF Volume Set group. A volume set is a contiguous block of logical block addresses for reading and writing user data.
These events are DMI indications. Please refer to the DMTF documents for more
details on these events.
Events from Qlogic Agent
Adapter Error, Adapter Warning, Unknown Adapter Event -- These are the critical errors to be monitored if the node is using the Qlogic agent.
These events are DMI indications.
Instrumentation Events
• Container Security Breach, Logical Device Status Change, Physical Device Status Change – These events relate to status changes in system containers such as the chassis, sub-chassis, and expansion chassis, and are mapped from the DMTF Physical Container Global Table. All these events are DMI indications.
• Redundancy Degraded – The redundancy unit sensor in the main chassis detected that one of the units of redundancy has failed, but the overall unit is still redundant.
• Redundancy Lost – Generated when one of the components in the redundancy unit is disconnected, has failed, or is not present. You can monitor both of these events for fans, power supplies, etc. All redundancy events are SNMP traps. Note: The redundancy units in the system could be power supply, fan, AC cord, etc. Add this event when monitoring these components.
• Thermal Shut down – Generated when the system is configured for thermal shutdown and an error event occurs, such as a temperature sensor exceeding the error threshold.
Power Events
The events in this category are generated by the agents listed in Table 9, along
with the type of event.
Table 9: Power Events Agents
Event Sources                        Event Types
DMTF (through mapper)                SNMP traps
Dell OpenManage HIP/Server Agent     SNMP traps, DMI indications
DRAC2                                SNMP traps
RAID                                 SNMP traps
This important category consists of many different events related to the power
supply, battery, voltage, current, and temperature, coming from all Dell
instrumentation agents, Dell Remote Assistant agents, and Dell RAID agents.
The events consist of SNMP traps as well as DMI indications.
Battery Events
• DMTF:Portable Battery Critical Combined Batteries Charge, DMTF:Portable Battery Maintenance Required -- These events are generated when the combined charge of all portable batteries in a system is running critically low, or when the battery is defective and needs maintenance.
• DMTF:UPS Battery Utility Power Lost System On Battery and DMTF:UPS Battery Utility Power Up System Off Battery -- These two events are generated when the primary power used by the system is lost and the system starts using the UPS battery, and when the power is back and the system stops using the UPS battery.
• Drac2 Battery Good – Associated with the Dell Remote Assistant Card battery condition; occurs when a battery with low charge is recharged above the specific threshold.
• RAID: Battery Events – If you use a Dell PowerEdge RAID Controller to monitor your storage devices, you can monitor these events.
The events described here are generated by DMTF components, DRAC agents, and
Dell PowerEdge RAID Controller (PERC) agents. Monitor these events if DRAC or
PERC is in use on the Dell systems.
All the above events are SNMP traps.
Electric Current Events
• Current Warning/Failure and Returned to Normal – The current sensor on the power supply exceeded its warning or failure threshold. A normal event is logged when the current sensor reading returns to normal after crossing such a threshold.
• Current Probe Non Recoverable – The current sensor detected a value from which it cannot recover.
The above events are SNMP traps, and additional details, such as location and readings,
are provided with these events.
Voltage Events
• Voltage Warning/Failure and Returned to Normal – Generated when the voltage sensor exceeds the warning or failure threshold. A normal event is logged when the reading returns to normal after crossing the threshold.
• Voltage Probe Non Recoverable – Generated when the voltage sensor in the specified system detects a value from which it cannot recover.
The above events are SNMP traps, and additional details, such as location and readings,
are provided with these events.
• Voltage Too High – (for the DRAC2 agent) This trap is sent each time a voltage channel reading for the Dell Remote Assistant Card goes out of critical range.
Power Supply Events
• Power Supply Failure, Power Supply Failure returned to Normal – Occurs when a power supply is disconnected or has failed. A normal event is logged when it returns to normal from such a state.
• Power Supply Lost Redundancy, Power Supply Redundancy Normal -- Occurs when one of the power supply components in the redundancy unit is disconnected, has failed, or is not present. A normal event is logged when it returns from such a state.
• Power Supply Degraded Redundancy, Power Supply Redundancy Normal -- The redundancy unit sensor in the main chassis detected that one of the power supply units has failed, but the overall power supply is still redundant.
IMPORTANT NOTE: These redundancy events are generated only by the Dell HIP
instrumentation agent. To monitor the redundancy events for any redundant unit
(including power supply) generated by Dell Server Agent instrumentation, please
look at the redundancy event types in the ‘Other’ category.
All the above events are SNMP traps, and additional details, such as location and
readings, are provided with these events.
• Power Supply Status Change events are all DMI indications. The status change events for a power supply, e.g. ‘Power Supply Status Change Critical (Power Supply)’, can be monitored to detect status changes in power supplies when using DMI.
• The AC Power Cord events are generated by the Dell Server Agent Instrumentation for redundant AC power cords associated with an AC failover switch. This feature is available only on certain Dell servers, such as the PE2500.
Processor Events
The events in this category are generated by the agents listed in Table 10, along
with the type of event.
Table 10: Processor Events Agents
Event Sources                              Event Types
DMTF (through mapper)                      SNMP traps
Dell OpenManage Client Instrumentation     DMI Indications, SNMP traps
There are no processor-related events generated by Dell Server Instrumentation.
The events in this category come from DMTF and the Dell OpenManage
Client Instrumentation agent. The critical events in this category are
explained below.
• DMTF: Motherboard processor failure -- Associated with the processor on the system motherboard.
• Processor Failure – Associated with any type of processor in the system.
• DMTF: Processor Configuration Error, Processor Initialization Failure – These are self-explanatory. Processor System Up is sent when the processor is initialized properly.
• Number of Processors Increased/Decreased, Processor Type Changed -- Number of Processors Increased/Decreased are SNMP as well as DMI events, while Processor Type Changed is DMI only. Only Client Instrumentation generates these events.
These events can be monitored to detect processor configuration and initialization
changes on nodes, though such changes should not occur frequently.
Security Events
The events in this category are generated by the agents listed in Table 11, along
with the type of event.
Table 11: Security Events Agents
Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | DMI Indications
SNMP Agent | SNMP traps
The event types described under this category are extremely important events to
monitor on remotely managed nodes.

• DMTF: Physical container configuration error – Generated when the chassis
or other physical container is not properly configured.
• DMTF: System Hardware Security Container Security Breach, Security Settings
Change – Critical, Security Settings Change – OK – Occur when the chassis
intrusion sensor detects a chassis intrusion while the system is in operation. A
normal event is generated when the intrusion sensor returns to normal.
• Security Settings Change – Non-Critical, Security Settings Change –
Non-Recoverable – These two events are the same, and the IT Assistant EMS is to
be updated to reflect this.
These events are generated by Dell OpenManage HIP/Server Agent through SNMP and
DMI.
• SNMP Community Name incorrect – If the ‘Send Authentication Trap’ check box
is selected under the ‘Security’ tab of the SNMP service properties page, this
event is generated when the SNMP agent receives a request with an incorrect
community name or a request that is not sent from an acceptable host.
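The condition behind the SNMP Community Name incorrect event can be sketched as follows. This is only an illustration of the rule described above, not the SNMP service's implementation: the `SnmpAgentConfig` class, its field names, the addresses, and the convention that an empty host list accepts any host are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnmpAgentConfig:
    """Hypothetical stand-in for the SNMP service 'Security' tab settings."""
    accepted_communities: frozenset
    accepted_hosts: frozenset        # assumption: empty set accepts any host
    send_authentication_trap: bool


def should_send_auth_trap(cfg, community, source_host):
    """Return True when a request would trigger the authentication trap."""
    bad_community = community not in cfg.accepted_communities
    bad_host = bool(cfg.accepted_hosts) and source_host not in cfg.accepted_hosts
    # The trap is only sent when the check box is selected.
    return cfg.send_authentication_trap and (bad_community or bad_host)


cfg = SnmpAgentConfig(
    accepted_communities=frozenset({"public"}),
    accepted_hosts=frozenset({"192.0.2.10"}),
    send_authentication_trap=True,
)
print(should_send_auth_trap(cfg, "private", "192.0.2.10"))  # wrong community
print(should_send_auth_trap(cfg, "public", "192.0.2.10"))   # valid request
```

With the check box cleared, the function returns False for every request, which matches the note that the event depends on that setting.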
Software Events
The events in this category are generated by the agents listed in Table 12, along
with the type of event.
Table 12: Software Events Agents
Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | DMI Indications
Dell OpenManage IT Assistant | SNMP traps
Dell OpenManage HIP/Server Agent/Client Instrumentation applications and the
Dell OpenManage IT Assistant itself generate these events. All the events in this
category are either related to system up status or system down status.
Events from Instrumentation Agents

• System Up – Critical, Non-Critical and Non-recoverable – All of these are DMI
indications generated by Dell instrumentation agents. Typically they are logged
when the system starts after a reboot, reset, or crash. The severity, and hence
the message text, depends on system health.
Note that System Up – Critical and System Up – Non-recoverable are the same
event with critical severity; you can monitor either one, depending on which
message you want to see.

Events from IT Assistant
The System Up and The System Down mssg (SNMP traps) from IT Assistant are the
only two events generated by the IT Assistant application itself. These events,
especially the System Down message, can be very important to monitor so that an
action is executed when a monitored system goes down. Note that it is IT
Assistant, and not the SNMP agent, that generates these traps. Even if the SNMP
agent is not configured correctly in terms of the community name or the trap
destination, you can still configure IT Assistant to send out system up and down
traps.
IT Assistant detects the status of a system during discovery. If IT Assistant
discovers that the system is powered off, the IT Assistant management station
(not the managed node, which is powered off) sends out the system down trap,
impersonating the managed node as the sender. This makes it possible to set up a
filter for the system up or down trap for that managed node and to associate an
action with it.
The trapconfig.cfg file installed with IT Assistant must be configured to receive
these events.
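The reason the impersonated sender matters can be shown with a small sketch. It is a simplified model under stated assumptions, not IT Assistant code: the `Trap` record, the function names, the addresses, and the action label are hypothetical, and only the System Down case is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trap:
    event_type: str
    sender: str  # address the trap appears to come from


def system_down_traps(discovered):
    """Build a System Down trap for each node that discovery found powered off.

    `discovered` maps a managed node address to True (powered on) or False
    (powered off). The sender is set to the managed node rather than the
    management station, mirroring the impersonation described above.
    """
    return [Trap("System Down", node)
            for node, powered_on in discovered.items() if not powered_on]


def actions_for(traps, filters):
    """Return the action of every (event_type, sender) filter that matches."""
    return [filters[(t.event_type, t.sender)]
            for t in traps if (t.event_type, t.sender) in filters]


# A per-node filter can only match because the trap carries the node's address.
filters = {("System Down", "192.0.2.21"): "page-on-call"}
traps = system_down_traps({"192.0.2.20": True, "192.0.2.21": False})
print(actions_for(traps, filters))  # prints ['page-on-call']
```

If the trap carried the management station's address instead, every node's System Down trap would look identical to the filter, and a per-node action could not be selected.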
Storage Events
The events in this category are generated by the agents listed in Table 13, along
with the type of event.
Table 13: Storage Events Agents
Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell Array Manager Agent | SNMP traps
RAID Agent (PERC, PERC2) | SNMP traps
Dell Remote Assistant Server | SNMP traps
Dell OpenManage Client Instrumentation | SNMP traps
SCSI Agent (CIO) | SNMP traps
Symbios Agent | DMI indications
The events in this category are logged mainly by the Dell Array Manager Agent
and Dell PERC agents. Monitor the events from whichever of these two agents
matches the storage agent application you are using. This category contains
events on array disks, battery backup units, consistency checks and their
progress, controllers, disks, enclosures, drives, mirrors, SMART events, RAID
drives, UPS, virtual disks, and so on. Because of the very high number of events
in this category, it isn’t possible to discuss all the important events here.
Most of the events are self-explanatory, and you can use the event subtype to be
selective in deciding which events to monitor. Table 14 shows the important
events for each subtype.
Table 14: Critical Events in the Storage Categories
Storage Category | Critical Event Types to Monitor
Array Disks | Array Disk Failed, Array Disk diagnostics failed, Array Disk Format Failed, Array Disk Initialize Failed, Array Disk Rebuild Failed
Consistency Check Events | Check Consistency Failed [3], Consistency Check Error On Logical Drive, Consistency Check Failed On Logical Drive, Consistency Check Failed On Physical Device Failure
Controllers | Container Failure, Container Failure [2], Controller Dead, Controller Firmware Mismatch, Internal Controller Hung, Internal Controller I960 Processor Specific Error, Internal Controller Strong-ARM Processor Specific Error, Storage Controller Error [SYMBIOS], Storage Device Error [SYMBIOS], System Disconnecting From Absent Controller
Disks | Device Failure [2], Enclosure Fan Error [2], Enclosure General Error [2], Enclosure Power Supply Error [2], Enclosure Temperature Abnormal [2], Enclosure Temperature Over User Threshold [2], Hard Disk Failure events, Hard Disk SCSI Bus Reset Failed, Hard Disk Write Recovery Failed
Enclosures | Storage Works Enclosure Failed
Drives | Error- Rebuild Of Logical Drive Failed, Logical Drive Critical, Logical Drive Initialization Failed, Mirror Drive Failure [2], Physical Drive Missing On Startup, Rebuild Of Logical Drive Failed
SMART Events | IDE SMART Pre-Failure [OMCI], SCSI SMART Pre-Failure [OMCI]
RAID Drives | Raid Drive Failed [2], Raid Failed On Lack Of Resource [2], RAID: Check Consistency Aborted [1,2], RAID: Initialize Aborted [1,2], RAID: Initialize Failed [1,2], RAID: Physical Drive State Failed [1,2], RAID: Logical Drive State Offline [1,2], RAID: Logical Drive State Degraded [1,2], RAID: Reconstruction Failed [1,2]
UPS | Uninterruptible Power Supply Failed
Virtual Disks | Virtual Disk Failed, Virtual Disk Format Failed, Virtual Disk Initialize Failed, Virtual Disk Rebuild Failed, Virtual Disk Reconfig Failed, WARM BOOT Failed, Write Back Error
Other | Expand Capacity Stopped with error, Error- Rebuild Stopped, Fan Failure, Initialization Canceled, Initialization Failed, Installation Aborted, Over Temperature, Possible Data Loss, Power Supply Failure, SCSI Command Abort, Server Lost Connection Or Down, Temperature Over Safe Limit
[1] PERC Agent
[2] PERC2 Agent
[3] OpenManage Array Manager, PERC and PERC2 agents
[OMCI] OpenManage Client Instrumentation
[SYMBIOS] Symbios Agent
Section 5
Best Practices While Selecting Events to be Monitored
If the administrator wants to monitor only the important events for the selected
Dell agents and carry out a particular action, it is very important to understand
which of the pre-populated events in the IT Assistant database are critical. Some
important points to remember follow:

• Whenever possible, monitor either SNMP traps or DMI indications, not both.
This avoids duplicate events and the confusion that can arise from different
wording of the same event or from the common subset of events.
• When creating custom events for monitoring, make sure the SNMP OIDs or
DMI source definitions (such as Associated Group and Event Type) are not the
same as those of any existing event. This matters even if you select only the
custom event for monitoring and leave the pre-populated duplicate unselected:
because of the way the IT Assistant filtering criteria work, the custom event you
created and selected can be ignored. Remember that you can rename the
pre-populated events and change their message text to suit your environment.
• The event types starting with ‘DMTF’ are DMTF indications converted into
SNMP traps. Note that these events are not supported by Dell OpenManage
Server Agent instrumentation. The Dell Hardware Instrumentation Package agent
and Dell OpenManage Client Instrumentation support a subset of these events.
• While monitoring events, you don’t need to know the SNMP OIDs or DMI
Associated Groups; you need these only when creating custom events. You can
view the OIDs or group names by back-referencing the event type.
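The duplicate check recommended for custom events can be sketched as a comparison of source definitions. This is a minimal illustration, assuming events are held as plain dictionaries; the field names, the OID under the placeholder enterprise number 99999, and the group name are made up for the example and are not taken from the IT Assistant database.

```python
def source_key(event):
    """Return the part of an event definition that identifies its source."""
    if event["kind"] == "snmp":
        return ("snmp", event["oid"])
    return ("dmi", event["associated_group"], event["event_type"])


def conflicting_events(custom, existing):
    """Names of existing events whose source definition equals the custom one."""
    return [e["name"] for e in existing if source_key(e) == source_key(custom)]


# Hypothetical pre-populated events; the OID and group values are placeholders.
existing = [
    {"name": "Cooling Device Failure", "kind": "snmp",
     "oid": "1.3.6.1.4.1.99999.1.1"},
    {"name": "Memory Errors", "kind": "dmi",
     "associated_group": "Example Group", "event_type": 1},
]
custom = {"name": "Fan alert (custom)", "kind": "snmp",
          "oid": "1.3.6.1.4.1.99999.1.1"}
print(conflicting_events(custom, existing))  # prints ['Cooling Device Failure']
```

A non-empty result suggests renaming the pre-populated event and editing its message text instead of creating a new custom event.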
Table 15 shows the important events associated with some critical failures in the
system. Unless stated otherwise, these events are SNMP traps.
Table 15: Baseline Event Types Associated with Critical Failures
Critical Failure Type | Event Category | Baseline Event Type to monitor
System Fan Failure | Environmental | Cooling Device Failure
System Fan in Critical Status | Environmental | Cooling Device Malfunction (DMI only), Cooling Device Status Change - Critical (Fan) (DMI only)
Power Supply Failure | Power | Power Supply Failure
Power Supply (or Fan) Redundancy Lost | Other | Redundancy Lost (for Server Agent Instrumentation; this is for any redundant unit, such as a fan or the AC cords for the power switch), Power Supply Lost Redundancy (only by HIP Instrumentation)
Memory Failure | Memory | Memory Device Failure (note that this really means that the memory device pre-failure sensor detected a critical value), Memory Errors (DMI only)
Memory Pre-Failure Warning | Memory | Memory Device Warning
Processor Failure | Processor | DMTF: Motherboard processor failure, Processor Failure (see the ‘Processor Events’ category for details)
Temperature Failure (in a system backplane or board etc.) | Environmental | Temperature Failure, Temperature Fault – Critical (DMI only)
Chassis security breach | Security | Security Settings Change – Critical, Container Security Breach – Critical (DMI only, from ‘Other’ category)
System down | Software | The System Down mssg from IT Assistant (only when using IT Assistant to discover the node)
NIC Failure (for Intel NIC) | Network | Intel NIC Link Down, LAN Controller hardware Failure
NIC Failure (for Broadcom NIC) | Network | NIC Failover Event
Drive Failure | Storage | Logical Drive Critical, Physical Drive Missing On Startup, RAID:Physical Drive State Failed, Mirror Drive Failure
Disk Failure | Storage | Array Disk Failed, Device Failure, Hard Disk Failed, Virtual Disk Failed, Hard Disk May Fail Soon
Failure of a cluster or node in a cluster | Cluster | Failure of a node in an MSCS cluster, Failure of an MSCS cluster
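A few rows of Table 15 can be captured as a lookup to build a monitoring baseline from the failure types an administrator cares about. This is only a sketch: the dictionary holds a small subset of the table, and the `BASELINE_EVENTS` and `events_to_monitor` names are hypothetical helpers, not part of IT Assistant.

```python
# Subset of Table 15: critical failure type -> baseline event types to monitor.
BASELINE_EVENTS = {
    "System Fan Failure": ["Cooling Device Failure"],
    "Power Supply Failure": ["Power Supply Failure"],
    "Memory Pre-Failure Warning": ["Memory Device Warning"],
    "NIC Failure (for Broadcom NIC)": ["NIC Failover Event"],
    "Failure of a cluster or node in a cluster": [
        "Failure of a node in an MSCS cluster",
        "Failure of an MSCS cluster",
    ],
}


def events_to_monitor(failure_types):
    """Collect baseline event types for the given failures, without repeats."""
    selected = []
    for failure in failure_types:
        for event in BASELINE_EVENTS[failure]:
            if event not in selected:
                selected.append(event)
    return selected


print(events_to_monitor([
    "System Fan Failure",
    "Failure of a cluster or node in a cluster",
]))
```

Keeping the selection as an ordered list without repeats mirrors the goal of monitoring each important event once rather than selecting every pre-populated event.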
Section 6
Conclusion
The Event Management System feature of IT Assistant can be used in various
ways to monitor different Dell servers and client systems, and can be customized
to suit the environment. You can select all of these events for all systems
without paying attention to individual events, but if you want to be selective,
it is imperative to know which events are important to monitor. Knowing these
events can also help you execute email or paging actions for critical events.
This paper provided details of the important events generated by all supported
Dell agents, and explained the granularity and low-level details of each of these
events. In planned future releases of IT Assistant, all these pre-populated
events will be cleaned up and streamlined.
Please refer to DMTF documents (available at http://www.dmtf.org) for more
details on DMTF events and the following Dell documents for details about the
events generated by specific Dell Agent Applications.
There are User’s Guides available for:
• Dell OpenManage Server Agent 4.x
• Dell OpenManage Hardware Instrumentation Package 3.x
• Dell OpenManage Array Manager 2.x, 3.x
• Dell OpenManage IT Assistant 6.x
• Dell OpenManage IT Assistant Database Management Utility
Other useful references include the following:
• Dell OpenManage Server Agent Message Reference Guide
• Configuring and Using the Dell OpenManage IT Assistant Event Management
System (article)
Manoj Gujarathi is a systems engineer at Dell Computer Corporation
(http://www.dell.com). He has over four years of experience in system
management applications and currently works as a lead engineer for the Dell
OpenManage IT Assistant and Dell OpenManage Connections applications. Manoj has a
Master’s in Engineering from Washington State University and a Master’s in Computer
Science from Texas Tech University. He is a Microsoft Certified Systems Engineer.
Dell, OpenManage, PowerEdge, PowerVault, and PowerApp are trademarks of Dell Computer Corporation.
Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and
names or their products. Dell disclaims proprietary interest in the marks and names of others.
©Copyright 2001 Dell Computer Corporation. All rights reserved. Reproduction in any manner whatsoever without the
express written permission of Dell Computer Corporation is strictly forbidden. For more information, contact Dell. Dell
cannot be responsible for errors in typography or photography.
Information in this document is subject to change without notice.