Data Integrity for Compaq
NonStop® Himalaya Servers
Compaq NonStop® Himalaya Servers White Paper
Data integrity concepts, features,
and technology
Commercial computer systems are used for applications that are critical to our health, safety, and
financial security, such as emergency 911 services,
healthcare, vehicle routing, and stock market
transactions. In such applications, it is a fundamental expectation that computer systems will
always provide the correct data. Incorrect data
could result in a train wreck or could cause an
emergency call to be routed to the wrong location. A change of even a single bit of data in a
financial transaction can alter its value by millions of dollars or cause the transaction to be
recorded in the wrong account.
Contents

Introduction
Data integrity concepts
Data corruption causes
  Electronic noise
  Physical hardware defects
  Hardware design errors
  Software design errors
Data corruption frequency
Future trends
Data corruption consequences
Data integrity features of Compaq NonStop® Himalaya systems
  Compaq NonStop® Himalaya system design philosophy: Fail-fast
  Compaq NonStop® Himalaya system processors: Lockstepped microprocessors
  Compaq NonStop® Himalaya system storage: End-to-end checksums
  Compaq NonStop® Himalaya system communications: ServerNet technology
  Compaq NonStop® Himalaya system software
Conclusion: The Compaq NonStop® Himalaya system advantage
References
Most computer system manufacturers rely on memory error-correcting
code (ECC) and disk vendor storage protection mechanisms to safeguard their data. However, there
are many other computer system components, particularly microprocessors, controllers, and buses,
that have the potential to corrupt data. This problem is being exacerbated by continually shrinking
electronic circuit geometries as more functionality is absorbed into a single chip. Because of the low
cost and excellent performance of today’s commercial microprocessors, business-critical commercial computer systems are using the same components contained in desktop PCs. But because of the
competitive nature of the PC industry, these components do not have the built-in features, such as
self-checking, that are necessary to guarantee data integrity. System vendors need to use the PC-based components to remain price competitive but at the same time must find ways to safeguard
their customers’ data.
Compaq NonStop® Himalaya servers incorporate the world’s best data integrity features. These features include lockstepped microprocessors (two microprocessors that execute the same instruction
stream and crosscheck each other), end-to-end checksums on data sent to storage devices, and complete protection of all internal buses and drivers. Stock markets and financial institutions depend on
Compaq NonStop® Himalaya servers to protect their data from hardware and software faults, power
glitches, and other failure mechanisms that could alter their transactions and cause potentially disastrous results. Telecommunications companies depend on Compaq NonStop® Himalaya servers to
ensure that their calls are routed to the right customers. Retailers depend on Compaq NonStop®
Himalaya servers to accurately process credit card transactions.
The first section of this white paper provides a brief introduction to data integrity concepts. The
next section describes the underlying causes of data corruption. The third and fourth sections
describe the frequency and effects of data corruption. The fifth section explains the Compaq
NonStop® Himalaya systems technology that ensures data integrity. The final section summarizes
the unique data integrity features and advantages of Compaq NonStop® Himalaya servers.
Data integrity concepts
A change in a data value is physically recorded by a change in voltage. If the
voltage is not recorded correctly or is inadvertently changed after being
recorded, data corruption has occurred.
All computer systems use numerical values called data to represent information. This
represented information can be almost anything, including a letter on this page, a bank
account value, a credit card number, a temporary calculation, or a software instruction. A
computer system uses data for calculations and
stores data, both permanently on a disk or tape
drive and temporarily in the computer system
memory. Data is transmitted among computers
using networks such as the Internet. Data often
changes—for example, when an account balance is updated—but computer system users
always expect that the computer system will
maintain the integrity of the data, meaning
that the computer system will never inadvertently or incorrectly alter a numerical value. An
inadvertent or incorrect change that compromises data integrity is called data corruption.
A digital computer system represents data in
terms of ones and zeros. Each single 1 or 0 is
called a bit, and 8 bits is typically referred to as
a byte. Bits are physically encoded as a small
voltage in an electronic circuit. There are many
different techniques and circuits for performing this data encoding, but the basic idea is
that a relatively high voltage represents a 1 and
a relatively low (or nonexistent) value represents a 0. A change in a data value is physically
recorded by a change in voltage. If the voltage is
not recorded correctly or is inadvertently
changed after being recorded, data corruption
has occurred.
The previous paragraph describes how data is
stored in computer memory, but there are
many places outside the computer memory
that data corruption can occur. Figure 1 shows a
simplified data path between memory and permanent storage on a disk drive. A typical transaction consists of a processor retrieving data
from a disk drive, performing some calculations, and writing the modified data back to the
disk. If any component in the data path inadvertently changes a bit, data corruption has
occurred. Such components include the processor, memory, disk, and all the controllers and
buses that transport data from one location to
another. Each of the blocks in Figure 1 typically
consists of multiple components and contains
a small amount of memory for buffers, queues,
and/or code space. All of these components and
memory have the potential to corrupt data.
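To make the sensitivity concrete, the following minimal C sketch (an illustration only, not code from any Compaq product) shows how flipping a single bit anywhere along that path can change a stored monetary amount by tens of thousands of dollars.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cents = 150000;                  /* $1,500.00 stored in cents  */
        uint32_t corrupted = cents ^ (1u << 23);  /* one bit altered in transit */

        printf("intended : $%u.%02u\n",
               (unsigned)(cents / 100), (unsigned)(cents % 100));
        printf("corrupted: $%u.%02u\n",
               (unsigned)(corrupted / 100), (unsigned)(corrupted % 100));
        return 0;
    }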
Data corruption does not automatically imply
an incorrect calculation or an incorrect value in
a database, or even an operational error. The
corruption may change a bit in a memory location that is not used or that is overwritten
before it is used; it may cause an exception or
retry or processor halt by branching to an illegal
instruction; it may change a data value that
does not affect the results of the computation;
or it may cause an incorrect entry in a log. It is
even possible that the system detects the data
corruption and corrects the incorrect value.
There are two approaches to improving data
integrity. The first is to prevent data corruption
from occurring. This is the province of integrated circuit manufacturers. They use special
types of materials to make devices, especially
dynamic random access memories (DRAMs)
and static random access memories (SRAMs),1
that are less sensitive to the types of electronic
noise that can cause a bit to change. This
approach is described in the following section.
The second approach is to detect and possibly
correct the data corruption. This is the province
of system vendors. Almost all computer vendors
provide some form of error detection and correction on main memory, and most provide
some form of parity protection on secondary
cache. However, as described later in this paper,
only Compaq NonStop® Himalaya servers provide error detection and correction throughout
the entire system rather than just on the memory devices.
1 SRAMs and DRAMs are memory devices used to temporarily store the data needed for microprocessor calculations
and the results of the calculations. These devices improve
computer performance by providing faster access to data
than disk drives. SRAMs are faster but more expensive
than DRAMs, so SRAMs are usually used for a small area of
memory called secondary cache, whereas DRAMs are used
for the main memory. In the past, SRAMs were used for a
small area of memory called primary cache, but current
microprocessors have absorbed the primary cache onto
the microprocessor chip to improve performance.
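As a point of reference for the parity protection mentioned above, the sketch below (illustrative C, not vendor code) shows the simplest form of such a check: a single even-parity bit stored alongside a byte detects, but cannot correct, any single-bit error.

    #include <stdint.h>
    #include <stdio.h>

    /* Return the even-parity bit for an 8-bit value. */
    static unsigned parity8(uint8_t v)
    {
        unsigned p = 0;
        while (v) {
            p ^= v & 1u;   /* toggle for every 1 bit */
            v >>= 1;
        }
        return p;
    }

    int main(void)
    {
        uint8_t stored = 0x5A;               /* data written to the cache   */
        unsigned check = parity8(stored);    /* parity bit stored alongside */

        stored ^= 0x08;                      /* a particle flips one bit    */

        if (parity8(stored) != check)
            puts("parity error detected: discard or refetch the data");
        else
            puts("data accepted");
        return 0;
    }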
Figure 1. Data path from memory to disk (blocks shown: microprocessor, memory, memory controller, bus converter, output formatter, bus controllers, SCSI controller, SCSI bus, disk drive controller, and disk drive media).
Data corruption causes
In the future, devices will become even more susceptible to data corruption
caused by high-energy particles, and the protection that Compaq NonStop®
Himalaya servers provide against data corruption will become even more critical.
There are many possible causes of data
corruption in a computer system. These
causes can be grouped into the following four
categories:
➔ Electronic noise: An externally caused current flow that disrupts a stored voltage,
usually without causing permanent
change to a device. Electronic noise is primarily caused by high-energy particles and
power disturbances.
➔ Physical hardware defects: Microscopic
holes or cracks, contamination, and packaging problems that alter current flow and
voltage values. Age-related phenomena
such as electron migration are included in
this category.
➔ Hardware design errors: Logic design errors,
circuit design errors, inadequate thermal
design (heat sinks), incorrect device specification or utilization, and timing errors that
cause bits to be misread or miscoded.
➔ Software design errors: Incorrect algorithms,
software errors that overwrite good data,
and error recovery software that does not
correctly restore data following a failure.
High-energy particles and other electronic
noise sources typically do not permanently
damage an electronic device but usually cause
a one-time change to a bit or multiple bits.
These “single upset” events are usually called
soft errors and vanish when a new data value
(voltage) is written to their location. Physical
hardware defects and design errors are more
likely to cause permanent changes and are
called hard errors. The most insidious failure
modes are those permanent changes that
cause intermittent errors. This can happen with
a component that has marginal voltage or timing, causing a bit to be read as a 1 some of the
time and as a 0 other times. These types of failure modes cause field failures that usually cannot be diagnosed when the part is returned to
the vendor (resulting in “no defect/trouble
found”).
Electronic noise
Electronic noise consists of current fluctuation
due to internal or external sources. Because
electronic circuits use current flow to record
the voltage values that represent data, electronic noise has the potential to alter voltage,
thus changing a data value. The most common
cause of electronic noise in memory devices is
high-energy particles, either alpha particles or
extraterrestrial nuclear particles.
Alpha particles are usually generated by
radioactive decay of trace radioactivity in semiconductor packaging materials. This trace
radioactivity comes from impurities in the
packaging material or impurities added in the
manufacturing process—for example, material
deposited during a cleaning process using
slightly radioactive water or acid. Alpha particles have been a known problem for many
years, and memory device vendors invest significant resources to protect against alpha particles. Besides carefully evaluating and testing
packaging materials, memory vendors often
add layers on top of the silicon (epitaxial layers)
to absorb alpha particles and protect the memory die inside the package with a “glob” of nonradioactive, highly absorptive material. Because
alpha particles have a relatively large cross-section, this “glob” absorbs most of them before
they can get to the electronic circuits that store
the data.
Elementary particles such as neutrons or protons originating in the sun or other stars are
usually called cosmic rays. For memory devices,
IBM experiments show that “under normal
operations, cosmic rays are by far the predominant cause of soft errors.”2 At sea level, 95 percent of the particles are neutrons. Memory
vendors have only recently become concerned
with high-energy neutrons as the transistor
size and amount of voltage used to encode a bit
have shrunk to the point where they can alter a
significant number of bits. It is much more difficult to shield against high-energy neutrons
than against alpha particles because their
cross-section is much smaller. Six feet of concrete is needed to significantly reduce the flux
of cosmic rays, which is not usually a reasonable
requirement for the roof of a computer room.
The mechanism through which high-energy
alpha particles and nuclear particles alter a
stored charge is very similar. Upon entering the
device, a particle creates a trail of negative
charge (electrons) and a corresponding positive
charge (electron “holes”). This trail causes the
current to flow into an electronic circuit, which
can upset and change the circuit’s voltage.
Memory manufacturers have incorporated
resistors and other devices into electronic
circuits to prevent some of these voltage
changes, but memory cells are still susceptible
to particle impact.
The current flow created by high-energy particles is unlikely to cause problems with other
types of electronic circuits that perform calculations or transfer data. For example, the arithmetic logic unit (ALU) within a microprocessor
has an active circuit that continuously supplies
current, and the current flow caused by a high-energy particle is not usually sufficient to upset
the voltage. Data on internal buses is also
driven with sufficient voltage to avoid being
upset by current flow caused by high-energy
particles. The most vulnerable areas on these
chips are the primary cache on microprocessors,
the unprotected temporary storage registers on
the microprocessor, and the small data storage
areas on other integrated circuits (ICs).
The other common source of electronic noise is
power transients. A typical computer site sees
443 power disturbances such as sags and
surges per year, according to a National Power
Laboratory study.3 Most of these events have a
very short duration, which can cause a “glitch”
or momentary change of current supplied to a
circuit. The current flow caused by a glitch can
disrupt stored data. These kinds of problems are
well known, and hardware designers work diligently to protect electronic circuits against
such external disturbances.
2 From T. J. O’Gorman, et al., “Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories,” IBM Journal of Research and Development, January 1996, pp. 41–49.
3 From D. S. Dorr, “National Power Laboratory Power Quality Study Initial Results,” Proc Applied Power Electronics Conference, February 1992 (a 1990–1995 National Power Laboratory study of 235 computer sites in the United States).
Physical hardware defects
An integrated circuit is a complex combination
of many materials, including metals, alloys,
ceramics, and polymers. The thermal, chemical,
mechanical, structural, and electrical characteristics of the various materials must be carefully
balanced. It is easy for imperfections in the
interactions or interfaces of these materials to
occur, which may cause an integrated circuit to
fail. There are thousands of places to introduce
imperfections in integrated circuit design,
wafer fabrication, assembly, handling, and testing. Contamination has long been recognized as
a potential problem leading to heavy investment in “clean rooms.” Some examples of manufacturing-induced defects include
➔ Improper film thickness
➔ Trapped moisture (corrosion)
➔ Small cracks/voids
➔ Residual cleaning chemicals
➔ Dust
➔ Open wire bonds
➔ Thermal stress
➔ Metal bridges
Physical hardware defects may reveal themselves during manufacturing testing or may not become evident until some time after being shipped in a product. For example, a small crack may grow over a period of operation and/or temperature variation. At some point, the crack may prevent sufficient current flow or may cause other operational errors. This may lead to an obvious failure of the device or a less obvious incorrect operation, which could cause data corruption. The size of the crack may shrink and grow, depending on stress and temperature, leading to intermittently correct and incorrect results. These types of defects are very hard to diagnose.
Hardware design errors
Design errors in a manufacturing process can cause the manufacturing imperfections just described. However, there can also be design errors in specifying the function, timing, and interface characteristics of a device or in the logic and circuit design.
In addition to being stored as a voltage, data
and control signals are read as a voltage. If a signal voltage is above a certain threshold, then
the data or control bit is read as a 1, whereas if it
is below the threshold it is read as a 0. When
changing a bit from a 1 to a 0 or vice versa, there
is a transition period to allow the voltage to
change. Because each individual device will
have slightly different signal delay (impedance)
and timing characteristics, the length of that
transition period will vary. The final voltage
value attained will also vary slightly as a function of the device characteristics and the operating environment (temperature, humidity).
Computer hardware engineers allow a certain
period of time (called design margin) for the
transition period to complete and the voltage
value to settle. If there are timing errors or
insufficient design margins that cause the voltage to be read at the wrong time, the voltage
value may be read incorrectly, and a bit may be
misinterpreted, causing data corruption. Note
that this corruption can occur anywhere in the
system and can cause incorrect data to be written to disk even when there are no errors in
computer memory or in the calculations.
Hardware design errors can also cause electronic noise to be generated. For example, if two
signal lines are too close together, they may
interfere with each other under certain conditions (a phenomenon known as cross-talk) and
change a data value on one of the lines.
Software design errors
Software errors that affect data integrity
include errors in calculations, errors that alter
correctly stored data, and incorrect restoration
of data corrupted by a failure.
If the algorithm used to compute a value is
incorrect, not much can be done outside of
good software engineering practices to avoid
such mistakes. However, many applications
have checks to protect against incorrect calculations. System vendors must ensure that the
software beneath the application layer does not
corrupt data.
A processor may attempt to write to the wrong
location in memory, which may then overwrite
and corrupt a value. In this case, it is possible to
avoid data corruption by not allowing the processor to write to a location that has not been
specifically allocated for the value it is attempting to write.
Following a processor halt or disk crash or other
failure, a computer system almost certainly has
corrupt data residing in memory and disk storage. Transactions will have been partially completed, leaving the database in an incorrect
state. For example, in a transfer of $1,000 from
a savings account to a checking account, the
savings account may have been decremented
by $1,000, but the checking account may not
have been incremented by $1,000. The computer memory and open files may have partial
changes or other corrupt data. Error recovery
software must correctly restore the database,
files, and memory to avoid data corruption
caused by partial calculations or incomplete
transactions.
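The transfer example above can be made concrete with a small, self-contained C sketch (an illustration of the recovery idea only, not NonStop TM/MP code): before the accounts are changed, their old values are remembered, so a failure partway through can be rolled back rather than leaving the database half updated.

    #include <stdio.h>

    struct accounts { long savings; long checking; };

    /* Transfer 'cents' from savings to checking atomically. */
    static int transfer(struct accounts *a, long cents)
    {
        struct accounts before = *a;          /* undo record (a tiny "log")   */

        a->savings -= cents;
        if (a->savings < 0) {                 /* simulated failure midway     */
            *a = before;                      /* roll back the partial change */
            return -1;
        }
        a->checking += cents;
        return 0;                             /* "commit": both changes kept  */
    }

    int main(void)
    {
        struct accounts a = { 50000, 20000 }; /* $500.00 and $200.00          */

        if (transfer(&a, 100000) != 0)        /* attempt a $1,000.00 transfer */
            puts("transfer aborted, accounts restored");
        printf("savings=%ld checking=%ld\n", a.savings, a.checking);
        return 0;
    }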
Data corruption frequency
For business-critical computing, the only acceptable number of data
corruptions is zero.
Microprocessor failure rates are based
on a number of factors: complexity, technology,
packaging, manufacturing process, and operating environment. There are handbooks that are
widely used for estimating hard failure rates.
The most applicable handbook for a commercial environment is the Bellcore handbook.4
This handbook measures failure rates in terms
of failures per billion hours, called FITs.
Microprocessors differ, but a reasonable estimate for the failure rate of current complex
microprocessors is 1,000 FITs. Some hard failures, such as an output pin short, can be easily
detected and immediately cause the processor
to stop functioning. Other hard failures, such as
a gate that is stuck at 0 or 1, can be more subtle
and require detection by hardware or software
mechanisms. From vendor and other data, it is
estimated that half of the failures are obvious
and half are subtle, so that 500 FITs is the
appropriate failure rate to use for evaluating
potential data corruption.
Unfortunately, there are no handbooks for evaluating transient and soft error rates. Using vendor data, it is estimated that the combined
transient/soft error rate is 4,000 FITs, with
2,000 FITs being for the logic and 2,000 FITs for
the primary cache on the chip. For the primary
cache, it is estimated that 90 percent of the
errors are caught by on-chip ECC (which is usually just simple parity), so 200 FITs is the appropriate primary cache failure rate to use for
evaluating potential data corruption.
The total failure rate used for evaluating
potential data corruption is 2,700 FITs: 500 of
which are due to hard failures, 2,000 due to
transient/soft logic errors, and 200 due to primary cache errors. A failure rate of 2,700 FITs
corresponds to a failure rate of 24 microprocessors per 1,000 per year. Note that this is an
estimated average for current microprocessors
and failure rates could vary significantly. A reasonable range is probably 1,000 to 10,000 FITs.
Because data corruption locations are random,
it is not possible to predict data corruption
effects. The failure rate just calculated describes
the potential for data corruption, but it is difficult to determine whether a transient error in
the system will become corrupted data. It
depends on the system and application, the
operating environment, and on random events.
4 “Reliability Prediction Procedure for Electronic
Equipment,” Bellcore Technical Reference TR-332, Issue 6,
December 1997.
Many data corruptions in a microprocessor are
benign. For example, an error in a location in primary or secondary cache could occur within
instructions or data that are never used (such as
an error in a conditional branch instruction that
is not exercised) or be overwritten before it can
be used. Others could cause errors that are
detected by the operating system. These
include mechanisms to detect illegal instructions, branches to nonexistent locations, bus
errors, bad system calls, arithmetic exceptions
(overflow), improper values, and illegal characters. These types of errors may cause a retry
mechanism to be invoked and allow the system
to recover or may cause a processor to halt or a
system process to abend. An error could also
cause an endless loop or a branch to a location
that does not return, both of which lead to a
processor hang or a time-out.
If the hardware or operating system does not
detect the error, the application may have built-in checks that allow it to detect data corruption.
For example, it may check to ensure that current results are compatible with previously
stored results or data. Some variables may have
range checks. The application may also have
additional checks similar to the end-to-end
checksums that Compaq NonStop® Himalaya
systems provide (described in the section “Data
integrity features of Compaq NonStop®
Himalaya systems”).
There have been various studies to try to estimate the probability that an error will become
corrupted data. The following table shows the
results from one such study.5 In this study, various errors were injected into a workstation
while running a matrix multiplication application. More than 600,000 cases were run and
compared to the known correct results. Because
the purpose of the study was to evaluate error
detection methods, features were added to the
operating system to improve its robustness, and
a checksum method was evaluated for error
detection.
Error detection percentages

  Result of error                      Percent of cases   Notes
  Detected by system or application    46.4%              43.4% detected by system, 3% by application
  Undetected, benign error             41.2%
  Data corruption                      12.4%              5.5% detected by checksum
Nearly half of the errors were detected by the
system or application by built-in error detection
mechanisms, such as traps for illegal instructions, arithmetic exceptions, and incorrect system calls. These detected errors might cause a
software retry, a log entry, an error returned to
the user, or a processor halt, depending on how
they manifest themselves. A little more than 40
percent of the errors were undetected but had
no effect, for example, values that were overwritten before they could be used or errors in
the addresses of branches that were not taken.
This still left 12.4 percent of errors that were
undetected by the system or application but
caused data corruption. The checksum technique being evaluated in the study found 5.5
percent of the data corruptions, but even with
this special application error detection mechanism, 6.9 percent of the errors were still completely undetected.
To estimate the number of undetected data corruptions, the potential data corruption rate of
24 per 1,000 microprocessors per year is combined with the estimate (see “Error detection
percentages” table) that 12.4 percent of these
failures will cause undetected data corruption.
This yields a rate of about 3 undetected data
corruptions per 1,000 microprocessors per year.
5 From G. Kanawati, N. Kanawati, and J. Abraham,
“FERRARI: A Tool for the Evaluation of System
Dependability Properties,” Proc 22nd International
Symposium on Fault-Tolerant Computing, June 1992.
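The arithmetic behind these estimates is simple enough to restate directly; the short C program below (a back-of-the-envelope check using the assumed 2,700 FIT rate and the 12.4 percent figure from the table above) reproduces the two numbers quoted in the text.

    #include <stdio.h>

    int main(void)
    {
        double fits            = 2700.0;        /* estimated failure rate     */
        double hours_per_year  = 24.0 * 365.0;  /* 8,760 hours                */
        double per_1000_per_yr = fits * 1e-9 * hours_per_year * 1000.0;
        double undetected      = per_1000_per_yr * 0.124;

        printf("potential corruptions : %.1f per 1,000 processors per year\n",
               per_1000_per_yr);                /* about 24                   */
        printf("undetected corruptions: %.1f per 1,000 processors per year\n",
               undetected);                     /* about 3                    */
        return 0;
    }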
A similar model is described in a paper about
data corruption.6 The paper concludes that an
undetected data corruption can occur about
once per year per 1,000 installed microprocessors. The paper also concludes that there are
wide error bounds on such an estimate because
of the lack of data on transient errors and error
propagation. This white paper and the paper on
data corruption both provide similar estimates
of data corruption frequency and both note
that there are wide error bounds on this
estimate.
In some sense, the exact data corruption frequency is unimportant. For business-critical
computing, the only acceptable number of data
corruptions is zero.
6 From R. Horst, D. Jewett, and D. Lenoski,“The Risk of Data
Corruption in Microprocessor-based Systems,” Proc 23rd
International Symposium on Fault-Tolerant Computing,
June 1993.
Future trends
The semiconductor industry strives to keep pace with Moore’s law, the observation that transistor density, and with it microprocessor performance, roughly doubles every 18 months. Thus,
the trend is toward increased functionality for
each square centimeter of silicon, which is
achieved by having smaller devices (gates,
transistors, memory cells) embedded in the
silicon and packing them tighter together.
This trend implies
➔ Smaller component feature sizes that
require less energy to be disturbed
➔ Increased component density in a chip,
meaning that high-energy particles are
more likely to collide with a component in
the chip and cause a disruption
➔ Reduced amounts of voltage used to perform calculations and store data, meaning
that electronic noise immunity and transistor breakdown voltage are reduced
These changes mean that, in the future, devices
will become even more susceptible to data corruption caused by high-energy particles, and
the protection that Compaq NonStop® Himalaya
servers provide against data corruption will
become even more critical.
Data corruption consequences
When data is corrupted in a computer system, almost anything can happen.
Sometimes you get lucky, and the corruption goes unnoticed. Other times,
you don’t get so lucky, and $1 billion ends up in the wrong place.
There are more than 100 million PCs and
other computers in the United States alone,
meaning that the preceding models predict
hundreds of data corruptions every day. So, why
don’t we hear more about data corruption in
the news media? First of all, companies are not
going to be very forthcoming if they determine
that there was a data corruption in their systems that caused an error, but, even more likely,
companies are simply not aware that there was
a data corruption. Most people assume that
there was an operator error or software defect
that caused the incorrect data, which is often
true. Because we know how fallible humans are
and do not normally think about the potential
effects of high-energy nuclear particles, human
error is assumed to be the cause of the problem. Even if there is suspicion about the computer itself, most commercial computers have
no error indications or logs that could be used
to track the source of the data corruption. There
is probably no way to ever determine what
really caused a bit to flip. Have you ever had to
reboot your PC? Maybe the real problem was
data corruption rather than the application
software you probably grumbled about.
If the stored data is in an airline reservation system, does it just garble the name of a city, or
does it change a flight number or date, causing
a passenger’s reservation to be wrong? If the
stored data is in medical records, does it simply
garble the name of a patient, or does it cause
the wrong drug or dosage to be given to a
patient? And if the stored data is in motor vehicle records, will the wrong person receive a
traffic ticket?
Assume that someone is calling 911 when a
transient error occurs. Using data from the
“Error detection percentages” table, the results
of the emergency call are shown in figure 2.
For a business-critical system, the challenge is
trying to assess the impact of the corrupted
data value. If it is a monetary value, is the
changed value off by $1, $1,000, $1 million, or
more? If the stored data is in e-mail, does it simply change a character in a message (a great
new excuse for typos), or does it send a confidential e-mail to the wrong location (your
Internet service provider’s worst nightmare)?
Figure 2. Data corruption effects on a 911 emergency call (when you call 911 and a transient error occurs, you get: “911” for an undetected, benign error; “Delay or no answer” for a detected error; or “Ernie’s Pizza” for a data corruption, in the proportions given in the “Error detection percentages” table).
The effect of a data corruption doesn’t necessarily stop with a single wrong number.
Corrupted data is usually stored in the database
and has the potential to do more damage. If the
data is used in more calculations, those future
calculations could then be incorrect. Data corruption can also detract from system availability. If the data in a database is corrupted, the
operators may have to shut down the system
and run extensive checks and repairs, perhaps
restoring the database from tape before a business can afford to continue. A second kind of
database corruption affects the linkages among
the various tables. This causes queries to fail
because tables are incorrectly linked and pointers reference the wrong locations in physical
memory. Operators have to manually restore
these linkages—a process that can take days.
When data is corrupted in a computer system,
almost anything can happen. The results
depend on the corruption location, the timing
of the corruption, the contents of memory, the
application and execution environment, and
the protection mechanisms built into the system and application. Sometimes you get lucky,
and the corruption goes unnoticed or is
detected and corrected. Other times, you
don’t get so lucky, and $1 billion ends up in
the wrong place.
Data integrity features of Compaq
NonStop® Himalaya systems
There is no equivalent to Compaq NonStop® Himalaya system data integrity
in other computer systems. Compaq is the only vendor that builds processors
with lockstepped microprocessors, extensive parity checking on the buses, and
end-to-end checksums.
As described in an earlier section, there are
many potential causes of data corruption.
Although the entire computer is vulnerable to
data corruption, most vendors provide only a
limited form of data integrity checking—for
example, error-correcting code (ECC) on memory and disks. Because data integrity is a
required component for business-critical
computing, Compaq NonStop® Himalaya systems provide industry-leading data integrity.
Compaq NonStop® Himalaya
system design philosophy: Fail-fast
Because Compaq NonStop® Himalaya servers
are fault tolerant, it might be expected that all
their hardware modules are fault tolerant. On
the contrary, Compaq deliberately wants hardware modules to be fault intolerant. That is,
Compaq wants any incorrectly functioning
hardware module to detect the problem and
shut itself down as quickly as possible. This concept is called fail-fast and prevents hardware
errors from leading to data corruption.
The fail-fast design philosophy means that
each hardware and software module is self-checking and immediately ceases operation rather than permitting errors to propagate.
Although a module may attempt to recover
from a fault—for example, ECC on memory—it
will immediately halt if there is any possibility
that data corruption will result from continued
operation. This helps avoid data corruption and
system outages caused by a single propagated
hardware or software fault. The lockstepped
microprocessors described in the next section
are an excellent example of the fail-fast
philosophy.
This fail-fast design philosophy is important
both for data integrity and for serviceability. If
either a hardware or software error is allowed
to propagate, it is often difficult to determine
where the failure occurred because the evidence is either gone or obscured. It is then very
difficult to find and fix the root cause of the
problem. This leads to repeated failures
because service personnel might replace the
wrong hardware unit or be unable to find and
repair a software defect. It also leads to the
inefficient service technique of “shotgunning,”
or replacing one module after another in an
attempt to fix the problem.
When a server running on the Compaq
NonStop® Kernel operating system halts, it
saves its state. This allows customers or service
personnel to dump the processor before reloading it and send the dump to Compaq’s failure
analysis group. This information and the fail-fast philosophy allow hardware and software
developers to do a better job of analyzing problems. It may be the only way to determine the
root cause of a transient error such as a timing
problem. While Compaq is able to analyze problems and help prevent their recurrence, most
vendors can only reboot and hope a problem
does not occur again.
Compaq NonStop® Himalaya
system processors: Lockstepped
microprocessors
Modern microprocessors, with their internal
state machines, registers, data paths, and
onboard cache, have the potential to flip bits or
otherwise corrupt data. Without a mechanism
to check the integrity of the data, these errors
can propagate and corrupt a database.
Commercial microprocessors generally do not
incorporate internal parity on registers and data
paths, and they do not include logic to check
state machines. The reason stems from a belief
that internal self-checking hurts costs, performance, or time to market.
Compaq NonStop® Himalaya system processors
contain two microprocessor chips. These microprocessors are lockstepped; that is, they run
exactly the same instruction stream. The output
from the two microprocessors is compared and
if it should ever differ, the processor output is
frozen within a few nanoseconds so that the
corrupted data cannot propagate.
The Compaq NonStop® Himalaya server’s lockstepped microprocessor architecture is shown
in Figure 3.
An incoming request (Compaq ServerNet
technology packet) is sent to both interface
application-specific integrated circuits (ASICs)7,
which translate it and forward it to the microprocessors. Each microprocessor simultaneously
services the request with simultaneous access
to secondary cache and memory (using controllers and buses that are not shown). The output response from the microprocessors is
compared by the interface ASICs. If there is even
a single bit difference, the microprocessor outputs are immediately frozen to prevent a corrupt ServerNet packet from being transmitted. It
is important that data integrity is protected by
a hardware error freeze rather than a software
halt, as occurs in other computer designs.
A software halt requires milliseconds of latency,
which, in some cases, is enough time to allow
the corrupt data to be output from the microprocessor and propagate throughout the system. A hardware error freeze is guaranteed to
avoid the latency problem.
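The comparison itself is performed in hardware by the interface ASICs, but the idea can be sketched in software. In the conceptual C fragment below (purely illustrative), two calls to the same routine stand in for the two microprocessors executing the same instruction stream, and any difference in their outputs halts everything before a result can be sent.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The "instruction stream": some deterministic computation. */
    static uint32_t service_request(uint32_t input)
    {
        return input * 2654435761u + 12345u;
    }

    int main(void)
    {
        uint32_t request = 42;
        uint32_t out_a = service_request(request);   /* microprocessor A */
        uint32_t out_b = service_request(request);   /* microprocessor B */

        /* uncomment to simulate a flipped bit in one output path:
           out_b ^= 1u << 7; */

        if (out_a != out_b) {
            fputs("output mismatch: freeze outputs, halt processor\n", stderr);
            abort();              /* fail fast: never let bad data escape */
        }
        printf("outputs agree: 0x%08x sent on ServerNet\n", (unsigned)out_a);
        return 0;
    }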
Compaq NonStop® Himalaya systems and other
computers use state-of-the-art memory detection and correction to correct single-bit errors,
detect double-bit errors, and detect “nibble”
errors (three or four bits in a row, which could be
caused by complete failure of a single DRAM).
However, Compaq NonStop® Himalaya systems
go beyond this protection by modifying the vendor’s ECC to include both address and data bits
in the ECC code. This helps avoid reading from or
writing to the wrong memory location. In addition, Compaq NonStop® Himalaya systems provide a memory “sniffer” that runs in the
background and tests the entire memory every
few hours. This sniffer prevents latent faults in
seldom-used areas of memory from accumulating and causing an undetectable error.
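The value of folding address bits into the memory check code can be seen with a deliberately simplified stand-in (the sketch below uses a plain XOR fold rather than the actual ECC): because the stored check covers both the data and the address it was written to, data returned from the wrong location fails the check even when the data itself is intact.

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t fold8(uint64_t x)           /* XOR all eight bytes together */
    {
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        return (uint8_t)x;
    }

    static uint8_t check_code(uint64_t data, uint64_t address)
    {
        return fold8(data) ^ fold8(address);   /* covers data AND address bits */
    }

    int main(void)
    {
        uint64_t data = 0x1122334455667788u;
        uint64_t addr = 0x0000000000401000u;   /* location the data was written to */

        uint8_t stored = check_code(data, addr);  /* kept with the data in memory  */

        /* A faulty address line makes the read come back from a different
         * location; the check is recomputed with the address the processor
         * actually requested, so the mismatch is caught. */
        uint64_t requested = addr ^ 0x40u;
        if (check_code(data, requested) != stored)
            puts("check failed: memory returned data from the wrong location");
        return 0;
    }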
Figure 3. Compaq NonStop® Himalaya server’s lockstepped microprocessor architecture (two microprocessors, each with its own secondary cache and access to memory, service the same request; their outputs are crosschecked by interface ASICs connected to the ServerNet technology fabric).
7 An application-specific integrated circuit (ASIC) is a device that usually contains several hundred thousand gates, performs a wide variety of functions specific to the design, and significantly reduces the number of discrete components on a circuit board.
Compaq NonStop® Himalaya system storage: End-to-end checksums
After the data leaves the microprocessors, it is
subject to glitches on the buses or defects in
other components. In Compaq NonStop®
Himalaya systems, data on all the buses is parity protected, and parity errors cause immediate
interrupts to trigger error recovery or, if necessary, a processor halt. The microprocessors in
I/O controllers are protected by parity checks,
packet sequence numbers, and checksums to
ensure that the data is not corrupted by bus or
component errors. All messages are protected
by checksums using a combination of hardware
and software.
One of the most important Compaq NonStop® Himalaya system data integrity features is end-to-end checksums. The Compaq NonStop® Himalaya system disk driver software creates an end-to-end checksum, consisting of a 2-byte checksum appended to a standard 512-byte disk sector before data is written to a disk. For structured data such as SQL files, an additional end-to-end checksum (called a block checksum) encodes data values, physical location of the data, and transaction information. The block checksum is included in the header of a block of data that ranges in size from 512 bytes to 4,096 bytes.
Because these checksums are added in the microprocessor, they protect against errors in all the buses and components that manage the reading and writing of the data. They protect against corrupted data values, partial writes, and misplaced or misaligned data. When the data is read from disk, the checksum is checked to ensure that the data is correct and that the location from which the data is read is correct. If either the data or location is incorrect, appropriate corrective action, such as reading from the mirror disk, is taken.
Data stored on tape, usually for backup purposes, automatically retains the disk checksums. An additional 2-byte checksum is added to the tape record header to protect against errors such as misaligned data that could occur when transferring the data from disk to tape.
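A minimal sketch of this end-to-end idea is shown below (illustrative C; the actual NonStop checksum algorithm and block-checksum layout are not reproduced here). Because the check is computed over both the sector contents and the sector number it was written to, a read that comes back corrupted or from the wrong location fails verification and the mirror disk can be used instead.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define SECTOR_BYTES 512

    /* 16-bit check over the sector data plus its intended sector number. */
    static uint16_t sector_check(const uint8_t *data, uint32_t sector_no)
    {
        uint32_t sum = sector_no;
        for (size_t i = 0; i < SECTOR_BYTES; i++)
            sum = (sum << 1 | sum >> 31) ^ data[i];   /* rotate and mix */
        return (uint16_t)(sum ^ (sum >> 16));
    }

    int main(void)
    {
        uint8_t sector[SECTOR_BYTES];
        memset(sector, 0xAB, sizeof sector);

        uint32_t written_at = 1000;
        uint16_t check = sector_check(sector, written_at); /* appended on write */

        /* On read, verify both the contents and the location. */
        uint32_t read_from = 1001;                          /* misplaced read   */
        if (sector_check(sector, read_from) != check)
            puts("checksum mismatch: read the mirror disk instead");
        return 0;
    }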
Compaq NonStop® Himalaya system communications: ServerNet technology
ServerNet technology provides the communication network among processors and peripherals in Compaq NonStop® Himalaya servers. The ServerNet protocol provides the best data integrity, error detection, and fault isolation capabilities in the industry. Specifically, ServerNet technology includes the following capabilities:
➔ All command symbols are coded so that single-bit errors create an invalid symbol rather than a different command or data symbol. Receipt of an invalid symbol triggers a check of the physical connection.
➔ Each ServerNet packet has a 32-bit cyclic redundancy check (CRC) checksum to detect data or control errors. This is much more robust than simple parity or checksums used in other protocols. (A generic CRC sketch appears at the end of this section.)
➔ Routing information and CRC are checked in every ServerNet link. The first ServerNet link that detects a bad packet will mark the packet as incorrect and generate an interrupt with the location at which the bad packet was detected. This pinpoints the location of the error.
➔ The ServerNet routing tables are protected by parity checks. In addition, the ASICs that implement ServerNet technology are self-checking components.
➔ Link-level flow control is included in the protocol to help alleviate network congestion and route around a device that is “babbling” on the network.
➔ End-to-end flow control included in the protocol helps prevent devices from injecting more packets into the network than can be handled efficiently, thus avoiding network saturation. This is accomplished by requiring a positive acknowledgment for every packet sent.
➔ If an acknowledgment is not received for a packet, the protocol automatically checks the end-to-end link and resends the packet if a transient error occurred. This check includes flushing any “stale” packets that could cause spurious errors once the link is again operating normally.
These capabilities of the ServerNet technology
prevent errors from propagating and immediately locate the source of the error. Other interconnect technologies do not provide these
features and may require extensive troubleshooting to find the cause of the problem,
which increases the repair time. ServerNet
technology is an example of the Compaq
NonStop® Himalaya system’s fail-fast design
philosophy.
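For reference, the routine below is a generic bitwise CRC-32 using the common reflected polynomial 0xEDB88320. It is shown only to illustrate how a 32-bit CRC catches multi-bit corruption that simple parity would miss; the specific polynomial and the hardware implementation ServerNet uses are not described in this paper.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    static uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int b = 0; b < 8; b++)          /* process one bit at a time */
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    int main(void)
    {
        uint8_t packet[64] = { 0 };
        const char *payload = "payload carried across a ServerNet link";
        for (size_t i = 0; payload[i] != '\0'; i++)
            packet[i] = (uint8_t)payload[i];

        uint32_t sent = crc32_bitwise(packet, sizeof packet); /* appended by sender */

        packet[10] ^= 0x0C;                                   /* two bits corrupted */
        if (crc32_bitwise(packet, sizeof packet) != sent)
            puts("CRC mismatch: packet marked bad and the error location reported");
        return 0;
    }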
Compaq NonStop® Himalaya
system software
Although it is not possible to prevent application errors, system vendors must ensure that
the system software does not inadvertently
change or overwrite correct data. There are
many data integrity checks built into the
Compaq NonStop® Kernel operating system. For
example, Compaq NonStop® Himalaya systems
require explicit access validation (source,
address, and permissions must all be correct)
for all reads and writes to memory. This prevents a processor from incorrectly overwriting a
memory location of another processor.
The Compaq NonStop® Kernel operating system
also verifies that data structures and pointers
are correct when they are used. For example, it
checks pointers on both ends of a link in a doubly linked list to ensure they reference each
other. Although other operating systems may
perform similar verification, it is not as pervasive as in the Compaq NonStop® Kernel operating system. Another protection mechanism is
that interrupts are queued rather than automatically given the highest priority. This avoids
interrupt flooding caused by a stuck interrupt
line on an I/O controller or other hardware that
could spoof the protection mechanisms.
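The doubly linked list check mentioned above is easy to picture; the small C sketch below (illustrative only, not Compaq NonStop Kernel code) verifies that the neighbors on both ends of a node’s links point back at it before the node is trusted.

    #include <stddef.h>
    #include <stdio.h>

    struct node {
        struct node *prev;
        struct node *next;
    };

    /* Return 1 if the links around 'n' are mutually consistent. */
    static int links_ok(const struct node *n)
    {
        if (n->next != NULL && n->next->prev != n)
            return 0;                    /* forward link not reciprocated  */
        if (n->prev != NULL && n->prev->next != n)
            return 0;                    /* backward link not reciprocated */
        return 1;
    }

    int main(void)
    {
        struct node a = { NULL, NULL }, b = { NULL, NULL };
        a.next = &b;
        b.prev = &a;

        b.prev = NULL;                   /* simulate a corrupted pointer */
        if (!links_ok(&a))
            puts("corrupt list detected: halt rather than follow bad links");
        return 0;
    }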
The most important function that system software plays in protecting data is proper cleanup
following errors or failures. When a computer
fails for any reason, in-flight transactions may
be partially completed, open files may be corrupted, memory values may be inaccurate, and
the database may be left in an incorrect state.
In the event of a failure, the Compaq NonStop®
SQL/MP database, Compaq NonStop®
Transaction Manager (NonStop® TM/MP) software, and Compaq NonStop® TUXEDO® transaction monitor ensure that in-flight transactions
are aborted and that the database is returned
to its last known good state, from which point
transactions can be reapplied.
Although other vendors provide similar transaction monitoring and database recovery facilities, Compaq NonStop® Himalaya systems have
some natural advantages. The first is that the
process-pair Compaq NonStop® Himalaya system architecture prevents single processor failures from causing a system failure. If a
processor fails, backup processes in other processors take over within a few seconds. Open
files remain open, and data is not corrupted.
Therefore, Compaq NonStop® Himalaya systems have less opportunity to make an error following a crash because of their inherent robust
availability features.
Another natural advantage for Compaq
NonStop® Himalaya systems is that all the software and hardware is built and tested together
by the Tandem Division of Compaq. Most other
vendors have to combine software built by several different vendors. In Compaq NonStop®
Himalaya systems, the hardware, operating system, file system, storage system, database, and
transaction monitor are all built, integrated,
and tested as a single entity by a single vendor.
There is no equivalent to Compaq NonStop®
Himalaya system data integrity in other commercial computer systems. No vendor other
than Compaq builds processors with lockstepped microprocessors, extensive parity
checking on the buses, and end-to-end checksums. It is theoretically possible for a third-party vendor to develop end-to-end checksums,
but it requires special hardware formatting of
the disks, special controller firmware, and special driver software to create and write data
checksums, data address and volume sequence
numbers, and checksums of checksums.
Compaq is the only vendor with the dedication
to data integrity necessary to provide such
features.
Conclusion: The Compaq NonStop® Himalaya
system advantage
Only Compaq NonStop® Himalaya systems demonstrate a true commitment
to data integrity. As a result, Compaq NonStop® Himalaya systems are the clear
leader, now and for the foreseeable future.
Data integrity is a key requirement for business-critical computing. Stock exchanges,
banks, telecommunications companies, and the
transaction processing applications of most
businesses cannot afford to risk the integrity of
their data. An error in a single bit can result
in a $1 million mistake or put lives at risk.
Rapidly advancing computer systems are so
complex that no one has figured out how to
design for or test for all the potential timing
problems, unexpected interactions, and nonrepeatable transient states that occur in the real
world. A truly safe computing environment can
only be achieved if data integrity is a primary
design objective and after many years of
maturing in a field environment.
Industry-standard servers are unable to focus
on data integrity to the same extent as Compaq
NonStop® Himalaya systems. Intense competitive pressures prevent high-volume servers
from taking the additional time to design and
test advanced data integrity features. The additional costs associated with such special features create additional pressure to trade off
data integrity for lower costs.
Compaq NonStop® Himalaya systems are based
on 25 years of experience with business-critical
applications and enable us to avoid unexpected
problems, either by initial design or in response
to customer problems. Compaq NonStop®
Himalaya systems are built and tested entirely
by the Tandem Division of Compaq, which
enables us to ensure that all hardware and software is designed and built with the same
strong dedication to data integrity. And, when
an error does occur, our fail-fast architecture
protects your data from contamination.
The following features of Compaq NonStop®
Himalaya systems are unmatched in the computer industry:
➔ Lockstepped microprocessors continually
crosscheck each other’s output and immediately freeze if any difference is detected.
This prevents bit flips caused by such
things as high-energy particles, power fluctuations, manufacturing imperfections,
and timing errors from being used in calculations and corrupting a database.
➔ Each hardware and software module is
self-checking and immediately halts rather
than permit an error to propagate—a
concept known as the fail-fast design philosophy. This philosophy makes it possible
to determine the source of errors and
correct them.
➔ Compaq NonStop® Himalaya systems incorporate state-of-the-art memory detection
and correction to correct single-bit errors,
detect double-bit errors, and detect “nibble” errors (three or four bits in a row).
Compaq has modified the vendor’s ECC to
include address bits, which helps avoid
reading from or writing to the wrong memory location.
➔ Compaq NonStop® Himalaya system hardware is continually checked for latent
faults. A background memory “sniffer”
checks the entire memory every few hours.
The multiple data paths provided for fault
tolerance are alternately used to ensure
correct operation.
➔ ServerNet technology provides command
symbol encoding, a 32-bit CRC checksum to
detect data or control errors, packets guaranteed to arrive in the correct order, flow
control, and flushing of “stale” data. If an
error is detected, the ServerNet protocol
pinpoints the location of that error for
corrective action.
➔ Data on all the buses is parity protected,
and parity errors cause immediate interrupts to trigger error recovery.
➔ Microprocessors in I/O controllers are protected by parity checks, packet sequence
numbers, and checksums to ensure that
the data is not corrupted by bus or component errors.
➔ All messages are protected by checksums
using a combination of hardware and
software.
➔ Disk driver software provides an end-to-end
checksum appended to a standard 512-byte
disk sector. For structured data such as SQL
files, an additional end-to-end checksum
(called a block checksum) encodes data values, physical location of the data, and transaction information. These checksums
protect against corrupted data values, partial writes, and misplaced or misaligned
data. Tape backup software retains the disk
checksums and adds an additional layer of
protection.
➔ The Compaq NonStop® Kernel operating system verifies that data structures and pointers are correct when they are used.
➔ The Compaq NonStop® Kernel operating system requires explicit access validation
(source, address, and permissions must all
be correct) for all reads and writes to memory. This prevents a processor from incorrectly overwriting a memory location of
another processor.
➔ Compaq NonStop® Himalaya systems are
built and tested entirely by the Tandem
Division of Compaq, making it much easier
to ensure that all the hardware and software is designed and built with a strong
dedication to data integrity.
There is no equivalent to the Compaq NonStop®
Himalaya server’s level of data integrity, today
or in the foreseeable future. And there is no
comparable commitment to ensuring the
integrity of your data.
References
W. Baker, R. Horst, D. Sonnier, and W. Watson,
“A Flexible ServerNet Based Fault-Tolerant
Architecture,” Proc 25th International
Symposium on Fault-Tolerant Computing,
June 1995.
J. Bartlett, et al., “Fault Tolerance in Tandem
Computer Systems,” in Reliable Computer
Systems, D. Siewiorek and R. Swarz, eds. Digital
Press, 1992 (also available as Tandem Technical
Report 90.5).
E. Hnatek, Integrated Circuit Quality and
Reliability, Marcel Dekker, Inc., 1995.
For More Information
WEBSITE: www.compaq.com
©1999 Compaq Computer Corporation. All rights reserved. April 1999. Compaq, Himalaya, NonStop, ServerNet, and Tandem registered U.S. Patent and Trademark Office. TUXEDO is a registered trademark of Novell, Inc., exclusively licensed to BEA Systems, Inc. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Technical specifications and availability are subject to change without notice.
98-0893