Data Integrity for Compaq NonStop® Himalaya Servers

Compaq NonStop® Himalaya Servers White Paper
Data integrity concepts, features, and technology

Commercial computer systems are used for applications that are critical to our health, safety, and financial security, such as emergency 911 services, healthcare, vehicle routing, and stock market transactions. In such applications, it is a fundamental expectation that computer systems will always provide the correct data. Incorrect data could result in a train wreck or could cause an emergency call to be routed to the wrong location. A change of even a single bit of data in a financial transaction can alter its value by millions of dollars or cause the transaction to be recorded in the wrong account.

Contents
Introduction
Data integrity concepts
Data corruption causes
  Electronic noise
  Physical hardware defects
  Hardware design errors
  Software design errors
Data corruption frequency
Future trends
Data corruption consequences
Data integrity features of Compaq NonStop® Himalaya systems
  Compaq NonStop® Himalaya system design philosophy: Fail-fast
  Compaq NonStop® Himalaya system processors: Lockstepped microprocessors
  Compaq NonStop® Himalaya system storage: End-to-end checksums
  Compaq NonStop® Himalaya system communications: ServerNet technology
  Compaq NonStop® Himalaya system software
Conclusion: The Compaq NonStop® Himalaya system advantage
References

Introduction

Most computer system manufacturers rely on memory error-correcting code (ECC) and disk vendor storage protection mechanisms to safeguard their data. However, there are many other computer system components, particularly microprocessors, controllers, and buses, that have the potential to corrupt data. This problem is being exacerbated by continually shrinking electronic circuit geometries as more functionality is absorbed into a single chip.
Because of the low cost and excellent performance of today’s commercial microprocessors, business-critical commercial computer systems are using the same components contained in desktop PCs. But because of the competitive nature of the PC industry, these components do not have the built-in features, such as self-checking, that are necessary to guarantee data integrity. System vendors need to use the PC-based components to remain price competitive but at the same time must find ways to safeguard their customers’ data.

Compaq NonStop® Himalaya servers incorporate the world’s best data integrity features. These features include lockstepped microprocessors (two microprocessors that execute the same instruction stream and cross-check each other), end-to-end checksums on data sent to storage devices, and complete protection of all internal buses and drivers. Stock markets and financial institutions depend on Compaq NonStop® Himalaya servers to protect their data from hardware and software faults, power glitches, and other failure mechanisms that could alter their transactions and cause potentially disastrous results. Telecommunications companies depend on Compaq NonStop® Himalaya servers to ensure that their calls are routed to the right customers. Retailers depend on Compaq NonStop® Himalaya servers to accurately process credit card transactions.

The first section of this white paper provides a brief introduction to data integrity concepts. The next section describes the underlying causes of data corruption. The third and fourth sections describe the frequency and effects of data corruption. The fifth section explains the Compaq NonStop® Himalaya systems technology that ensures data integrity. The final section summarizes the unique data integrity features and advantages of Compaq NonStop® Himalaya servers.

Data integrity concepts
All computer systems use numerical values called data to represent information. This represented information can be almost anything, including a letter on this page, a bank account value, a credit card number, a temporary calculation, or a software instruction. A computer system uses data for calculations and stores data, both permanently on a disk or tape drive and temporarily in the computer system memory. Data is transmitted among computers using networks such as the Internet. Data often changes—for example, when an account balance is updated—but computer system users always expect that the computer system will maintain the integrity of the data, meaning that the computer system will never inadvertently or incorrectly alter a numerical value. An inadvertent or incorrect change that compromises data integrity is called data corruption.

A digital computer system represents data in terms of ones and zeros. Each single 1 or 0 is called a bit, and 8 bits is typically referred to as a byte. Bits are physically encoded as a small voltage in an electronic circuit. There are many different techniques and circuits for performing this data encoding, but the basic idea is that a relatively high voltage represents a 1 and a relatively low (or nonexistent) voltage represents a 0. A change in a data value is physically recorded by a change in voltage. If the voltage is not recorded correctly or is inadvertently changed after being recorded, data corruption has occurred.

The previous paragraph describes how data is stored in computer memory, but there are many places outside the computer memory that data corruption can occur. Figure 1 shows a simplified data path between memory and permanent storage on a disk drive.
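The bit encoding just described is also the basis of the simplest corruption-detection mechanism: a parity bit. The following sketch is purely illustrative (it does not reflect any particular Compaq circuit); it shows how appending one even-parity bit catches a single flipped bit, and why a second flip can slip past the check.

```python
def parity(bits):
    """Even-parity bit: 1 if the count of 1s is odd, else 0."""
    return sum(bits) % 2

def check(word):
    """A stored word is consistent if its total count of 1s is even."""
    return sum(word) % 2 == 0

# Encode a byte with an even-parity bit appended.
byte = [1, 0, 1, 1, 0, 0, 1, 0]
stored = byte + [parity(byte)]
print(check(stored))      # True: data plus parity bit is consistent

# A soft error (e.g., a high-energy particle strike) flips one bit.
corrupted = stored.copy()
corrupted[3] ^= 1
print(check(corrupted))   # False: the single-bit flip is detected

# A second flip restores even parity, so the corruption goes unnoticed.
corrupted[5] ^= 1
print(check(corrupted))   # True: a double-bit error escapes simple parity
```

Simple parity can only detect an odd number of flipped bits; correcting errors or catching multi-bit upsets requires stronger codes such as the ECC discussed in this paper.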
A typical transaction consists of a processor retrieving data from a disk drive, performing some calculations, and writing the modified data back to the disk. If any component in the data path inadvertently changes a bit, data corruption has occurred. Such components include the processor, memory, disk, and all the controllers and buses that transport data from one location to another. Each of the blocks in Figure 1 typically consists of multiple components and contains a small amount of memory for buffers, queues, and/or code space. All of these components and memory have the potential to corrupt data.

Data corruption does not automatically imply an incorrect calculation, an incorrect value in a database, or even an operational error. The corruption may change a bit in a memory location that is not used or that is overwritten before it is used; it may cause an exception, retry, or processor halt by branching to an illegal instruction; it may change a data value that does not affect the results of the computation; or it may cause an incorrect entry in a log. It is even possible that the system detects the data corruption and corrects the incorrect value.

There are two approaches to improving data integrity. The first is to prevent data corruption from occurring. This is the province of integrated circuit manufacturers. They use special types of materials to make devices, especially dynamic random access memories (DRAMs) and static random access memories (SRAMs),1 that are less sensitive to the types of electronic noise that can cause a bit to change. This approach is described in the following section. The second approach is to detect and possibly correct the data corruption. This is the province of system vendors. Almost all computer vendors provide some form of error detection and correction on main memory, and most provide some form of parity protection on secondary cache.
However, as described later in this paper, only Compaq NonStop® Himalaya servers provide error detection and correction throughout the entire system rather than just on the memory devices.

1 SRAMs and DRAMs are memory devices used to temporarily store the data needed for microprocessor calculations and the results of the calculations. These devices improve computer performance by providing faster access to data than disk drives. SRAMs are faster but more expensive than DRAMs, so SRAMs are usually used for a small area of memory called secondary cache, whereas DRAMs are used for the main memory. In the past, SRAMs were used for a small area of memory called primary cache, but current microprocessors have absorbed the primary cache onto the microprocessor chip to improve performance.

Figure 1. Data path from memory to disk (microprocessor, memory, memory controller, output formatter, bus converter, bus controllers, SCSI controller, SCSI bus, disk drive controller, and disk drive media).

Data corruption causes

There are many possible causes of data corruption in a computer system. These causes can be grouped into the following four categories:

➔ Electronic noise: An externally caused current flow that disrupts a stored voltage, usually without causing permanent change to a device. Electronic noise is primarily caused by high-energy particles and power disturbances.

➔ Physical hardware defects: Microscopic holes or cracks, contamination, and packaging problems that alter current flow and voltage values. Age-related phenomena such as electron migration are included in this category.
➔ Hardware design errors: Logic design errors, circuit design errors, inadequate thermal design (heat sinks), incorrect device specification or utilization, and timing errors that cause bits to be misread or miscoded.

➔ Software design errors: Incorrect algorithms, software errors that overwrite good data, and error recovery software that does not correctly restore data following a failure.

High-energy particles and other electronic noise sources typically do not permanently damage an electronic device but usually cause a one-time change to a bit or multiple bits. These “single upset” events are usually called soft errors and vanish when a new data value (voltage) is written to their location. Physical hardware defects and design errors are more likely to cause permanent changes and are called hard errors. The most insidious failure modes are those permanent changes that cause intermittent errors. This can happen with a component that has marginal voltage or timing, causing a bit to be read as a 1 some of the time and as a 0 other times. These types of failure modes cause field failures that usually cannot be diagnosed when the part is returned to the vendor (resulting in “no defect/trouble found”).

Electronic noise

Electronic noise consists of current fluctuation due to internal or external sources. Because electronic circuits use current flow to record the voltage values that represent data, electronic noise has the potential to alter voltage, thus changing a data value. The most common cause of electronic noise in memory devices is high-energy particles, either alpha particles or extraterrestrial nuclear particles. Alpha particles are usually generated by radioactive decay of trace radioactivity in semiconductor packaging materials. This trace radioactivity comes from impurities in the packaging material or impurities added in the manufacturing process—for example, material deposited during a cleaning process using slightly radioactive water or acid.
Alpha particles have been a known problem for many years, and memory device vendors invest significant resources to protect against them. Besides carefully evaluating and testing packaging materials, memory vendors often add layers on top of the silicon (epitaxial layers) to absorb alpha particles and protect the memory die inside the package with a “glob” of nonradioactive, highly absorptive material. Because alpha particles have a relatively large cross-section, this “glob” absorbs most of them before they can get to the electronic circuits that store the data.

Elementary particles such as neutrons or protons originating in the sun or other stars are usually called cosmic rays. For memory devices, IBM experiments show that “under normal operations, cosmic rays are by far the predominant cause of soft errors.”2 At sea level, 95 percent of the particles are neutrons. Memory vendors have only recently become concerned with high-energy neutrons, as the transistor size and the amount of voltage used to encode a bit have shrunk to the point where neutrons can alter a significant number of bits. It is much more difficult to shield against high-energy neutrons than against alpha particles because their cross-section is much smaller. Six feet of concrete is needed to significantly reduce the flux of cosmic rays, which is not usually a reasonable requirement for the roof of a computer room.

The mechanism through which high-energy alpha particles and nuclear particles alter a stored charge is very similar. Upon entering the device, a particle creates a trail of negative charge (electrons) and a corresponding positive charge (electron “holes”). This trail causes current to flow into an electronic circuit, which can upset and change the circuit’s voltage.
Memory manufacturers have incorporated resistors and other devices into electronic circuits to prevent some of these voltage changes, but memory cells are still susceptible to particle impact. The current flow created by high-energy particles is unlikely to cause problems with other types of electronic circuits that perform calculations or transfer data. For example, the arithmetic logic unit (ALU) within a microprocessor has an active circuit that continuously supplies current, and the current flow caused by a high-energy particle is not usually sufficient to upset the voltage. Data on internal buses is also driven with sufficient voltage to avoid being upset by current flow caused by high-energy particles. The most vulnerable areas on these chips are the primary cache on microprocessors, the unprotected temporary storage registers on the microprocessor, and the small data storage areas on other integrated circuits (ICs).

The other common source of electronic noise is power transients. A typical computer site sees 443 power disturbances such as sags and surges per year, according to a National Power Laboratory study.3 Most of these events have a very short duration, which can cause a “glitch,” or momentary change of current supplied to a circuit. The current flow caused by a glitch can disrupt stored data. These kinds of problems are well known, and hardware designers work diligently to protect electronic circuits against such external disturbances.

2 From T. J. O’Gorman, et al., “Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories,” IBM Journal of Research and Development, January 1996, pp. 41–49.

3 From D. S. Dorr, “National Power Laboratory Power Quality Study Initial Results,” Proc. Applied Power Electronics Conference, February 1992 (a 1990–1995 National Power Laboratory study of 235 computer sites in the United States).
Physical hardware defects

An integrated circuit is a complex combination of many materials, including metals, alloys, ceramics, and polymers. The thermal, chemical, mechanical, structural, and electrical characteristics of the various materials must be carefully balanced. It is easy for imperfections in the interactions or interfaces of these materials to occur, which may cause an integrated circuit to fail. There are thousands of places to introduce imperfections in integrated circuit design, wafer fabrication, assembly, handling, and testing. Contamination has long been recognized as a potential problem, leading to heavy investment in “clean rooms.” Some examples of manufacturing-induced defects include

➔ Improper film thickness
➔ Trapped moisture (corrosion)
➔ Small cracks/voids
➔ Residual cleaning chemicals
➔ Dust
➔ Open wire bonds
➔ Thermal stress
➔ Metal bridges

Physical hardware defects may reveal themselves during manufacturing testing or may not become evident until some time after being shipped in a product. For example, a small crack may grow over a period of operation and/or temperature variation. At some point, the crack may prevent sufficient current flow or may cause other operational errors. This may lead to an obvious failure of the device or a less obvious incorrect operation, which could cause data corruption. The size of the crack may shrink and grow, depending on stress and temperature, leading to intermittently correct and incorrect results. These types of defects are very hard to diagnose.

Hardware design errors

Design errors in a manufacturing process can cause the manufacturing imperfections just described. However, there can also be design errors in specifying the function, timing, and interface characteristics of a device or in the logic and circuit design.

In addition to being stored as a voltage, data and control signals are read as a voltage.
If a signal voltage is above a certain threshold, the data or control bit is read as a 1, whereas if it is below the threshold it is read as a 0. When changing a bit from a 1 to a 0 or vice versa, there is a transition period to allow the voltage to change. Because each individual device will have slightly different signal delay (impedance) and timing characteristics, the length of that transition period will vary. The final voltage value attained will also vary slightly as a function of the device characteristics and the operating environment (temperature, humidity). Computer hardware engineers allow a certain period of time (called design margin) for the transition period to complete and the voltage value to settle. If there are timing errors or insufficient design margins that cause the voltage to be read at the wrong time, the voltage value may be read incorrectly, and a bit may be misinterpreted, causing data corruption. Note that this corruption can occur anywhere in the system and can cause incorrect data to be written to disk even when there are no errors in computer memory or in the calculations.

Hardware design errors can also cause electronic noise to be generated. For example, if two signal lines are too close together, they may interfere with each other under certain conditions (a phenomenon known as cross-talk) and change a data value on one of the lines.

Software design errors

Software errors that affect data integrity include errors in calculations, errors that alter correctly stored data, and incorrect restoration of data corrupted by a failure. If the algorithm used to compute a value is incorrect, not much can be done outside of good software engineering practices to avoid such mistakes. However, many applications have checks to protect against incorrect calculations. System vendors must ensure that the software beneath the application layer does not corrupt data.
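The application-level checks mentioned above are often simple range and invariant tests wrapped around a calculation. A minimal sketch, assuming a hypothetical account-transfer routine (the account names and the balance-conservation invariant are invented for illustration):

```python
def transfer(accounts, src, dst, amount):
    """Move funds between accounts, with checks that catch a corrupted
    amount (range check) or a partially applied update (invariant check)."""
    if not 0 < amount <= accounts[src]:
        raise ValueError("amount out of range")
    total_before = sum(accounts.values())
    accounts[src] -= amount
    accounts[dst] += amount
    # Invariant: a transfer must conserve the total balance.
    if sum(accounts.values()) != total_before:
        raise RuntimeError("balance invariant violated")
    return accounts

accounts = {"savings": 5000, "checking": 1200}
transfer(accounts, "savings", "checking", 1000)
print(accounts)   # {'savings': 4000, 'checking': 2200}
```

Checks like these cannot repair corrupted data, but they turn a silent corruption into a detected error, which is the same principle this paper describes at the hardware level.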
A processor may attempt to write to the wrong location in memory, which may then overwrite and corrupt a value. In this case, it is possible to avoid data corruption by not allowing the processor to write to a location that has not been specifically allocated for the value it is attempting to write.

Following a processor halt, disk crash, or other failure, a computer system almost certainly has corrupt data residing in memory and disk storage. Transactions will have been partially completed, leaving the database in an incorrect state. For example, in a transfer of $1,000 from a savings account to a checking account, the savings account may have been decremented by $1,000, but the checking account may not have been incremented by $1,000. The computer memory and open files may have partial changes or other corrupt data. Error recovery software must correctly restore the database, files, and memory to avoid data corruption caused by partial calculations or incomplete transactions.

Data corruption frequency

Microprocessor failure rates are based on a number of factors: complexity, technology, packaging, manufacturing process, and operating environment. There are handbooks that are widely used for estimating hard failure rates. The most applicable handbook for a commercial environment is the Bellcore handbook.4 This handbook measures failure rates in terms of failures per billion operating hours, called FITs. Microprocessors differ, but a reasonable estimate for the failure rate of current complex microprocessors is 1,000 FITs. Some hard failures, such as an output pin short, can be easily detected and immediately cause the processor to stop functioning. Other hard failures, such as a gate that is stuck at 0 or 1, can be more subtle and require detection by hardware or software mechanisms.
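A FIT rate converts directly into expected failures per installed base per year. A quick worked sketch using the estimates quoted in this paper (the function name is ours; the constants come from the text):

```python
HOURS_PER_YEAR = 8760

def fits_to_failures_per_year(fits, units=1000):
    """Convert a FIT rate (failures per 10**9 device-hours) into the
    expected number of failures per `units` devices per year."""
    return fits * 1e-9 * HOURS_PER_YEAR * units

# 2,700 FITs = 500 hard + 2,000 transient/soft logic + 200 primary cache.
rate = fits_to_failures_per_year(2700)
print(round(rate, 1))          # 23.7, i.e., about 24 per 1,000 per year

# Scaled by the 12.4% undetected-corruption fraction from the
# fault-injection study cited in this paper:
print(round(rate * 0.124, 1))  # 2.9, i.e., about 3 undetected corruptions
```

The wide error bounds the paper notes apply here too: with FIT estimates anywhere from 1,000 to 10,000, the annual figure scales proportionally.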
From vendor and other data, it is estimated that half of the failures are obvious and half are subtle, so 500 FITs is the appropriate failure rate to use for evaluating potential data corruption.

Unfortunately, there are no handbooks for evaluating transient and soft error rates. Using vendor data, it is estimated that the combined transient/soft error rate is 4,000 FITs, with 2,000 FITs for the logic and 2,000 FITs for the primary cache on the chip. For the primary cache, it is estimated that 90 percent of the errors are caught by on-chip ECC (which is usually just simple parity), so 200 FITs is the appropriate primary cache failure rate to use for evaluating potential data corruption.

The total failure rate used for evaluating potential data corruption is therefore 2,700 FITs: 500 due to hard failures, 2,000 due to transient/soft logic errors, and 200 due to primary cache errors. A failure rate of 2,700 FITs corresponds to about 24 failures per 1,000 microprocessors per year (2,700 × 10⁻⁹ failures per hour × 8,760 hours per year × 1,000 microprocessors ≈ 24). Note that this is an estimated average for current microprocessors, and failure rates could vary significantly. A reasonable range is probably 1,000 to 10,000 FITs.

Because data corruption locations are random, it is not possible to predict data corruption effects. The failure rate just calculated describes the potential for data corruption, but it is difficult to determine whether a transient error in the system will become corrupted data. It depends on the system and application, the operating environment, and random events.

4 “Reliability Prediction Procedure for Electronic Equipment,” Bellcore Technical Reference TR-332, Issue 6, December 1997.

Many data corruptions in a microprocessor are benign.
For example, an error in a location in primary or secondary cache could occur within instructions or data that are never used (such as an error in a conditional branch instruction that is not exercised) or be overwritten before it can be used. Others could cause errors that are detected by the operating system. These include mechanisms to detect illegal instructions, branches to nonexistent locations, bus errors, bad system calls, arithmetic exceptions (overflow), improper values, and illegal characters. These types of errors may cause a retry mechanism to be invoked and allow the system to recover, or may cause a processor to halt or a system process to abend. An error could also cause an endless loop or a branch to a location that does not return, both of which lead to a processor hang or a time-out.

If the hardware or operating system does not detect the error, the application may have built-in checks that allow it to detect data corruption. For example, it may check to ensure that current results are compatible with previously stored results or data. Some variables may have range checks. The application may also have additional checks similar to the end-to-end checksums that Compaq NonStop® Himalaya systems provide (described in the section “Data integrity features of Compaq NonStop® Himalaya systems”).

There have been various studies to try to estimate the probability that an error will become corrupted data. The following table shows the results from one such study.5 In this study, various errors were injected into a workstation while running a matrix multiplication application. More than 600,000 cases were run and compared to the known correct results. Because the purpose of the study was to evaluate error detection methods, features were added to the operating system to improve its robustness, and a checksum method was evaluated for error detection.
Error detection percentages

Result of error                      Percent of cases   Notes
Detected by system or application    46.4%              43.4% detected by the system, 3% by the application
Undetected, benign error             41.2%
Data corruption                      12.4%              5.5% detected by checksum

Nearly half of the errors were detected by the system or application through built-in error detection mechanisms, such as traps for illegal instructions, arithmetic exceptions, and incorrect system calls. These detected errors might cause a software retry, a log entry, an error returned to the user, or a processor halt, depending on how they manifest themselves. A little more than 40 percent of the errors were undetected but had no effect: for example, values that were overwritten before they could be used or errors in the addresses of branches that were not taken. This still left 12.4 percent of errors that were undetected by the system or application but caused data corruption. The checksum technique being evaluated in the study found 5.5 percent of the data corruptions, but even with this special application error detection mechanism, 6.9 percent of the errors were still completely undetected.

To estimate the number of undetected data corruptions, the potential data corruption rate of 24 per 1,000 microprocessors per year is combined with the estimate (see the “Error detection percentages” table) that 12.4 percent of these failures will cause undetected data corruption. This yields a rate of about 3 undetected data corruptions per 1,000 microprocessors per year.

5 From G. Kanawati, N. Kanawati, and J. Abraham, “FERRARI: A Tool for the Evaluation of System Dependability Properties,” Proc. 22nd International Symposium on Fault-Tolerant Computing, June 1992.

A similar model is described in a paper about data corruption.6 The paper concludes that an undetected data corruption can occur about once per year per 1,000 installed microprocessors.
The paper also concludes that there are wide error bounds on such an estimate because of the lack of data on transient errors and error propagation. This white paper and the paper on data corruption both provide similar estimates of data corruption frequency, and both note that there are wide error bounds on this estimate. In some sense, the exact data corruption frequency is unimportant. For business-critical computing, the only acceptable number of data corruptions is zero.

6 From R. Horst, D. Jewett, and D. Lenoski, “The Risk of Data Corruption in Microprocessor-based Systems,” Proc. 23rd International Symposium on Fault-Tolerant Computing, June 1993.

Future trends

The semiconductor industry attempts to maintain Moore’s law, meaning that microprocessor performance doubles every 18 months. Thus, the trend is toward increased functionality for each square centimeter of silicon, which is achieved by having smaller devices (gates, transistors, memory cells) embedded in the silicon and packing them tighter together. This trend implies

➔ Smaller component feature sizes that require less energy to be disturbed
➔ Increased component density in a chip, meaning that high-energy particles are more likely to collide with a component in the chip and cause a disruption
➔ Reduced amounts of voltage used to perform calculations and store data, meaning that electronic noise immunity and transistor breakdown voltage are reduced

These changes mean that, in the future, devices will become even more susceptible to data corruption caused by high-energy particles, and the protection that Compaq NonStop® Himalaya servers provide against data corruption will become even more critical.

Data corruption consequences
There are more than 100 million PCs and other computers in the United States alone, meaning that the preceding models predict hundreds of data corruptions every day. So why don’t we hear more about data corruption in the news media? First of all, companies are not going to be very forthcoming if they determine that there was a data corruption in their systems that caused an error; but, even more likely, companies are simply not aware that there was a data corruption. Most people assume that an operator error or software defect caused the incorrect data, which is often true. Because we know how fallible humans are and do not normally think about the potential effects of high-energy nuclear particles, human error is assumed to be the cause of the problem. Even if there is suspicion about the computer itself, most commercial computers have no error indications or logs that could be used to track the source of the data corruption. There is probably no way to ever determine what really caused a bit to flip. Have you ever had to reboot your PC? Maybe the real problem was data corruption rather than the application software you probably grumbled about.

If the stored data is in an airline reservation system, does it just garble the name of a city, or does it change a flight number or date, causing a passenger’s reservation to be wrong? If the stored data is in medical records, does it simply garble the name of a patient, or does it cause the wrong drug or dosage to be given to a patient? And if the stored data is in motor vehicle records, will the wrong person receive a traffic ticket?

Assume that someone is calling 911 when a transient error occurs. Using data from the “Error detection percentages” table, the results of the emergency call are shown in figure 2.

For a business-critical system, the challenge is trying to assess the impact of the corrupted data value.
If it is a monetary value, is the changed value off by $1, $1,000, $1 million, or more? If the stored data is in e-mail, does it simply change a character in a message (a great new excuse for typos), or does it send a confidential e-mail to the wrong location (your Internet service provider’s worst nightmare)?

Figure 2. Data corruption effects on a 911 emergency call. (Bar chart of outcomes when a transient error occurs during a 911 call: reaching 911 = undetected, benign error; delay or no answer = detected error; reaching Ernie’s Pizza = data corruption.)

The effect of a data corruption doesn’t necessarily stop with a single wrong number. Corrupted data is usually stored in the database and has the potential to do more damage. If the data is used in more calculations, those future calculations could then be incorrect.

Data corruption can also detract from system availability. If the data in a database is corrupted, the operators may have to shut down the system and run extensive checks and repairs, perhaps restoring the database from tape before a business can afford to continue. A second kind of database corruption affects the linkages among the various tables. This causes queries to fail because tables are incorrectly linked and pointers reference the wrong locations in physical memory. Operators have to manually restore these linkages—a process that can take days.

When data is corrupted in a computer system, almost anything can happen. The results depend on the corruption location, the timing of the corruption, the contents of memory, the application and execution environment, and the protection mechanisms built into the system and application. Sometimes you get lucky, and the corruption goes unnoticed or is detected and corrected. Other times, you don’t get so lucky, and $1 billion ends up in the wrong place.
Data integrity features of Compaq NonStop® Himalaya systems

There is no equivalent to Compaq NonStop® Himalaya system data integrity in other computer systems. Compaq is the only vendor that builds processors with lockstepped microprocessors, extensive parity checking on the buses, and end-to-end checksums. As described in an earlier section, there are many potential causes of data corruption. Although the entire computer is vulnerable to data corruption, most vendors provide only a limited form of data integrity checking, for example, error-correcting code (ECC) on memory and disks. Because data integrity is a required component for business-critical computing, Compaq NonStop® Himalaya systems provide industry-leading data integrity.

Compaq NonStop® Himalaya system design philosophy: Fail-fast

Because Compaq NonStop® Himalaya servers are fault tolerant, it might be expected that all their hardware modules are fault tolerant. On the contrary, Compaq deliberately wants hardware modules to be fault intolerant. That is, Compaq wants any incorrectly functioning hardware module to detect the problem and shut itself down as quickly as possible. This concept is called fail-fast, and it prevents hardware errors from leading to data corruption. The fail-fast design philosophy means that each hardware and software module is self-checking and immediately ceases operation rather than permit errors to propagate. Although a module may attempt to recover from a fault (for example, ECC on memory), it will immediately halt if there is any possibility that data corruption will result from continued operation. This helps avoid data corruption and system outages caused by a single propagated hardware or software fault.
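The fail-fast idea can be sketched in a few lines of Python. This is an illustrative toy, not Compaq code, and all names in it are hypothetical: a module validates its own result and halts (here, by raising) the instant an inconsistency appears, rather than letting a suspect value propagate into storage.

```python
class FailFastError(Exception):
    """Raised the moment a module detects an internal inconsistency."""

def checked_debit(balance: int, amount: int) -> int:
    new_balance = balance - amount
    # Self-check before the result is allowed to propagate: the arithmetic
    # must be internally consistent. A fail-fast module halts right here
    # instead of storing a possibly corrupt balance.
    if new_balance + amount != balance:
        raise FailFastError("self-check failed; halting before corruption spreads")
    return new_balance

print(checked_debit(1000, 250))  # → 750
```

The point of the sketch is the placement of the check: the module verifies itself at the moment it produces output, so a fault is caught where it occurs rather than discovered later, far from its source.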
The lockstepped microprocessors described in the next section are an excellent example of the fail-fast philosophy.

This fail-fast design philosophy is important both for data integrity and for serviceability. If either a hardware or software error is allowed to propagate, it is often difficult to determine where the failure occurred because the evidence is either gone or obscured. It is then very difficult to find and fix the root cause of the problem. This leads to repeated failures because service personnel might replace the wrong hardware unit or be unable to find and repair a software defect. It also leads to the inefficient service technique of “shotgunning,” or replacing one module after another in an attempt to fix the problem.

When a server running the Compaq NonStop® Kernel operating system halts, it saves its state. This allows customers or service personnel to dump the processor before reloading it and send the dump to Compaq’s failure analysis group. This information and the fail-fast philosophy allow hardware and software developers to do a better job of analyzing problems. It may be the only way to determine the root cause of a transient error such as a timing problem. While Compaq is able to analyze problems and help prevent their recurrence, most vendors can only reboot and hope the problem does not occur again.

Compaq NonStop® Himalaya system processors: Lockstepped microprocessors

Modern microprocessors, with their internal state machines, registers, data paths, and onboard cache, have the potential to flip bits or otherwise corrupt data. Without a mechanism to check the integrity of the data, these errors can propagate and corrupt a database. Commercial microprocessors generally do not incorporate internal parity on registers and data paths, and they do not include logic to check state machines.
The reason stems from a belief that internal self-checking hurts cost, performance, or time to market.

Compaq NonStop® Himalaya system processors contain two microprocessor chips. These microprocessors are lockstepped; that is, they run exactly the same instruction stream. The output from the two microprocessors is compared, and if it ever differs, the processor output is frozen within a few nanoseconds so that the corrupted data cannot propagate. The Compaq NonStop® Himalaya server’s lockstepped microprocessor architecture is shown in Figure 3. An incoming request (a Compaq ServerNet technology packet) is sent to both interface application-specific integrated circuits (ASICs), which translate it and forward it to the microprocessors. Each microprocessor services the request simultaneously, with simultaneous access to secondary cache and memory (using controllers and buses that are not shown). The output response from the microprocessors is compared by the interface ASICs. If there is even a single-bit difference, the microprocessor outputs are immediately frozen to prevent a corrupt ServerNet packet from being transmitted.

It is important that data integrity is protected by a hardware error freeze rather than a software halt, as occurs in other computer designs. A software halt requires milliseconds of latency, which, in some cases, is enough time to allow the corrupt data to be output from the microprocessor and propagate throughout the system. A hardware error freeze is guaranteed to avoid the latency problem.
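As an illustration only, a software analogue of the lockstep-and-compare scheme might look like the following. The two replicas stand in for the two microprocessors, and the comparator stands in for the interface ASICs; all names are hypothetical, and real lockstepping happens in hardware on every clock cycle, not per request.

```python
class LockstepMismatch(Exception):
    """Signals that the two replicas diverged; output must be frozen."""

def lockstep_execute(replica_a, replica_b, request):
    """Run the same request on both replicas and compare the outputs."""
    out_a = replica_a(request)
    out_b = replica_b(request)
    if out_a != out_b:                      # even a single-bit difference
        raise LockstepMismatch("replica outputs differ; freezing output")
    return out_a                            # identical, so safe to emit

healthy = lambda x: x * 2
glitchy = lambda x: (x * 2) ^ 1             # simulated single-bit flip

print(lockstep_execute(healthy, healthy, 21))   # both agree: prints 42
try:
    lockstep_execute(healthy, glitchy, 21)
except LockstepMismatch:
    print("mismatch detected; output frozen")
```

The design choice worth noting is that the comparator does not try to decide which replica is right; any disagreement at all stops the output, which is exactly the fail-fast behavior described above.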
Compaq NonStop® Himalaya systems and other computers use state-of-the-art memory error detection and correction to correct single-bit errors, detect double-bit errors, and detect “nibble” errors (three or four bits in a row, which could be caused by the complete failure of a single DRAM). However, Compaq NonStop® Himalaya systems go beyond this protection by modifying the vendor’s ECC to include both address and data bits in the ECC code. This helps avoid reading from or writing to the wrong memory location. In addition, Compaq NonStop® Himalaya systems provide a memory “sniffer” that runs in the background and tests the entire memory every few hours. This sniffer prevents latent faults in seldom-used areas of memory from accumulating and causing an undetectable error.

[Figure 3. Compaq NonStop® Himalaya server’s lockstepped microprocessor architecture: two microprocessors, each with its own secondary cache and access to memory, feed two interface ASICs that cross-check their outputs before anything reaches the ServerNet fabric. An application-specific integrated circuit (ASIC) is a device that usually contains several hundred thousand gates, performs a wide variety of functions specific to the design, and significantly reduces the number of discrete components on a circuit board.]

Compaq NonStop® Himalaya system storage: End-to-end checksums

After the data leaves the microprocessors, it is subject to glitches on the buses or defects in other components. In Compaq NonStop® Himalaya systems, data on all the buses is parity protected, and parity errors cause immediate interrupts to trigger error recovery or, if necessary, a processor halt. The microprocessors in I/O controllers are protected by parity checks, packet sequence numbers, and checksums to ensure that the data is not corrupted by bus or component errors. All messages are protected by checksums using a combination of hardware and software.

One of the most important Compaq NonStop® Himalaya system data integrity features is end-to-end checksums. The Compaq NonStop® Himalaya system disk driver software creates an end-to-end checksum, consisting of a 2-byte checksum appended to a standard 512-byte disk sector before data is written to a disk. For structured data such as SQL files, an additional end-to-end checksum (called a block checksum) encodes data values, the physical location of the data, and transaction information. The block checksum is included in the header of a block of data that ranges in size from 512 bytes to 4,096 bytes. Because these checksums are added in the microprocessor, they protect against errors in all the buses and components that manage the reading and writing of the data. They protect against corrupted data values, partial writes, and misplaced or misaligned data. When the data is read from disk, the checksum is checked to ensure that the data is correct and that the location from which the data is read is correct. If either the data or the location is incorrect, appropriate corrective action, such as reading from the mirror disk, is taken.

Data stored on tape, usually for backup purposes, automatically retains the disk checksums. An additional 2-byte checksum is added to the tape record header to protect against errors such as misaligned data that could occur when transferring the data from disk to tape.

Compaq NonStop® Himalaya system communications: ServerNet technology

ServerNet technology provides the communication network among processors and peripherals in Compaq NonStop® Himalaya servers. The ServerNet protocol provides the best data integrity, error detection, and fault isolation capabilities in the industry. Specifically, ServerNet technology includes the following capabilities:

➔ All command symbols are coded so that single-bit errors create an invalid symbol rather than a different command or data symbol. Receipt of an invalid symbol triggers a check of the physical connection.

➔ Each ServerNet packet carries a 32-bit cyclic redundancy check (CRC) to detect data or control errors. This is much more robust than the simple parity or checksums used in other protocols.

➔ Routing information and the CRC are checked at every ServerNet link. The first ServerNet link that detects a bad packet marks the packet as incorrect and generates an interrupt recording the location at which the bad packet was detected. This pinpoints the location of the error.

➔ The ServerNet routing tables are protected by parity checks. In addition, the ASICs that implement ServerNet technology are self-checking components.

➔ Link-level flow control is included in the protocol to help alleviate network congestion and route around a device that is “babbling” on the network.

➔ End-to-end flow control included in the protocol helps prevent devices from injecting more packets into the network than can be handled efficiently, thus avoiding network saturation. This is accomplished by requiring a positive acknowledgment for every packet sent.

➔ If an acknowledgment is not received for a packet, the protocol automatically checks the end-to-end link and resends the packet if a transient error occurred. This check includes flushing any “stale” packets that could cause spurious errors once the link is again operating normally.

These capabilities of the ServerNet technology prevent errors from propagating and immediately locate the source of an error. Other interconnect technologies do not provide these features and may require extensive troubleshooting to find the cause of a problem, which increases the repair time. ServerNet technology is an example of the Compaq NonStop® Himalaya system’s fail-fast design philosophy.

Compaq NonStop® Himalaya system software

Although it is not possible to prevent application errors, system vendors must ensure that the system software does not inadvertently change or overwrite correct data.
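Before turning to the software checks, the end-to-end disk checksum described above can be sketched in a few lines. This is a minimal illustration under assumed details: CRC-32 truncated to 2 bytes and a little-endian layout are choices made for the sketch, not the actual NonStop on-disk format. The key idea, covering the sector address as well as the data, matches the description above: a sector written to or read from the wrong location fails verification just as surely as one with flipped bits.

```python
import zlib

SECTOR_SIZE = 512

def sector_checksum(data: bytes, sector_address: int) -> int:
    # Cover both the data and its address; the 2-byte width follows the
    # paper, while the CRC and address encoding are assumptions.
    return zlib.crc32(data + sector_address.to_bytes(8, "little")) & 0xFFFF

def write_sector(data: bytes, sector_address: int) -> bytes:
    assert len(data) == SECTOR_SIZE
    return data + sector_checksum(data, sector_address).to_bytes(2, "little")

def read_sector(raw: bytes, sector_address: int) -> bytes:
    data = raw[:SECTOR_SIZE]
    stored = int.from_bytes(raw[SECTOR_SIZE:], "little")
    if stored != sector_checksum(data, sector_address):
        # Corrective action in the real system: e.g., read the mirror disk.
        raise IOError("end-to-end checksum mismatch")
    return data

raw = write_sector(b"\x42" * SECTOR_SIZE, sector_address=7)
assert read_sector(raw, sector_address=7) == b"\x42" * SECTOR_SIZE
try:
    read_sector(raw, sector_address=8)      # right bytes, wrong location
except IOError:
    print("misplaced read detected")
```

Because the checksum is created and verified at the endpoints (in the processor), every bus, controller, and firmware layer in between is covered by the same check.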
There are many data integrity checks built into the Compaq NonStop® Kernel operating system. For example, Compaq NonStop® Himalaya systems require explicit access validation (source, address, and permissions must all be correct) for all reads and writes to memory. This prevents a processor from incorrectly overwriting a memory location that belongs to another processor. The Compaq NonStop® Kernel operating system also verifies that data structures and pointers are correct when they are used. For example, it checks the pointers on both ends of a link in a doubly linked list to ensure that they reference each other. Although other operating systems may perform similar verification, it is not as pervasive as in the Compaq NonStop® Kernel operating system. Another protection mechanism is that interrupts are queued rather than automatically given the highest priority. This avoids interrupt flooding caused by a stuck interrupt line on an I/O controller or other hardware that could spoof the protection mechanisms.

The most important role that system software plays in protecting data is proper cleanup following errors or failures. When a computer fails for any reason, in-flight transactions may be partially completed, open files may be corrupted, memory values may be inaccurate, and the database may be left in an incorrect state. In the event of a failure, the Compaq NonStop® SQL/MP database, Compaq NonStop® Transaction Manager (NonStop® TM/MP) software, and Compaq NonStop® TUXEDO® transaction monitor ensure that in-flight transactions are aborted and that the database is returned to its last known good state, from which point transactions can be reapplied.

Although other vendors provide similar transaction monitoring and database recovery facilities, Compaq NonStop® Himalaya systems have some natural advantages. The first is that the process-pair Compaq NonStop® Himalaya system architecture prevents single processor failures from causing a system failure.
If a processor fails, backup processes in other processors take over within a few seconds. Open files remain open, and data is not corrupted. Therefore, Compaq NonStop® Himalaya systems have less opportunity to make an error following a crash because of their inherently robust availability features. Another natural advantage for Compaq NonStop® Himalaya systems is that all the software and hardware is built and tested together by the Tandem Division of Compaq. Most other vendors have to combine software built by several different vendors. In Compaq NonStop® Himalaya systems, the hardware, operating system, file system, storage system, database, and transaction monitor are all built, integrated, and tested as a single entity by a single vendor.

There is no equivalent to Compaq NonStop® Himalaya system data integrity in other commercial computer systems. No vendor other than Compaq builds processors with lockstepped microprocessors, extensive parity checking on the buses, and end-to-end checksums. It is theoretically possible for a third-party vendor to develop end-to-end checksums, but it requires special hardware formatting of the disks, special controller firmware, and special driver software to create and write data checksums, data address and volume sequence numbers, and checksums of checksums. Compaq is the only vendor with the dedication to data integrity necessary to provide such features.

Conclusion: The Compaq NonStop® Himalaya system advantage

Only Compaq NonStop® Himalaya systems demonstrate a true commitment to data integrity. As a result, Compaq NonStop® Himalaya systems are the clear leader, now and for the foreseeable future. Data integrity is a key requirement for business-critical computing. Stock exchanges, banks, telecommunications companies, and the transaction processing applications of most businesses cannot afford to risk the integrity of their data.
An error in a single bit can result in a $1 million mistake or put lives at risk. Rapidly advancing computer systems are so complex that no one has figured out how to design or test for all the potential timing problems, unexpected interactions, and nonrepeatable transient states that occur in the real world. A truly safe computing environment can be achieved only if data integrity is a primary design objective, and only after many years of maturing in a field environment.

Industry-standard servers are unable to focus on data integrity to the same extent as Compaq NonStop® Himalaya systems. Intense competitive pressures prevent high-volume servers from taking the additional time needed to design and test advanced data integrity features. The additional costs associated with such special features create further pressure to trade off data integrity for lower costs.

Compaq NonStop® Himalaya systems are based on 25 years of experience with business-critical applications, which enables us to avoid unexpected problems, either by initial design or in response to customer problems. Compaq NonStop® Himalaya systems are built and tested entirely by the Tandem Division of Compaq, which enables us to ensure that all hardware and software is designed and built with the same strong dedication to data integrity. And, when an error does occur, our fail-fast architecture protects your data from contamination.

The following features of Compaq NonStop® Himalaya systems are unmatched in the computer industry:

➔ Lockstepped microprocessors continually cross-check each other’s output and immediately freeze if any difference is detected. This prevents bit flips caused by such things as high-energy particles, power fluctuations, manufacturing imperfections, and timing errors from being used in calculations and corrupting a database.

➔ Each hardware and software module is self-checking and immediately halts rather than permit an error to propagate, a concept known as the fail-fast design philosophy.
This philosophy makes it possible to determine the source of errors and correct them.

➔ Compaq NonStop® Himalaya systems incorporate state-of-the-art memory error detection and correction to correct single-bit errors, detect double-bit errors, and detect “nibble” errors (three or four bits in a row). Compaq has modified the vendor’s ECC to include address bits, which helps avoid reading from or writing to the wrong memory location.

➔ Compaq NonStop® Himalaya system hardware is continually checked for latent faults. A background memory “sniffer” checks the entire memory every few hours. The multiple data paths provided for fault tolerance are alternately used to ensure correct operation.

➔ ServerNet technology provides command symbol encoding, a 32-bit CRC to detect data or control errors, packets guaranteed to arrive in the correct order, flow control, and flushing of “stale” data. If an error is detected, the ServerNet protocol pinpoints the location of that error for corrective action.

➔ Data on all the buses is parity protected, and parity errors cause immediate interrupts to trigger error recovery.

➔ Microprocessors in I/O controllers are protected by parity checks, packet sequence numbers, and checksums to ensure that the data is not corrupted by bus or component errors.

➔ All messages are protected by checksums using a combination of hardware and software.

➔ Disk driver software provides an end-to-end checksum appended to a standard 512-byte disk sector. For structured data such as SQL files, an additional end-to-end checksum (called a block checksum) encodes data values, the physical location of the data, and transaction information. These checksums protect against corrupted data values, partial writes, and misplaced or misaligned data. Tape backup software retains the disk checksums and adds an additional layer of protection.
➔ The Compaq NonStop® Kernel operating system verifies that data structures and pointers are correct when they are used.

➔ The Compaq NonStop® Kernel operating system requires explicit access validation (source, address, and permissions must all be correct) for all reads and writes to memory. This prevents a processor from incorrectly overwriting a memory location that belongs to another processor.

➔ Compaq NonStop® Himalaya systems are built and tested entirely by the Tandem Division of Compaq, making it much easier to ensure that all the hardware and software is designed and built with a strong dedication to data integrity.

There is no equivalent to the Compaq NonStop® Himalaya server’s level of data integrity, today or in the foreseeable future. And there is no comparable commitment to ensuring the integrity of your data.

References

W. Baker, R. Horst, D. Sonnier, and W. Watson, “A Flexible ServerNet-Based Fault-Tolerant Architecture,” Proceedings of the 25th International Symposium on Fault-Tolerant Computing, June 1995.

J. Bartlett, et al., “Fault Tolerance in Tandem Computer Systems,” in Reliable Computer Systems, D. Siewiorek and R. Swarz, eds., Digital Press, 1992 (also available as Tandem Technical Report 90.5).

E. Hnatek, Integrated Circuit Quality and Reliability, Marcel Dekker, Inc., 1995.

For More Information

WEBSITE: www.compaq.com

©1999 Compaq Computer Corporation. All rights reserved. April 1999. Compaq, Himalaya, NonStop, ServerNet, and Tandem are registered in the U.S. Patent and Trademark Office. TUXEDO is a registered trademark of Novell, Inc., exclusively licensed to BEA Systems, Inc. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Technical specifications and availability are subject to change without notice. 98-0893