SiCortex Technical Summary

Matt Reilly, Lawrence C. Stewart, Judson Leonard, and David Gingold
December 2006

The SiCortex family of Linux® cluster systems takes High Performance Technical Computing (HPTC) a step beyond conventional clusters. SiCortex concentrates on power-efficient design and simultaneous tuning of silicon, microcode, and system software to deliver outstanding application performance per dollar, per watt, and per square foot. The Company's initial product offering includes:

•The SC5832, a 5.8 Teraflop system with up to 8 Terabytes of memory. The SC5832 fits into a single cabinet and draws 18 KW.
•The SC648, a 648 Gigaflop system with up to 864 Gigabytes of memory. Two SC648 systems fit in a single 19" rack with room to spare. A single SC648 system draws 2 KW.

Abstract

This paper describes the hardware and software in the SiCortex systems, and discusses the ideas that motivated their design.

Introduction

SiCortex is introducing a range of Linux-based cluster computer systems optimized to deliver outstanding application performance per dollar, per watt, and per square foot.

In recent years, there has been phenomenal growth in the use of clusters for high performance technical computing (HPTC). Modern clusters are typically built from uniprocessor or small SMP nodes, connected by networks ranging from Gigabit Ethernet to InfiniBand®. Software for these clusters typically runs the Linux operating system and uses the Message Passing Interface (MPI) for communication. Systems suitable for running cluster applications are defined not by an instruction set or a communications technology, but by these software standards: Linux and MPI.

Why A New Cluster System?

It's easy to see why clusters dominate the HPTC market: they are cheaper per unit of peak computation than the shared-memory large server alternative. But users of current cluster systems encounter a number of unpleasant realities:

•Commodity-based clusters seldom deliver more than a small fraction of their peak compute rate, because real HPTC applications spend most of their time waiting for data from memory.
•Applications on commodity-based clusters have not scaled well to large numbers of processors. As long as interprocessor communication is viewed as an I/O function, message operations will take more time than they should.
•Commodity-based clusters are unreliable. While a single node in a cluster might offer a mean time to crash of a year or more, this is inadequate when systems are built from hundreds or thousands of nodes.
•Commodity-based clusters use too much power.

We set out to address each of these issues, and a few more.

The Time-to-Solution Model

We approached the design of our systems by looking at a range of HPTC applications and determining why they take so long to run. Our model of the time taken to complete a computation is

    T_sol = T_arith + T_mem + T_comm

The time to solution is the sum of the time spent doing arithmetic, the time spent waiting for memory, and the time spent waiting for communication.

For more than a decade, microprocessor developers have focused primarily on T_arith, with a race toward higher clock frequencies. This focus has generally yielded improvements in performance for desktop application benchmarks, because they spend almost no time in communication and often fit in the processor's cache. This design emphasis has resulted in processors with truly spectacular peak floating-point capabilities.

Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5 to 80 floating-point operations per cache miss to main memory were typical. Since DRAM latencies in commodity-based systems are on the order of 100 ns, even an infinitely fast microprocessor will be limited by memory performance to a few hundred megaflops in most cases.
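To make the memory-bound ceiling concrete, the short sketch below estimates the floating-point rate a single processor can sustain when every group of arithmetic operations must wait for a cache miss to main memory. It is illustrative only; the latency and flops-per-miss figures are the ones quoted above, not measurements.

    #include <stdio.h>

    /* Illustrative sketch: the sustained flop rate of a memory-bound kernel is
     * roughly (flops per cache miss) / (miss latency), independent of how fast
     * the arithmetic units are. Figures are those quoted in the text
     * (about 100 ns to DRAM, 5 to 80 flops per miss), not measurements. */
    int main(void)
    {
        const double miss_latency_s = 100e-9;         /* ~100 ns to main memory */
        const int flops_per_miss[] = { 5, 20, 80 };   /* typical HPTC range     */

        for (int i = 0; i < 3; i++) {
            double mflops = flops_per_miss[i] / miss_latency_s / 1e6;
            printf("%2d flops per miss -> at most %4.0f MFLOPS per processor\n",
                   flops_per_miss[i], mflops);
        }
        return 0;   /* prints 50, 200, and 800 MFLOPS: "a few hundred megaflops" */
    }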
We concluded that silicon technology has progressed to the point where T_arith is, for the most part, irrelevant. Further gains have to be made in the memory and communications components of the time to solution, and by increases in parallelism. We began to look at HPTC applications as problems in data movement rather than number crunching.

We chose to attack T_mem by building a system that supports an efficient parallel computing model that allows scaling to hundreds and thousands of processors. We modify our simple model of computation to account for scaling:

    T_sol(N) = T_arith/N + T_mem/N + T_comm(N)

By scaling to N communicating processes, we are able to spread the data movement task over N independent memory access streams. Scaling is, of course, limited by the cost of communication. And an application's performance ultimately can be limited by other terms of a more complete time-to-solution model, including the serial component of the task (Amdahl's Law), the time spent waiting due to load imbalance, the time spent waiting due to OS interference, and the time spent waiting for I/O.

A Balanced Design

Our hardware design was guided by a simple idea: while traditional clusters are built upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a way that delivers maximum performance per dollar, watt, and square foot.

We started with a low-power processor that let us pack six processors on a node chip. The processors share access to two interleaved memory controllers that allow up to sixteen concurrent memory accesses. We took advantage of the dense packaging by connecting the cluster nodes in a very low latency, high bandwidth, extremely reliable interconnect fabric that minimizes the cost of communication. This fabric scales beyond what is achievable in commodity-based clusters with even the best communication hardware.

While balance among the computation components is key, many HPTC applications demand high-speed parallel I/O. But putting disks physically inside the cluster is awkward and unnecessary. Instead, we designed our architecture to provide external I/O which can be connected to standard disk arrays and other I/O systems, with enough capacity to accommodate I/O systems of enormous scale. The SiCortex clusters provide substantial I/O bandwidth (up to 108 independent PCI Express® ports in the SC5832).

Reliability

Although the commodity-based servers used to build traditional clusters are fairly reliable individually, their reliability is inadequate in systems that depend on hundreds or thousands of these boxes, plus additional switches, cabling, power, and air conditioning. We've taken a systems approach to reliability, based on integration, error correction, and redundancy:

•By reducing the number of components to a minimum, we removed many potential sources of failure.
•By incorporating aggressive error correcting code (ECC) and communications link error recovery, we reduced the most common sources of transient errors.
•By using an interconnect with built-in triple redundancy and a power system with N+1 redundancy, we provide for continued operation even in the presence of component failures.

Power

Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use parallel links, giving higher bandwidth. And with a small, single-cabinet system we were able to use a single master clock, resulting in reduced synchronization delays.

Our low-power design also has cascading benefits in reducing infrastructure costs such as building space and air conditioning, and in reducing operational costs for electricity. For more information about power reduction, see "Why Power Matters" at www.sicortex.com.

The SiCortex Systems

A SiCortex SC5832 system is composed of 972 six-way SMP compute nodes connected by a low latency, high bandwidth interconnect fabric. The SC5832 system is contained in a single stand-alone cabinet, as shown in Figure 1.

FIGURE 1. The SiCortex SC5832 System (labeled components: cooling exhaust, I/O cable raceways, fabric midplane and processor modules, cooling intake, system service processor, power supplies, Ethernet switch)

The SC648 is composed of 108 compute nodes and is contained in a standard 19-inch equipment rack, as shown in Figure 2. A single rack can accommodate two SC648 systems.

FIGURE 2. The SiCortex SC648 System

Each node in the SiCortex system (Figure 3) consists of a single node chip and two standard DDR2 memory modules (DIMMs). The node chip contains six 64-bit processors, their L1 and L2 caches, two interleaved memory controllers (one for each DIMM), the interconnect fabric links and switch, a DMA Engine, and a PCI Express (PCIe®) interface. The PCIe interface is used for external I/O devices, and is enabled only on some nodes.

FIGURE 3. SiCortex Node (six 64-bit MIPS CPUs, each with its own L1 cache, sharing a coherent L2 cache, two DDR-2 controllers with one DIMM each, a DMA Engine, a fabric switch with links to and from other nodes, and a PCI Express controller for external I/O)

The nodes in a SiCortex system are connected to each other via a fabric based on the Kautz digraph. For more information about the SiCortex implementation of the Kautz topology, see "A New Generation of Cluster Interconnect" at www.sicortex.com.

The diameter of the network (the greatest number of hops a message must take from source to destination) is proportional to the logarithm of the number of nodes in the system. This property results in very small network diameters. The diameter of the SC5832 is six for 972 nodes, compared with a diameter of at least 15 for a 3-D torus of 1024 nodes, while using half as many links.

SiCortex has developed an approach to partitioning the Kautz graph into identical 27-node tiles, which lets us build a range of systems using a single processor module. Each tile forms a module. The SC5832 incorporates 36 modules, while the SC648 requires only four. As shown in Figure 4, each module contains 27 node chips and 54 DIMMs.
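As a check on these figures, the sketch below computes the node count of a degree-3 Kautz graph from its diameter (a Kautz graph of out-degree d and diameter k has d^k + d^(k-1) nodes, a standard property of the topology) and the resulting number of 27-node modules. It is only an illustration of the arithmetic, not a description of SiCortex's routing or partitioning method.

    #include <stdio.h>

    /* Node count of a Kautz graph with out-degree d and diameter k:
     *   N = d^k + d^(k-1)
     * For the SiCortex systems d = 3, and each 27-node tile is one module. */
    static long kautz_nodes(long d, int k)
    {
        long dk = 1;
        for (int i = 0; i < k - 1; i++)
            dk *= d;                   /* dk = d^(k-1) */
        return dk * d + dk;            /* d^k + d^(k-1) */
    }

    int main(void)
    {
        int diameters[] = { 6, 4 };    /* SC5832 and SC648 */
        for (int i = 0; i < 2; i++) {
            long n = kautz_nodes(3, diameters[i]);
            printf("degree 3, diameter %d: %ld nodes = %ld modules of 27\n",
                   diameters[i], n, n / 27);
        }
        return 0;   /* prints 972 nodes / 36 modules and 108 nodes / 4 modules */
    }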
Of the 27 nodes on a module, three have their PCIe buses connected to EXPRESSMODULE™ slots, and a fourth has a PCIe dual-gigabit Ethernet controller. EXPRESSMODULES are PCIe cards designed for servers rather than desktop PCs.

Each processor module also has a small dedicated microprocessor to assist with boot, diagnostics, and system management. This module service processor (MSP) connects to the node chips by JTAG-style scan chains and also connects to a 100 megabit Ethernet control network. The control network is managed by a high-reliability Linux server acting as the system service processor (SSP).

FIGURE 4. The 27-Node Module (labeled components: node chips, fabric connector, PCI Express modules, Ethernet, memory DIMMs)

The SiCortex Node

The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent caches, two interleaved memory interfaces, high speed I/O, and a programmable interface to the interconnect fabric.

The processors are based on a low-power 64-bit MIPS® implementation. Each processor has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB segment of the Level 2 cache. The processor contains a 64-bit floating-point pipeline and has a peak floating-point rate of 1 GFLOPS. The processor's six-stage pipeline provides in-order execution of up to two instructions per cycle. This simple design dissipates less than one watt per processor core. The processor's rather modest instruction-level parallelism is well suited to HPTC applications, which typically spend most of their time waiting for memory accesses to complete.

The node's PCIe interface provides up to 2.5 GB/s of I/O bandwidth via a PCIe root complex controller. The SiCortex systems support PCIe adaptors for Ethernet, InfiniBand, and Fibre Channel.

Both the PCIe controller and the DMA Engine (described in The Fabric Interconnect below) have coherent access to the L2 cache. Inbound transfers that hit in the L2 cache replace the L2 cached data. Outbound transfers that hit in the L2 cache leave the data in the cache undisturbed. (This is a key feature for implementing low-latency MPI transfers.)

Each DDR-2 controller supports up to 8 pipelined accesses to DRAM simultaneously. Supported DRAM configurations range from 1 to 8 GB per node and from 400 MHz to 800 MHz clock rates.

System                   SC5832                                         SC648
Compute Nodes            972                                            108
Number of Processors     5832                                           648
Processor                500 MHz MIPS64®, 1 GFLOPS (double-precision)   500 MHz MIPS64, 1 GFLOPS (double-precision)
Interconnect Topology    Diameter-6 Kautz                               Diameter-4 Kautz
Interconnect Links       2916 @ 2 GB/s                                  324 @ 2 GB/s
Memory per Node          1 to 8 GB                                      1 to 8 GB
Memory per System        972 to 7776 GB                                 108 to 864 GB
PCIe I/O                 108 8-lane ports                               12 8-lane ports
Gigabit Ethernet I/O     72 ports                                       8 ports
Input Power              18 KW                                          2 KW
Physical Dimensions      56"W x 56"D x 72"H                             23"W x 36"D x 72"H
Operating System         Linux                                          Linux

TABLE 1. SiCortex System Specifications
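The system-level figures in Table 1 follow directly from the per-node numbers above. The small sketch below reproduces that arithmetic (processor, peak-flop, link, and memory totals) purely as a consistency check.

    #include <stdio.h>

    /* Consistency check on Table 1: derive system totals from per-node figures.
     * Per node: 6 processors at 1 GFLOPS each, 3 outbound fabric links,
     * and 1 to 8 GB of memory. */
    static void totals(const char *name, int nodes)
    {
        printf("%s: %d processors, %.3f TFLOPS peak, %d links, %d to %d GB memory\n",
               name,
               nodes * 6,                  /* processors           */
               nodes * 6 * 1.0 / 1000.0,   /* peak TFLOPS          */
               nodes * 3,                  /* unidirectional links */
               nodes * 1, nodes * 8);      /* memory range in GB   */
    }

    int main(void)
    {
        totals("SC5832", 972);   /* 5832 processors, 5.832 TFLOPS, 2916 links, 972-7776 GB */
        totals("SC648",  108);   /*  648 processors, 0.648 TFLOPS,  324 links, 108-864 GB  */
        return 0;
    }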
The Fabric Interconnect

Within the node chip, the fabric interconnect consists of three components: the DMA Engine, the fabric switch, and the fabric links. The DMA Engine connects the memory system to the fabric switch, and implements the processors' software interface to the fabric. The fabric switch forwards traffic between incoming and outgoing links, and to and from the DMA Engine. The fabric links, three receivers and three transmitters per node, connect directly to other nodes in the system. For more information about the SiCortex fabric interconnect, see "A New Generation of Cluster Interconnect" at www.sicortex.com.

The DMA Engine

The DMA Engine provides a high-bandwidth interface between the memory system and the fabric switch, relieving software of the low-level work of repetitively creating packets of memory data and injecting them into the fabric, or accepting packets from the fabric and distributing their payload to appropriate locations in memory. The DMA Engine is designed to work closely with both privileged kernel-level device drivers and user-level library software to provide very low overhead transfers in a protected virtual-memory environment. Low overhead requires that typical transfers can be initiated and completed without invoking kernel-mode or interrupt-level software at either the sender or the receiver, and that buffers need not be copied.

The DMA Engine is microcoded, allowing it to be retargeted to protocols other than MPI. The programmability greatly reduces the complexity of the logic required to dispatch, reformat, and transfer messages to and from user-mode processes. The DMA Engine cooperates with the Linux kernel so that MPI send and receive operations are handled safely and efficiently, entirely in user mode.

The Fabric Switch

The fabric switch in the node chip (shown in Figure 5) connects three inbound links to three outbound links and to the DMA Engine that originates messages into the fabric. The core of the switch is a 3x3 crossbar that provides paths from each of the three fabric inputs to each of the three fabric outputs. We add three independent inputs from the DMA Engine to allow it to originate three packet streams into the fabric simultaneously. We add three more independent outputs to the DMA Engine to allow it to sink three packet streams from the fabric.

FIGURE 5. The Fabric Switch (labeled components: inputs from the fabric receive ports and the DMA Engine transmit ports, store-and-forward packet buffers, a replay buffer, and outputs to the fabric transmit ports and the DMA Engine receive ports)

Each crosspoint contains 16 full-packet buffers with ECC. The switch implements a virtual channel cut-through router. Cut-through allows packets to pass through the switch with minimal delay, and the virtual channel implementation prevents deadlock.

Packets are source-routed. As a packet arrives on an input port, the switch extracts routing information from the first word. The routing instruction indicates which port the message will leave from and thus which store-and-forward buffer will capture the incoming packet. The DMA Engine inserts the routing instruction for the entire trip at the front of each packet when it transmits the packet to the fabric switch.

The Fabric Topology

A distinguishing characteristic of the SiCortex systems is their use of a fabric topology based on an idea originally developed by William Kautz in the 1960s.[1] Until now, the Kautz topology has found little use due to the difficulty in routing its many complex paths. But it provides a number of attractive features:

•Logarithmic diameter: The maximum number of hops a message must make scales with the log of the number of nodes. This reduces the transit time for a message through the fabric and, more importantly, reduces network congestion because each message spends less time occupying resources in the fabric.
•Fixed degree: Systems of any size can be built from nodes that have a fixed number of input and output ports.
•Redundant paths: The removal or failure of any one node in the system increases the diameter of the network by only one hop. No other nodes become unreachable.

A breakthrough that led us to use this topology in our interconnect was our development of an efficient partitioning of the Kautz graph, allowing us to build systems using identical 27-node modules.

In the SiCortex systems, the fabric connects the nodes into a degree-3 directed Kautz graph. All links are unidirectional, and each node has three input links and three output links that connect it to other nodes. The 972 nodes of the SC5832 form a diameter-6 graph. The 108 nodes of the SC648 form a diameter-4 graph. For more information about the SiCortex fabric topology, see "A New Generation of Cluster Interconnect" at www.sicortex.com.

[1] Kautz, W.H., "Bounds on directed (d,k) graphs," in Theory of Cellular Logic Networks and Machines, AFCRL-68-0668 Final Report, pp. 20-28, 1968.

Reliability Considerations

The SC5832 configuration contains over 15 Gbits of L1 and L2 cache storage, and up to 62 Tbits of DRAM storage. It also has more than 52,000 wires comprising the communication fabric. SiCortex has taken special care in the design of its systems to identify and tolerate transient faults in storage and communication paths.

Memory Errors

All memory structures within the SiCortex cluster are protected so that the system can recover gracefully from single-bit errors. Data caches, message buffers, and DRAM arrays (main memory) are all protected by a single-bit error correction, double-bit error detection code. Single-bit errors are corrected transparently and logged by the system for off-line analysis. Double-bit errors are detected and typically force a node to reboot. Instruction caches are protected by parity; parity errors are logged, but treated by the processor as instruction-cache misses.

Based on transient-error models,[2] we would expect a single-bit error in a SiCortex node every two years. But with 972 nodes in the system, we'll see a single-bit error every 16 hours. It's easy to see why ECC might not be necessary in a commodity-based system designed for the desktop, but when systems are built out of hundreds and thousands of components, error rates are such that ECC becomes mandatory.

The double-bit, transient-error rate is substantially lower (by orders of magnitude) than the single-bit error rate, because a double-bit error requires either an upset event that affects two or more bits, or two upset events occurring in the same word at different times. The former case requires very energetic particles, which are quite rare. The latter case is extremely improbable; failure times are in excess of 1 billion hours in machine-room environments.

[2] We use a simple model of MTTE = 10^12 / (4 x bits), based on an installation at 7000 feet above sea level.
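The note above can be used to reproduce the quoted error rates. The sketch below does so under one assumption of ours: that the per-node figure counts the node chip's roughly 16 Mbit of L1 and L2 cache (the system total of over 15 Gbits divided over 972 nodes) rather than its DRAM. Treat it as an illustration of the model, not as SiCortex's published derivation.

    #include <stdio.h>

    /* Soft-error model from note [2]: MTTE (hours) = 1e12 / (4 x bits).
     * Assumption (ours): apply it to a node's on-chip cache storage, taken as
     * the >15 Gbit system total of L1+L2 divided over 972 nodes. */
    int main(void)
    {
        const double system_cache_bits = 15e9;   /* >15 Gbits of L1+L2 in the SC5832 */
        const int    nodes             = 972;
        double bits_per_node = system_cache_bits / nodes;

        double mtte_node_hours   = 1e12 / (4.0 * bits_per_node);
        double mtte_system_hours = mtte_node_hours / nodes;

        printf("per node:   %.0f hours (about %.1f years)\n",
               mtte_node_hours, mtte_node_hours / (24 * 365));
        printf("per system: %.1f hours\n", mtte_system_hours);
        return 0;   /* roughly 16,000 hours (~1.8 years) per node, ~17 hours system-wide */
    }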
Communication Errors

The SC5832 fabric transports data at an aggregate rate of more than 50 x 10^12 bits per second. At this rate, single-bit errors are inevitable. Predicting these bit-error rates accurately during system design is difficult. Instead, we have assumed that bit errors are extremely frequent and have designed the fabric to recover.

Links carry messages through the fabric in packets. As a link transmits each packet, it writes a cyclic redundancy check word (CRC) into the last word of the packet. The receiving link checks the CRC and, if no discrepancy is found, the receiver acknowledges receipt of the packet. If the receiver detects an error, it asks for retransmission of all packets since the last correctly received and acknowledged packet. This mechanism maintains packet delivery order at the expense of a replay buffer in the transmitting end of each link.

As a result of the per-link error-control strategy and ECC on all intermediate storage, the communication software can treat the fabric communications hardware as a reliable, in-order channel. This approach removes the burden of low-level error recovery in software that is imposed, for instance, by Ethernet-based solutions.

Cooling

With 972 nodes in a single cabinet, the maximum power dissipation of an SC5832 is 18 kilowatts. Reliability of the system is greatly enhanced by efficient cooling that maintains low component temperatures.

The SC5832 has a large cooling intake aperture in the middle of the cabinet (see Figure 1). Air enters the upper cabinet from the front and back sides of this intake, passes through the fan tray, cools the 27-node modules, and exits from the top of the machine. The system regulates the speed of the cooling fans, keeping the die temperatures of the node chips below 100 degrees Celsius for machine room temperatures up to 35 degrees Celsius. The cooling system is capable of moving more than 4000 cubic feet of air per minute through the machine. The cooling fans (see Figure 6) are multiply redundant. The fan tray can be replaced without taking the system out of service.

FIGURE 6. Card Cage and Blowers

The SiCortex Software

SiCortex provides a complete suite of software tools designed for users to develop and run applications on our systems. The suite includes:

•Boot and diagnostic software
•A complete Linux operating system: kernel, device drivers, and applications
•Compilers
•Libraries
•Debuggers
•Performance analysis tools

Most of the software components for the SC5832 are open source.

Boot and Diagnostic Software

The nodes of the SC5832 have no boot ROM. Instead, their initial programs are loaded over JTAG-compatible scan chains by a uClinux-based microprocessor on each 27-node module. The MSPs communicate over a control network with a high-reliability Linux server acting as the SSP. The SSP is responsible for diagnostics, logging, and system management.

The Linux Kernel

The SiCortex systems run one Linux kernel on each six-processor SMP node. The SiCortex Linux kernel is based on the kernel.org 2.6 sources, with modifications from linux-mips.org to support the MIPS processors, and patches from Cluster File Systems® to support the Lustre® file system. SiCortex also adds system-specific changes to support the node chip's interrupt system and ECC logic, 64 KB virtual memory pages, and virtual memory integration for the DMA Engine.

On each node, the Linux kernel runs three SiCortex-specific device drivers:

•The fabric driver, which supports use of the DMA Engine and fabric by application MPI libraries.
•The SCethernet driver, a Linux network driver that transmits IP frames over the fabric.
•The MSP driver, which supports network and console communications with the MSP and the SSP.

Networking

The SCethernet device driver provides network connectivity between all nodes in the system, transmitting IP frames over the fabric. With SCethernet, all nodes appear to be on a single network segment.
For external network connectivity, the SC5832 system includes 72 Gigabit Ethernet ports. The 108 PCIe module ports can be used to provide additional network connections. Customers can configure the nodes in the system either to use externally visible IP addresses or to use network address translation (NAT) via a number of gateway nodes. The system supports both IPv4 and IPv6 address allocation.

The File Systems

The SiCortex systems access storage in several ways:

•The systems can be connected directly to external storage arrays by attaching Fibre Channel or InfiniBand adapters to the nodes that have PCIe I/O interfaces. These I/O nodes can then provide storage access to other nodes in the system, acting either as Lustre servers (using the Lustre parallel file system developed by Cluster File Systems) or as NFS servers. This configuration provides exceptionally high-performance access to large application data sets.
•SiCortex nodes can use networked file systems, including both the Lustre file system and NFS, to connect to external file servers. Such a configuration can provide access to user home directories, or to file servers that provide application data shared with other computing resources.
•The SSP includes a RAID array that is used to boot cluster nodes and provide their root file system data, as well as to store management information and system logs.

For more information about parallel I/O with the SiCortex systems, see "The Lustre High Performance File System" at www.sicortex.com.

The Linux Distribution

The SiCortex Linux environment is derived from the Gentoo Linux distribution. Gentoo is a source-based distribution that supports many different processor architectures, including MIPS. The Gentoo system provides powerful control over optimization and package dependencies, and support for both 32-bit and 64-bit programming interfaces. SiCortex Linux software releases include binary packages for simple installation and upgrade, as well as the Gentoo tools for building from source code. For more information about Gentoo Linux, see www.gentoo.org.

Compilers and Tool Chains

SiCortex provides two compiler suites for our systems: the GNU compilers, supporting C and C++, and the QLogic™ PathScale™ compiler suite, supporting Fortran 77, Fortran 95, C, and C++. The GNU compilers are gcc version 4.1. Binaries produced by both sets of compilers are interoperable. With both compiler suites, users can choose to build their software using native compilers, which run on the SiCortex nodes, or using cross-compilers, which run on external x86 Linux systems.

The SiCortex software includes the GNU gdb debugger. Customers can also license the TotalView® debugger, supported for the SiCortex systems, from Etnus®. For more information about QLogic PathScale compilers, see www.pathscale.com. For more information about the TotalView debugger, see www.etnus.com.

MPI

The SiCortex MPI library is a cornerstone of our high-performance computing software environment. The MPI library allows user-level applications direct access to our fabric interconnect hardware, without OS system calls in communication-critical paths. We have optimized the library to provide extremely low latency for short messages, to use zero-copy high-bandwidth RDMA primitives for long messages, and to take advantage of specialized DMA Engine features that accelerate MPI collective operations. Our MPI implementation is derived from the popular MPICH2 software from Argonne National Laboratory.
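Because the library presents the standard MPI interface, ordinary MPI-1 code runs unchanged. The short example below is a generic ping-pong between ranks 0 and 1 of the kind often used to measure short-message latency; it is plain, portable MPI, not SiCortex-specific code.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI-1 ping-pong: rank 0 sends a short message, rank 1 echoes it
     * back, and rank 0 reports the round-trip time. Nothing here is
     * SiCortex-specific; the fabric, DMA Engine, and zero-copy paths are
     * hidden behind the standard MPI calls. */
    int main(int argc, char **argv)
    {
        int rank, buf = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double t0 = MPI_Wtime();
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("round trip: %.2f microseconds\n", (MPI_Wtime() - t0) * 1e6);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }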
At present, we support all MPI-1 features and selected MPI-2 features. For more information about the SiCortex MPI implementation, see "A New Generation of Cluster Interconnect" at www.sicortex.com.

Libraries

The SiCortex software includes a rich suite of mathematics and scientific library packages that are important to developers of scientific software. These include: BLAS, LINPACK, FFT, LAPACK, CBLAS, BLACS, Hypre, ScaLAPACK, METIS, ParMETIS, PETSc, SUNDIALS, SuperLU, SPRNG, NetCDF, HDF4, and HDF5. We invest substantial effort in porting, optimizing, testing, and maintaining these libraries so that users do not need to do so. For information about the software libraries included with the SiCortex system, see "The SiCortex Application Development Environment" at www.sicortex.com.

Performance Tools

Recognizing that performance optimization is a critical and difficult step in developing HPTC applications, SiCortex includes a comprehensive set of tools that allow users to both instrument and analyze the performance of their applications. These tools include PAPI, TAU, Vampir NG, and other packages. In many cases, we have adapted these tools to take advantage of specialized performance-measurement abilities of the SiCortex hardware. For more information about the performance management tools, see "The SiCortex Application Development Environment" at www.sicortex.com.

Job Control and System Management

In line with common industry practice, the SiCortex systems use resource management software to coordinate the execution of jobs, with the assistance of a job queue or scheduling manager. Our resource manager, based on the SLURM system from Lawrence Livermore National Laboratory, provides a mechanism for executing jobs across the available cluster resources, including CPUs, memory, and communications resources. SLURM can work with a variety of job schedulers, such as Maui, or the commercially available Moab and LSF schedulers.

The SiCortex system management software collects error, environmental, and performance data, and makes this data available to users and system managers through a variety of monitoring interfaces. For more information about the SLURM system, see www.llnl.gov/linux/slurm.

Conclusion

The SiCortex systems, including our flagship SC5832, are Linux/MPI clusters, only more so. They offer nearly six Teraflops, with up to eight Terabytes of memory, six Terabytes per second of interconnect, and 250 Gigabytes per second of I/O in one box, drawing only 18 Kilowatts of power. They deliver outstanding performance to existing applications.

Trademark and Copyright Information

Copyright © 2006 SiCortex Incorporated. All rights reserved.

The following are trademarks of their respective companies or organizations: Etnus and TotalView are registered trademarks of Etnus, LLC. EXPRESSMODULE is a trademark of PCI-SIG. InfiniBand is a registered trademark of the InfiniBand Trade Association. Intel is a registered trademark of Intel Corporation. Linux is a registered trademark of Linus Torvalds in the U.S. and other countries. The registered trademark Linux is used pursuant to a sublicense from the Linux Mark Institute, the exclusive licensee of Linus Torvalds, owner of the mark in the U.S. and other countries. Lustre is a registered trademark of Cluster File Systems, Inc. MIPS and MIPS64 are registered trademarks of MIPS Technologies, Inc.
PathScale, QLogic PathScale, and combinations thereof are trademarks of PathScale, Inc. PCI, PCI Express, and PCIe are registered trademarks of PCI-SIG.