SiCortex Technical Summary
Matt Reilly, Lawrence C. Stewart,
Judson Leonard, and David Gingold
December 2006
The SiCortex family of Linux® cluster systems takes High Performance Technical Computing (HPTC) a step beyond conventional clusters. SiCortex concentrates on power-efficient
design and simultaneous tuning of silicon, microcode, and system software to deliver outstanding application performance per dollar, per watt, and per square foot. The Company’s
initial product offering includes:
• The SC5832, which is a 5.8 Teraflop system with up to 8 Terabytes of memory. The
SC5832 fits into a single cabinet and draws 18 kW.
• The SC648, which is a 648 Gigaflop system with up to 864 Gigabytes of memory.
Two SC648 systems fit in a single 19” rack with room to spare. A single SC648 system draws 2 kW.
Abstract
This paper describes the hardware and software in the SiCortex systems, and discusses the
ideas that motivated their design.
Introduction
SiCortex is introducing a range of Linux-based cluster computer systems optimized to
deliver outstanding application performance per dollar, per watt, and per square foot.
In recent years, there has been phenomenal growth in the use of clusters for high performance technical computing (HPTC). Modern clusters are typically built from uniprocessor
or small SMP nodes, connected by networks ranging from Gigabit Ethernet to InfiniBand®.
These clusters typically run the Linux operating system and use the Message
Passing Interface (MPI) for communication. Systems suitable for running cluster applications are defined not by an instruction set or a communications technology, but by these
software standards: Linux and MPI.
Why A New Cluster System?
It’s easy to see why clusters dominate the HPTC market: they are cheaper per unit of peak
computation than the shared-memory large server alternative. But users of current cluster
systems encounter a number of unpleasant realities:
• Commodity-based clusters seldom deliver more than a small fraction of their peak
compute rate because real HPTC applications spend most of their time waiting for
data from memory.
• Applications on commodity-based clusters have not scaled well to large numbers of
processors. As long as interprocessor communication is viewed as an I/O function,
message operations will take more time than they should.
• Commodity-based clusters are unreliable. While a single node in a cluster might offer
mean-time-to-crash figures of a year or more, this is inadequate when systems are
built from hundreds and thousands of nodes.
• Commodity-based clusters use too much power.
We set out to address each of these issues, and a few more.
The Time-to-Solution Model
We approached the design of our systems by looking at a range of HPTC applications and
determining why they take so long to run.
Our model of the time taken to complete a computation is
Tsol = Tarith + Tmem + Tcomm
The time to solution is the time spent doing arithmetic, the time spent waiting for memory,
and the time spent waiting for communication.
For more than a decade, microprocessor developers have been focused primarily on Tarith
with a race toward higher clock frequencies. This focus has generally yielded improvements in performance for desktop application benchmarks, because they spend almost no
time in communication and often fit in the processor's cache. This design emphasis has
resulted in processors with truly spectacular peak floating-point capabilities.
Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5-80 floating-point operations
per cache miss to main memory were typical. Since DRAM latencies in commodity-based
systems are on the order of 100 ns, even an infinitely fast microprocessor will be limited by
memory performance to a few hundred megaflops in most cases (for example, at 20 floating-point
operations per 100 ns cache miss, the ceiling is just 200 megaflops). We concluded that silicon
technology has progressed to the point where Tarith is, for the most part, irrelevant. Further
gains have to be made in the memory and communications components of the time to
solution and by increases in parallelism.
We began to look at HPTC applications as problems in data movement rather than number
crunching. We chose to attack Tmem by building a system that supports an efficient parallel
computing model that allows scaling to hundreds and thousands of processors. We modify
our simple model of computation to account for scaling:
Tsol(N) = Tarith/N + Tmem/N + Tcomm(N)
By scaling to N communicating processes, we are able to spread the data movement task
over N independent memory access streams. Scaling is, of course, limited by the cost of
communication. And an application’s performance ultimately can be limited by other terms
of a more complete time-to-solution model, including the serial component of the task
(Amdahl’s Law), the time spent waiting due to load imbalance, time spent waiting due to
OS interference, and time spent waiting for I/O.
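To make the scaling model concrete, the short C program below evaluates Tsol(N) for a hypothetical workload. All of the constants (the Tarith and Tmem totals and a fixed-plus-logarithmic form for Tcomm(N)) are illustrative assumptions, not measurements of any SiCortex system.

/* Illustrative sketch of the time-to-solution scaling model.  The workload
 * constants below are hypothetical, chosen only to show how the
 * communication term eventually limits scaling. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t_arith = 100.0;   /* seconds of pure arithmetic (assumed)   */
    const double t_mem   = 400.0;   /* seconds waiting for memory (assumed)   */
    const double t_c0    = 2.0;     /* fixed communication cost (assumed)     */
    const double t_clog  = 5.0;     /* cost growing with log2(N) (assumed)    */

    for (int n = 1; n <= 4096; n *= 2) {
        double t_comm = t_c0 + t_clog * log2((double)n);   /* Tcomm(N) */
        double t_sol  = t_arith / n + t_mem / n + t_comm;  /* Tsol(N)  */
        printf("N=%5d  Tsol=%8.2f s  speedup=%6.2f\n",
               n, t_sol, (t_arith + t_mem + t_c0) / t_sol);
    }
    return 0;
}

Compiled with a C99 compiler (for example, cc -std=c99 model.c -lm), the program shows the speedup flattening once the communication term dominates the shrinking Tarith/N and Tmem/N terms.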
A Balanced Design
Our hardware design was guided by a simple idea: while traditional clusters are built
upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a
way that delivers maximum performance per dollar, watt, and square foot.
We started with a low power processor that let us pack six processors on a node chip. The
processors share access to two interleaved memory controllers that allow up to sixteen concurrent memory accesses. We took advantage of the dense packaging by connecting the
cluster nodes in a very low latency, high bandwidth, extremely reliable, interconnect fabric
that minimizes the cost of communication. This fabric scales beyond what is achievable in
commodity-based clusters with even the best communication hardware.
While balance among the computation components is key, many HPTC applications
demand high-speed parallel I/O. But putting disks physically inside the cluster is awkward
and unnecessary. Instead, we designed our architecture to provide external I/O which can
be connected to standard disk arrays and other I/O systems, and with enough capacity to
accommodate I/O systems of enormous scale. The SiCortex clusters provide substantial
I/O bandwidth (up to 108 independent PCI Express® ports in the SC5832).
Reliability
Although the commodity-based servers used to build traditional clusters are fairly reliable
individually, their reliability is inadequate in systems that depend on hundreds or thousands of these boxes, plus additional switches, cabling, power, and air conditioning.
We’ve taken a systems approach to reliability, based on integration, error correction, and
redundancy.
• By reducing the number of components to a minimum, we removed many potential
sources of failure.
• By incorporating aggressive error correcting code (ECC) and communications link error
recovery, we reduced the most common sources of transient errors.
• By using an interconnect with built-in triple redundancy and a power system with
N+1 redundancy, we provide for continued operation even in the presence of component failures.
Power
Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a
small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use
parallel links, giving higher bandwidth. And with a small, single-cabinet system we were
able to use a single master clock, resulting in reduced synchronization delays.
Our low-power design also has cascading benefits in reducing infrastructure costs such as
building and air conditioning, and in reducing operational costs for electricity.
For more information about power reduction, see “Why Power Matters” at www.sicortex.com.
The SiCortex Systems
A SiCortex SC5832 system is composed of 972 six-way SMP compute nodes connected
by a low latency, high bandwidth, interconnect fabric. The SC5832 system is contained
in a single stand-alone cabinet, as shown in Figure 1.
[Figure 1 callouts: cooling exhaust, I/O cable raceways, fabric midplane and processor modules, cooling intake, system service processor, power supplies, Ethernet switch]
FIGURE 1. The SiCortex SC5832 System
The SC648 is composed of 108 compute nodes and is contained in a standard, 19 inch,
equipment rack, as shown in Figure 2. A single rack can accommodate two SC648 systems.
FIGURE 2. The SiCortex SC648 System
Each node in the SiCortex system (Figure 3) consists of a single node chip and two standard DDR2 memory modules (DIMMs). The node chip contains six 64-bit processors, their
L1 and L2 caches, two interleaved memory controllers (one for each DIMM), the interconnect fabric links and switch, a DMA Engine, and a PCI Express (PCIe®) interface. The PCIe
interface is used for external I/O devices and is enabled only on some nodes.
[Figure 3 diagram: a node chip containing six 64-bit MIPS CPUs, each with its own L1 cache, a shared coherent L2 cache, a DMA Engine and fabric switch with fabric links to and from other nodes, two DDR-2 controllers each attached to a DDR-2 DIMM, and a PCI Express controller for external I/O]
FIGURE 3. SiCortex Node
The nodes in a SiCortex system are connected to each other via a fabric based on the Kautz
digraph. For more information about the SiCortex implementation of the Kautz topology, see
“A New Generation of Cluster Interconnect” at www.sicortex.com. The diameter of the network (the greatest number of hops a message must take from source to destination) is proportional to the logarithm of the number of nodes in the system. This property results in
very small network diameters. The diameter of the SC5832 is six for 972 nodes, compared
with a diameter of at least 15 for a 3-D torus of 1024 nodes, while using half as many
links.
SiCortex has developed an approach to partitioning the Kautz graph into identical 27-node
tiles, which lets us build a range of systems using a single processor module. Each tile
forms a module. The SC5832 incorporates 36 modules, while the SC648 requires only 4.
As shown in Figure 4, each module contains 27 node chips and 54 DIMMs. Of the 27
nodes on a module, three have their PCIe buses connected to EXPRESSMODULE™ slots,
and a fourth has a PCIe dual-gigabit Ethernet controller. EXPRESSMODULES are PCIe
cards designed for servers rather than desktop PCs. Each processor module also has a small
dedicated microprocessor to assist with boot, diagnostics, and system management. This
module service processor (MSP) connects to the node chips by JTAG-style scan chains and
also connects to a 100 megabit Ethernet control network. The control network is managed
by a high-reliability Linux server acting as the system service processor (SSP).
[Figure 4 callouts: node chip, fabric connector, PCI Express modules, Ethernet, memory DIMM]
FIGURE 4. The 27-Node Module
The SiCortex Node
The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent
caches, two interleaved memory interfaces, high speed I/O, and a programmable interface
to the interconnect fabric.
The processors are based on a low power 64-bit MIPS® implementation. Each processor
has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB
segment of the Level 2 cache. The processor contains a 64-bit, floating-point pipeline and
has a peak floating-point rate of 1 GFLOPS. The processor’s six-stage pipeline provides in-order
execution of up to two instructions per cycle. This simple design dissipates less than
one watt per processor core.
The processor’s rather modest instruction-level parallelism is well suited to HPTC applications,
which typically spend most of their time waiting for memory accesses to complete.
The node’s PCIe interface provides up to 2.5 GB/s of I/O bandwidth via a PCIe root complex controller. The SiCortex systems support PCIe adaptors for Ethernet, InfiniBand, and
Fibre Channel.
Both the PCIe controller and the DMA Engine (described in The Fabric Interconnect section
below) have coherent access to the L2 cache. Inbound transfers that hit in the L2 cache
replace the L2 cached data. Outbound transfers that hit in the L2 cache leave the data in
the cache undisturbed. (This is a key feature for implementing low-latency MPI transfers.)
Each DDR-2 controller supports up to 8 pipelined accesses to DRAM simultaneously. Supported DRAM configurations range from 1 to 8 GB per node and from 400 MHz to 800
MHz clock rates.
System                    SC5832                           SC648
Compute Nodes             972                              108
Number of Processors      5832                             648
Processor                 500 MHz, 1 GFLOPS                500 MHz, 1 GFLOPS
                          (double-precision) MIPS64®       (double-precision) MIPS64
Interconnect Topology     Diameter-6 Kautz                 Diameter-4 Kautz
Interconnect Links        2916 @ 2 GB/s                    324 @ 2 GB/s
Memory per Node           1 to 8 GB                        1 to 8 GB
Memory per System         972 to 7776 GB                   108 to 864 GB
PCIe I/O                  108 8-lane ports                 12 8-lane ports
Gigabit Ethernet I/O      72 ports                         8 ports
Input Power               18 kW                            2 kW
Physical Dimensions       56''W x 56''D x 72''H            23''W x 36''D x 72''H
Operating System          Linux                            Linux
TABLE 1. SiCortex System Specifications
The Fabric Interconnect
Within the node chip, the fabric interconnect consists of three components: the DMA
Engine, the fabric switch, and the fabric links. The DMA Engine connects the memory system to the fabric switch, and implements the processors’ software interface to the fabric.
The fabric switch forwards traffic between incoming and outgoing links, and to and from
the DMA Engine. The fabric links, three receivers and three transmitters per node, connect
directly to other nodes in the system. For more information about the SiCortex fabric interconnect, see “A New Generation of Cluster Interconnect” at www.sicortex.com.
The DMA Engine
The DMA Engine provides a high-bandwidth interface between the memory system and
the fabric switch, relieving software of the low-level work of repetitively creating packets
of memory data and injecting them into the fabric, or accepting packets from the fabric
and distributing their payload to appropriate locations in memory. The DMA Engine is
designed to work closely with both privileged kernel-level device drivers and user-level
library software to provide very low overhead transfers in a protected virtual-memory environment. Low overhead requires that typical transfers can be initiated and completed
without invoking kernel-mode or interrupt-level software at either the sender or receiver
side, and that buffers need not be copied.
The DMA Engine is microcoded, allowing it to be retargeted to protocols other than MPI.
The programmability greatly reduces the complexity of the logic required to dispatch, reformat, and transfer messages to and from user-mode processes. The DMA Engine cooperates
with the Linux kernel so that MPI send and receive operations are handled safely and efficiently, entirely in user mode.
The Fabric Switch
The fabric switch in the node chip (shown in Figure 5) connects three inbound links to
three outbound links and to the DMA Engine that originates messages into the fabric. The
core of the switch is a 3x3 crossbar that provides paths from each of the three fabric
inputs to each of the three fabric outputs. We add three independent inputs from the DMA
Engine to allow it to originate three packet streams into the fabric simultaneously. We add
three more independent outputs to the DMA Engine to allow it to sink three packet
streams from the fabric.
[Figure 5 callouts: transmit ports from the DMA Engine, receive ports to the DMA Engine, store-and-forward packet buffers, replay buffers, receive ports from the fabric, transmit ports to the fabric]
FIGURE 5. The Fabric Switch
Each crosspoint contains 16 full-packet buffers with ECC. The switch implements a virtual
channel cut-through router. Cut-through allows packets to pass through the switch with
minimal delay and the virtual channel implementation prevents deadlock.
Packets are source-routed. As a packet arrives on an input port, the switch extracts routing
information from the first word. The routing instruction indicates which port the message
will leave from and thus, which store-and-forward buffer will capture the incoming packet.
The DMA Engine inserts the routing instruction for the entire trip at the front of each packet
when it transmits the packet to the fabric switch.
The Fabric Topology
A distinguishing characteristic of the SiCortex systems is their use of a fabric topology
based on an idea originally developed by William Kautz in the 1960s.1 Until now, the
Kautz topology has found little use due to the difficulty in routing its many complex paths.
But it provides a number of attractive features that include:
• Logarithmic diameter: The maximum number of hops a message must make scales with the
log of the number of nodes. This reduces the transit time for a message through the
fabric and, more importantly, reduces network congestion because each message
spends less time occupying resources in the fabric.
• Fixed degree: Systems of any size can be built from nodes that have a fixed number
of input and output ports.
• Redundant paths: The removal or failure of any one node in the system increases
the diameter of the network by only one hop. No other nodes become unreachable.
A breakthrough that led us to use this topology in our interconnect was our development of
an efficient partitioning of the Kautz graph, allowing us to build systems using identical 27-node modules.
In the SiCortex systems, the fabric connects the nodes into a degree-3 directed Kautz
graph. All links are unidirectional, and each node has three input links and three output
links that connect it to other nodes. The 972 nodes of the SC5832 form a diameter-6
graph. The 108 nodes of the SC648 form a diameter-4 graph.
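To illustrate why the Kautz topology keeps diameters so small, the following C sketch enumerates the abstract degree-3 Kautz digraph (vertices are words over a four-symbol alphabet with no two consecutive symbols equal; there is an edge from s1 s2 ... sk to s2 ... sk x for every x not equal to sk) and measures its diameter by breadth-first search from every vertex. It models only the textbook graph, not SiCortex's routing tables or module partitioning.

/* Sketch: enumerate the degree-3 Kautz digraph and verify its diameter by
 * breadth-first search.  This models only the abstract graph, not the
 * SiCortex routing or the 27-node module partitioning. */
#include <stdio.h>
#include <string.h>

#define ALPHA 4                 /* degree 3 means a 4-symbol alphabet     */
#define MAXN  1024              /* enough room for the 972-node graph     */

static int nodes[MAXN];         /* packed base-4 word for each vertex     */
static int index_of[4096];      /* word code -> vertex number             */

/* Collect every word of the given length with no two equal adjacent symbols. */
static int build(int len)
{
    int limit = 1, n = 0;
    for (int i = 0; i < len; i++) limit *= ALPHA;
    for (int code = 0; code < limit; code++) {
        int ok = 1, prev = -1, c = code;
        for (int i = 0; i < len; i++, c /= ALPHA) {
            int d = c % ALPHA;
            if (d == prev) { ok = 0; break; }
            prev = d;
        }
        if (ok) { index_of[code] = n; nodes[n++] = code; }
    }
    return n;
}

/* Diameter = the largest BFS distance found from any source vertex. */
static int diameter(int n, int len)
{
    int shift = 1, diam = 0;
    for (int i = 0; i < len - 1; i++) shift *= ALPHA;
    for (int s = 0; s < n; s++) {
        int dist[MAXN], queue[MAXN], head = 0, tail = 0;
        memset(dist, -1, sizeof dist);
        dist[s] = 0; queue[tail++] = s;
        while (head < tail) {
            int v = queue[head++], code = nodes[v], last = code % ALPHA;
            for (int x = 0; x < ALPHA; x++) {
                if (x == last) continue;          /* Kautz rule: no repeat */
                int succ = index_of[(code % shift) * ALPHA + x];
                if (dist[succ] < 0) {
                    dist[succ] = dist[v] + 1;
                    if (dist[succ] > diam) diam = dist[succ];
                    queue[tail++] = succ;
                }
            }
        }
    }
    return diam;
}

int main(void)
{
    for (int len = 4; len <= 6; len += 2) {
        int n = build(len);
        printf("word length %d: %d nodes, diameter %d\n", len, n, diameter(n, len));
    }
    return 0;
}

For word lengths 4 and 6 the sketch reports 108 nodes with diameter 4 and 972 nodes with diameter 6, matching the SC648 and SC5832 figures above.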
For more information about the SiCortex fabric topology, see “A New Generation of Cluster
Interconnect” at www.sicortex.com.
1. Kautz, W.H. “Bounds on directed (d,k) graphs,” in Theory of Cellular Logic Networks and
Machines, AFCRL-68-0668 Final Report, pp. 20-28, 1968.
Reliability Considerations
The SC5832 configuration contains over 15 Gbits of L1 and L2 cache storage, and up to
62 Tbits of DRAM storage. It also has more than 52,000 wires comprising the communication fabric. SiCortex has taken special care in the design of its systems to identify and tolerate transient faults in storage and communication paths.
Memory Errors
All memory structures within the SiCortex cluster are protected so that the system can
recover gracefully from single-bit errors. In data caches, message buffers, and DRAM
arrays (main memory), all structures are protected by a single-bit error correction, double-bit
error detection code. Single-bit errors are corrected transparently and logged by the
system for off-line analysis. Double-bit errors are detected and typically force a node to
reboot. Instruction caches are protected by parity; parity errors are logged, but treated by
the processor as instruction-cache misses.
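The sketch below shows the single-error-correct, double-error-detect behavior in miniature, using an extended Hamming code over a 4-bit value. The code width and bit layout used by the SiCortex hardware are not described here, so treat this purely as an illustration of the SECDED principle.

/* Minimal SECDED (single-error-correcting, double-error-detecting) sketch
 * using an extended Hamming(8,4) code on one nibble of data.  Real memory
 * systems use wider codes, but the correction/detection logic is the same. */
#include <stdio.h>

/* Codeword bits: indices 1..7 are the Hamming(7,4) positions, index 0 is the
 * overall parity bit covering positions 1..7. */
typedef struct { unsigned char bit[8]; } codeword;

static codeword encode(unsigned data /* 4 bits */)
{
    codeword c = {{0}};
    c.bit[3] = (data >> 0) & 1;
    c.bit[5] = (data >> 1) & 1;
    c.bit[6] = (data >> 2) & 1;
    c.bit[7] = (data >> 3) & 1;
    c.bit[1] = c.bit[3] ^ c.bit[5] ^ c.bit[7];   /* covers positions 1,3,5,7 */
    c.bit[2] = c.bit[3] ^ c.bit[6] ^ c.bit[7];   /* covers positions 2,3,6,7 */
    c.bit[4] = c.bit[5] ^ c.bit[6] ^ c.bit[7];   /* covers positions 4,5,6,7 */
    for (int i = 1; i <= 7; i++) c.bit[0] ^= c.bit[i];   /* overall parity   */
    return c;
}

/* Returns 0 = clean, 1 = single-bit error corrected, 2 = double-bit error detected. */
static int check(codeword *c)
{
    int syndrome = 0, parity = 0;
    for (int i = 1; i <= 7; i++) if (c->bit[i]) syndrome ^= i;
    for (int i = 0; i <= 7; i++) parity ^= c->bit[i];
    if (syndrome == 0 && parity == 0) return 0;
    if (parity == 1) {               /* odd number of flipped bits: assume one */
        c->bit[syndrome] ^= 1;       /* syndrome 0 means the parity bit itself */
        return 1;
    }
    return 2;                        /* even flips with nonzero syndrome       */
}

int main(void)
{
    codeword c = encode(0xB);
    c.bit[6] ^= 1;                                       /* single-bit upset  */
    printf("single upset -> status %d\n", check(&c));    /* 1: corrected      */
    c.bit[2] ^= 1; c.bit[5] ^= 1;                        /* double-bit upset  */
    printf("double upset -> status %d\n", check(&c));    /* 2: detected only  */
    return 0;
}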
Based on transient-error models,2 we would expect a single-bit error in a SiCortex node
every two years. But with 972 nodes in the system, we’ll see a single-bit error every 16
hours. It’s easy to see why ECC might not be necessary in a commodity-based system
designed for the desktop, but when systems are built out of hundreds and thousands of
components, error rates are such that ECC becomes mandatory.
The double-bit, transient-error rate is substantially lower (by orders of magnitude) than the
single-bit error rate, because a double-bit error requires either an upset event that affects
two or more bits, or two upset events occurring in the same word at different times. This
former case requires very energetic particles that are quite rare. The latter case is extremely
improbable; failure times are in excess of 1 billion hours in machine-room environments.
2. We use a simple model of MTTE = 10^12 / (4 × bits), based on an installation at 7000 feet
above sea level.
Communication Errors
The SC5832 fabric transports data at an aggregate rate of more than 50 × 10^12 bits per
second. At this rate, single-bit errors are inevitable. Predicting these bit-error rates accurately during system design is difficult. Instead, we have assumed that bit errors are
extremely frequent and have designed the fabric to recover.
Links carry messages through the fabric in packets. As a link transmits each packet, it
writes a cyclic redundancy check word (CRC) into the last word of the packet. The receiving link checks the CRC and, if no discrepancy is found, the receiver acknowledges receipt
of the packet. If the receiver detects an error, it asks for a retransmission of all packets since
the last correctly received and acknowledged packet. This mechanism maintains packet
delivery order at the expense of a replay buffer in the transmitting end of each link.
As a result of the per-link error-control strategy and ECC on all intermediate storage, the
communication software can treat the fabric communications hardware as a reliable, in-order
channel. This approach removes the burden of low-level error recovery in software
that is imposed, for instance, by Ethernet-based solutions.
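The following C sketch is a toy model of this per-link recovery scheme: the sender keeps unacknowledged packets in a replay buffer, a simulated link occasionally flips a bit, and the receiver accepts only packets whose CRC checks and which arrive in order, so corrupted or skipped packets are simply sent again. The window size, error rate, and CRC polynomial are arbitrary illustrative choices, not the fabric's actual parameters.

/* Toy model of the per-link recovery scheme described above.  The CRC word
 * itself is assumed to arrive intact, a simplification for the sketch. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW   8              /* replay buffer depth (assumed)           */
#define NPKT     32             /* packets to deliver in this demo         */
#define ERR_RATE 0.2            /* chance that a transmission is corrupted */

static uint32_t crc32(const uint8_t *p, int len)
{
    uint32_t crc = 0xFFFFFFFFu;             /* reflected CRC-32 polynomial */
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

int main(void)
{
    uint8_t payload[NPKT][8];
    for (int i = 0; i < NPKT; i++)          /* the sender's replay buffer   */
        for (int j = 0; j < 8; j++) payload[i][j] = (uint8_t)(i * 8 + j);

    int base = 0;                           /* first unacknowledged packet  */
    int delivered = 0;                      /* receiver's in-order count    */
    int transmissions = 0;

    while (base < NPKT) {
        int limit = base + WINDOW < NPKT ? base + WINDOW : NPKT;
        for (int seq = base; seq < limit; seq++) {   /* send the open window */
            uint8_t frame[8];
            memcpy(frame, payload[seq], 8);
            uint32_t sent_crc = crc32(frame, 8);
            transmissions++;
            if ((double)rand() / RAND_MAX < ERR_RATE)
                frame[rand() % 8] ^= (uint8_t)(1u << (rand() % 8)); /* bit flip */
            if (crc32(frame, 8) != sent_crc)
                continue;                   /* CRC mismatch: receiver drops it   */
            if (seq == delivered)
                delivered++;                /* good and in order: accept and ack */
            /* good but out of order (it followed a dropped packet): ignored;
             * the sender replays it from the buffer on the next pass       */
        }
        base = delivered;                   /* cumulative acknowledgement   */
    }
    printf("%d packets delivered in order using %d transmissions\n",
           delivered, transmissions);
    return 0;
}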
Cooling
With 972 nodes in a single cabinet, the maximum power dissipation of an SC5832 is 18
kilowatts. Reliability of the system is greatly enhanced by efficient cooling that maintains
low component temperatures.
The SC5832 has a large cooling intake aperture in the middle of the cabinet (see Figure 1).
Air enters the upper cabinet from the front and back sides of this intake, passes through the
fan tray, cools the 27-node modules, and exits from the top of the machine. The system
regulates the speed of the cooling fans, keeping the die temperatures of the node chips
below 100 degrees Celsius for machine room temperatures up to 35 degrees Celsius. The
cooling system is capable of moving more than 4000 cubic feet of air per minute through
the machine.
The cooling fans (see Figure 6) are multiply redundant. The fan tray can be replaced without taking the system out of service.
FIGURE 6. Card Cage and Blowers
The SiCortex Software
SiCortex provides a complete suite of software tools that lets users develop and run
applications on our systems. The suite includes:
• Boot and diagnostic software
• A complete Linux operating system: kernel, device drivers, and applications
• Compilers
• Libraries
• Debuggers
• Performance analysis tools.
Most of the software components for the SC5832 are open source.
Boot and Diagnostic Software
The nodes of the SC5832 have no boot ROM. Instead, their initial programs are loaded
over JTAG-compatible scan chains by a uClinux-based microprocessor on each 27-node
module. The MSPs communicate over a control network with a high-reliability Linux
server acting as the SSP. The SSP is responsible for diagnostics, logging, and system management.
The Linux Kernel
The SiCortex systems run one Linux kernel on each six-processor SMP node. The SiCortex
Linux kernel is based on the kernel.org 2.6 sources, with modifications from linux-mips.org to support the MIPS processors, and patches from Cluster File Systems® to support the Lustre® file system. SiCortex also adds system-specific changes to support the
node chip’s interrupt system and ECC logic, 64 KB virtual memory pages, and virtual
memory integration for the DMA Engine.
On a node, the Linux kernel runs three SiCortex-specific device drivers:
• The fabric driver, which supports use of the DMA Engine and fabric by application
MPI libraries.
• The SCethernet driver, which is a Linux network driver that transmits IP frames over
the fabric.
• The MSP driver, which supports network and console communications with the MSP
and the SSP.
Networking
The SCethernet device driver provides network connectivity between all nodes in the system, transmitting IP frames over the fabric. With SCethernet, all nodes appear to be on a
single network segment.
For external network connectivity, the SC5832 system includes 72 Gigabit Ethernet ports.
The 108 PCIe module ports can be used to provide additional network connections.
Customers can configure the nodes in the system either to use externally visible IP
addresses or to use network address translation (NAT) via a number of gateway nodes. The
system supports both IPv4 and IPv6 address allocation.
The File Systems
The SiCortex systems access storage in several ways:
• The systems can be connected directly to external storage arrays by attaching Fibre
Channel or InfiniBand adapters to the nodes that have PCIe I/O interfaces. These I/O
nodes can then provide storage access to other nodes in the system, acting either as
Lustre servers (using the Lustre parallel file system developed by Cluster File Systems) or as NFS servers. This configuration provides exceptional high-performance
access to large application data sets.
• SiCortex nodes can use networked file systems, including both the Lustre file system
and NFS, to connect to external file servers. Such a configuration can provide access
to user home directories, or to file servers that provide application data which is
shared with other computing resources.
• The SSP includes a RAID array that is used to boot cluster nodes and provide their
root file system data, as well as to store management information and system logs.
For more information about parallel I/O with the SiCortex systems, see “The Lustre High
Performance File System” at www.sicortex.com.
The Linux Distribution
The SiCortex Linux environment is derived from the Gentoo Linux distribution. Gentoo is a
source-based distribution that supports many different processor architectures, including
MIPS. The Gentoo system provides powerful control over optimization and package dependencies, and supports both 32-bit and 64-bit programming interfaces. SiCortex Linux
software releases include binary packages for simple installation and upgrade, as well as
the Gentoo tools for building from source code.
For more information about Gentoo Linux, see www.gentoo.org.
Compilers and Tool Chains
SiCortex provides two compiler suites for our systems: GNU compilers supporting C and
C++, and the QLogic™ PathScale™ compiler suite supporting Fortran 77, Fortran 95, C,
and C++.
The GNU compilers are gcc version 4.1. Binaries produced by both sets of compilers are
interoperable.
With both compiler suites, users can choose to build their software using native compilers,
which run on the SiCortex nodes, or using cross-compilers, which run on external x86
Linux systems.
The SiCortex software includes the GNU gdb debugger. Customers can also license the
TotalView® debugger, supported for the SiCortex systems, from Etnus®.
For more information about QLogic PathScale compilers, see www.pathscale.com. For
more information about the TotalView debugger, see www.etnus.com.
MPI
The SiCortex MPI library is a cornerstone of our high-performance computing software
environment. The MPI library allows user-level applications direct access to our fabric
interconnect hardware, without OS system calls in communication-critical paths. We have
optimized the library to provide extremely low latency for short messages, to use zero-copy high-bandwidth RDMA primitives for long messages, and to take advantage of specialized DMA Engine features that accelerate MPI collective operations.
Our MPI implementation is derived from the popular MPICH2 software from Argonne
National Laboratory. At present, we support all MPI-1 and selected MPI-2 features.
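As a user-level illustration of the programming model, the ping-pong microbenchmark below measures short-message latency using only standard MPI-1 calls; nothing in it is SiCortex-specific, and it must be launched with at least two ranks (for example, mpirun -np 2 ./pingpong).

/* Minimal MPI ping-pong: the kind of short-message exchange a low-latency
 * MPI path is aimed at.  Generic MPI-1 code; run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char msg[8] = "ping";
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("average one-way latency: %.2f microseconds\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));
    MPI_Finalize();
    return 0;
}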
For more information about the SiCortex MPI implementation, see “A New Generation of
Cluster Interconnect” at www.sicortex.com.
Libraries
The SiCortex software includes a rich suite of mathematics and scientific library packages
that are important to developers of scientific software. These include: BLAS, LINPACK,
FFT, LAPACK, CBLAS, BLACS, Hypre, ScaLAPACK, METIS, ParMETIS, PETSc, SUNDIALS, SuperLU, SPRNG, NetCDF, HDF4, and HDF5. We invest substantial effort porting,
optimizing, testing, and maintaining these libraries so that users do not need to do so.
For information about the software libraries included with the SiCortex system, see “The
SiCortex Application Development Environment” at www.sicortex.com.
Performance Tools
Recognizing that performance optimization is a critical and difficult step in developing
HPTC applications, SiCortex includes a comprehensive set of tools that allow users to
both instrument and analyze the performance of their applications. These tools include
PAPI, TAU, Vampir NG, and other packages. In many cases, we have adapted these tools
to take advantage of specialized performance-measurement abilities of the SiCortex hardware.
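As an example of the kind of instrumentation these tools build on, the sketch below reads two standard PAPI preset counters around a small kernel. Which presets actually map onto the node's hardware counters depends on the platform and the PAPI configuration, so the event choice here is only an assumption.

/* Sketch of counter-based instrumentation using PAPI preset events.
 * Preset availability is platform- and configuration-dependent. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events = PAPI_NULL;
    long long counts[2];
    double a[1024], b[1024], dot = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&events);
    PAPI_add_event(events, PAPI_FP_OPS);    /* floating-point operations */
    PAPI_add_event(events, PAPI_TOT_CYC);   /* total cycles              */

    for (int i = 0; i < 1024; i++) { a[i] = i; b[i] = 1024 - i; }

    PAPI_start(events);
    for (int i = 0; i < 1024; i++)          /* the region being measured */
        dot += a[i] * b[i];
    PAPI_stop(events, counts);

    printf("dot=%g  fp_ops=%lld  cycles=%lld\n", dot, counts[0], counts[1]);
    return 0;
}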
For more information about the performance management tools, see “The SiCortex Application Development Environment” at www.sicortex.com.
Job Control and System Management
In keeping with common industry practice, the SiCortex systems use resource management software to
coordinate the execution of jobs, with the assistance of a job queue or scheduling manager. Our resource manager, based on the SLURM system from Lawrence Livermore
National Laboratory, provides a mechanism for executing jobs across the available cluster
resources, including CPUs, memory, and communications. SLURM is capable of running a
variety of job schedulers, such as Maui or the commercially available Moab and LSF schedulers.
The SiCortex system management software collects error, environmental, and performance data, and makes this data available to users and system managers through a variety
of monitoring interfaces.
For more information about the SLURM system, see www.llnl.gov/linux/slurm.
Conclusion
The SiCortex systems, including our flagship SC5832, are Linux/MPI clusters, only more
so. They offer nearly six Teraflops, with up to eight Terabytes of memory, six Terabytes per
second of interconnect, and 250 Gigabytes per second of I/O in one box, drawing only 18
Kilowatts of power. They deliver outstanding performance to existing applications.
Trademark and Copyright Information
Copyright © 2006 SiCortex Incorporated. All rights reserved.
SiCortex Technical Summary
The following are trademarks of their respective companies or organizations:
Etnus and TotalView are registered trademarks of Etnus, LLC.
EXPRESSMODULE is the trademark of PCI-SIG.
InfiniBand is the registered trademark of the InfiniBand Trade Association.
Intel is the registered trademark of Intel Corporation.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. The registered trademark Linux is used pursuant to a sublicense from the Linux Mark Institute, the
exclusive licensee of Linus Torvalds, owner of the mark in the U.S. and other countries.
Lustre is the registered trademark of Cluster File Systems, Inc.
MIPS and MIPS64 are registered trademarks of MIPS Technologies, Inc.
PathScale, QLogic PathScale, and combinations thereof are trademarks of PathScale, Inc.
PCI, PCI Express, and PCIe are the registered trademarks of PCI-SIG.