BlueGene/L System Software
Derek Lieber
IBM T. J. Watson Research Center
February 2004

Topics
- Programming environment: compilation, execution, debugging
- Programming model: processors, memory, files, communications
- What happens under the covers

Programming on BG/L
- A single application program image
- Running on tens of thousands of compute nodes
- Communicating via message passing
- Each image has its own copy of memory and file descriptors

Programming on BG/L
- A "job" is encapsulated in a single host-side process:
  - A merge point for compute-node stdout streams
  - A control point for signaling (ctl-c, kill, etc.), debugging (attach, detach), and termination (exit status collection and summary)

Programming on BG/L
- Cross-compile the source code
- Place the executable onto the BG/L machine's shared filesystem
- Run it: "blrun <job information> <program name> <args>"
- Stdout of all program instances appears as stdout of blrun
- Files go to a user-specified directory on the shared filesystem
- blrun terminates when all program instances terminate
- Killing blrun kills all program instances

Compiling and Running on BG/L
[Figure: sources are cross-compiled on a workstation; programs and data files go onto the shared filesystem seen by the BG/L machine, which also has a local filesystem; stdout flows back to the workstation]

Programming Models
- "Coprocessor model": 64k instances of a single application program; each has a 255MB address space and two threads (main, coprocessor) with non-coherent shared memory
- "Virtual node model": 128k instances; 127MB address space; one thread (main)

Programming Model
- Does a job behave like a group of processes, or a group of threads?
- A little bit of each

A process group?
- Yes:
  - Each program instance has its own memory and file descriptors
- No:
  - Instances can't communicate via mmap or shmat
  - Can't communicate via pipes or sockets
  - Can't communicate via signals (kill)

A thread group?
- Yes:
  - The job terminates when all program instances terminate via exit(0), or when any program instance terminates, voluntarily via exit(!0) or involuntarily via an uncaught signal (kill, abort, segv, etc.)
- No:
  - Each program instance has its own set of file descriptors
  - Each has its own private memory space

Compilers and libraries
- GNU C, Fortran, and C++ compilers can be used with BG/L, but they do not exploit the 2nd FPU
- IBM xlf/xlc compilers have been ported to BG/L, with code generation and optimization features for the dual FPU
- Standard glibc library
- MPI for communications (a minimal example appears after this section)

System calls
- Traditional ANSI plus "a little" POSIX:
  - I/O: open, close, read, write, etc.
  - Time: gettimeofday, etc.
  - Signal catchers: synchronous (sigsegv, sigbus, etc.) and asynchronous (timers and hardware events)

System calls
- No "unix stuff": fork, exec, pipe, mount, umount, setuid, setgid
- No system calls needed to access most hardware: tree and torus fifos, global OR, mutexes and barriers, performance counters
- Mantra: keep the compute nodes simple; the kernel stays out of the way and lets the application program run

Software Stack in BG/L Compute Node
- CNK controls all access to hardware, and enables a bypass for application use
- User-space libraries and applications can directly access the torus and tree through the bypass
- As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy
[Figure: application code and user-space libraries run on CNK, with a bypass path straight to the BG/L ASIC]
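To make the programming model above concrete, here is a minimal sketch of what one of these application images looks like: plain ANSI C plus MPI, cross-compiled and launched with blrun. This is a generic MPI example written for illustration, not code from the original slides; the only APIs assumed are standard MPI calls.

```c
/* Minimal sketch of a BG/L-style application: ANSI C plus MPI.
 * Every program instance runs this same image with its own private
 * memory and file descriptors; instances interact only by message
 * passing. (Generic MPI illustration, not from the original deck.) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this instance's identity */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of instances */

    /* Message passing is the only inter-instance channel: there is no
     * shared memory, and no pipes, sockets, or signals between instances. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* stdout of every instance is merged into the stdout of blrun. */
    if (rank == 0)
        printf("%d instances, sum of ranks = %d\n", size, sum);

    MPI_Finalize();
    return 0;  /* the job ends when all instances exit(0) */
}
```

Per the termination rules above, if any single instance returned a nonzero status or died on an uncaught signal, the whole job would terminate.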
What happens under the covers?
- The machine
- The job allocation, launch, and control system
- The machine monitoring and control system

The machine
- Nodes: IO nodes, compute nodes, link nodes
- Communications networks: ethernet, tree, torus, global OR, JTAG

The IO nodes
- 1024 nodes
- Talk to the outside world via ethernet
- Talk to the inside world via the tree network; not connected to the torus
- Embedded linux kernel
- Purpose is to run the network filesystem and job control daemons

The compute nodes
- 64k nodes, each with 2 cpus and 4 fpus
- Application programs execute here
- Custom kernel:
  - Non-preemptive; kernel and application share the same address space, with the kernel memory-protected
  - The application program has full control of all timing issues
  - Kernel provides program load / start / debug / termination and file access, all via message passing to the IO nodes (a sketch of this appears at the end of the transcript)

The link nodes
- Signal routing, no computation
- Stitch together cards and racks of IO and compute nodes into "blocks" suitable for running independent jobs
- Isolate each block's tree, torus, and global OR networks

Machine configuration
[Figure: the machine manager on a host reaches the machine core and link nodes over jtag/ethernet]

Kernel booting and monitoring
[Figure: the machine manager on the host uses jtag/ethernet to boot and monitor the 1024 ciod daemons, one per IO node, and behind each of them a group of 64 cnk kernels on compute nodes]

Job execution
[Figure: blrun on the host connects over tcp/ethernet to the 1024 ciod daemons; each ciod controls its 64 cnk compute nodes over the tree network]

Blue Gene/L System Software Architecture
[Figure: front-end nodes, console, file servers, and the scheduler connect over ethernet to the service node running MMCS and DB2; the service node controls the hardware through an IDo chip over JTAG/I2C; the machine is divided into psets 0 through 1023, each pairing an I/O node (Linux, ciod) with compute nodes C-Node 0 through C-Node 63 (running CNK), linked by the tree and torus networks]

Conclusions
- BG/L system software must scale to very large machines
- Custom solution (CNK) on compute nodes for high performance
- Linux solution on I/O nodes for flexibility and functionality
- MPI as the default programming model
- Hierarchical organization for management; flat organization for programming
- Mixed conventional/special-purpose operating systems
- Many challenges ahead, particularly in performance, scalability, and reliability
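As a closing illustration of the file-access path described under "The compute nodes": application code uses only ordinary ANSI/POSIX calls, and CNK forwards each request over the tree network to ciod on the pset's IO node, which performs it against the shared filesystem. The sketch below is a generic illustration written under that assumption, not code from the deck; nothing BG/L-specific appears in the source, since the forwarding is transparent to the program.

```c
/* Sketch: the "a little POSIX" I/O a CNK application would issue.
 * On BG/L, each of these calls is assumed to be shipped over the tree
 * network to the ciod daemon on the IO node, which executes it against
 * the shared filesystem and returns the result. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;
    const char buf[] = "hello from a compute node\n";
    int fd;

    gettimeofday(&t0, NULL);                 /* time calls are supported */

    fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");                      /* error status returned by the IO node */
        return 1;
    }
    if (write(fd, buf, strlen(buf)) < 0)     /* shipped to ciod on the IO node */
        perror("write");
    close(fd);

    gettimeofday(&t1, NULL);
    printf("I/O took %ld usec\n",
           (long)((t1.tv_sec - t0.tv_sec) * 1000000L
                  + (t1.tv_usec - t0.tv_usec)));
    return 0;
}
```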