High-Performance, Dependable Multiprocessor
John David Eriksen
Jamie Unger-Fink
Background and Motivation

• Traditional space computing limited primarily to mission-critical applications
◦ Spacecraft control
◦ Life support
• Data collected in space and processed on the ground
• Data sets in space applications continue to grow
Background and Motivation

• Communication bandwidth not growing fast enough to cope with the increasing size of data sets
◦ Instruments and sensors grow in capability
• Increasing need for on-board data processing
◦ Perform data filtering and other operations on board
◦ Autonomous systems demand more computing power
Related Work

• Advanced Onboard Signal Processor (AOSP)
◦ Developed in the 1970s and 1980s
◦ Helped develop understanding of radiation effects on computing systems and components
• Advanced Architecture Onboard Processor (AAOP)
◦ Engineered new approaches to onboard data processing
Related Work

• Space Touchstone
◦ First COTS-based, fault-tolerant (FT), high-performance system
• Remote Exploration and Experimentation (REE)
◦ Extended FT techniques to parallel and cluster computing
◦ Focused on low-cost, high-performance, power-efficient compute cluster designs
Goal

• Address the need for increased on-board data processing
• Bring COTS systems to space
◦ COTS (Commercial Off-The-Shelf)
 Less expensive
 General-purpose
 Need special considerations to meet the requirements of aerospace environments
 Fault tolerance
 High reliability
 High availability
Dependable Multiprocessor is…

• A reconfigurable cluster computer with centralized control
Dependable Multiprocessor is…

• A hardware architecture
◦ High-performance characteristics
◦ Scalable
◦ Upgradable (thanks to reliance on COTS)
• A parallel processing environment
◦ Supports a common scientific computing development environment (FEMPI)
• A fault-tolerant computing platform
◦ System controllers provide FT properties
• A toolset for predicting application behavior
◦ Fault behavior, performance, availability…
Hardware Architecture

• Redundant radiation-hardened system controller
• Cluster of COTS-based reconfigurable data processors
• Redundant COTS-based packet-switched networks
• Radiation-hardened mass data store
• Redundancy available in:
◦ System controller
◦ Network
◦ Configurable N-of-M sparing in compute nodes
Hardware Architecture

• Scalability
◦ Variable number of compute nodes
◦ Cluster-of-clusters
• Compute nodes
◦ IBM PowerPC 750FX general-purpose processor
◦ Xilinx Virtex-II 6000 FPGA co-processor
 Reconfigurable to fulfill various roles
 DSP
 Data compression
 Vector processing
 Applications implemented in hardware can be very fast
◦ Memory and other support chips
Hardware Architecture

• Network Interconnect
◦ Gigabit Ethernet for data exchange
◦ A low-latency, low-bandwidth bus for control
• Mission Interface
◦ Provides the interface to the rest of the space vehicle’s computer systems
◦ Radiation-hardened
Hardware Architecture

• Current hardware implementation
◦ Four data processors
◦ Two redundant system controllers
◦ One mass data store
◦ Two gigabit Ethernet networks, including two network switches
◦ Software-controlled instrumented power supply
◦ Workstation running spacecraft system emulator software
Software Architecture

• Platform layer is the lowest layer; it interfaces the hardware to the middleware and contains hardware-specific software and network drivers
◦ Uses Linux, which allows use of many existing software tools
• Mission layer
• Middleware: includes DM System Services (fault tolerance, job management, etc.)

• DM Framework is application-independent and platform-independent
• API to communicate with the mission layer; SAL (System Abstraction Layer) for the platform layer
• Allows for future applications by facilitating porting to new platforms


HA Middleware

• HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), Cluster Management (CMS)
• Primary functions
◦ Resource monitoring
◦ Fault detection, diagnosis, recovery, and reporting
◦ Cluster configuration
◦ Event logging
◦ Distributed messaging
• Based on a small, cross-platform kernel


• Hosted on the cluster’s system controller
• Managed resources include:
◦ Applications
◦ Operating system
◦ Chassis
◦ I/O cards
◦ Redundant CPUs
◦ Networks
◦ Peripherals
◦ Clusters
◦ Other middleware





Distributed Messaging Service (DMS)

• Provides a reliable messaging layer for communications in the DM cluster
• Used for checkpointing, client/server communications, event notification, fault management, and time-critical communications
• An application opens a DMS connection (channel) to pass data to interested subscribers
• Because messaging is handled in the middleware rather than in lower layers, an application does not have to explicitly specify a destination address
• Messages are classified by type, and machines choose to receive messages of a certain type (see the sketch below)
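
The DMS API itself is not given in the presentation; the following C sketch is a hypothetical illustration of type-classified, destination-free publish/subscribe in the spirit described above. All names (dms_subscribe, dms_publish, msg_type_t) are invented for illustration.

#include <stdio.h>

#define MAX_SUBS 8

/* Messages are classified by type; receivers subscribe to a type. */
typedef enum { MSG_CHECKPOINT, MSG_FAULT, MSG_EVENT } msg_type_t;
typedef void (*handler_t)(const char *payload);

static struct { msg_type_t type; handler_t fn; } subs[MAX_SUBS];
static int nsubs = 0;

/* Subscribers register interest in a message type, not a sender. */
static void dms_subscribe(msg_type_t type, handler_t fn) {
    if (nsubs < MAX_SUBS) { subs[nsubs].type = type; subs[nsubs].fn = fn; nsubs++; }
}

/* Publishers send on a typed channel; no destination address is given. */
static void dms_publish(msg_type_t type, const char *payload) {
    for (int i = 0; i < nsubs; i++)
        if (subs[i].type == type) subs[i].fn(payload);
}

static void on_fault(const char *p) { printf("FT manager saw fault: %s\n", p); }

int main(void) {
    dms_subscribe(MSG_FAULT, on_fault);
    dms_publish(MSG_FAULT, "node 3 heartbeat missed"); /* routed by type */
    return 0;
}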



Cluster Management Service (CMS)

• Manages physical nodes or instances of HA middleware
• Discovers and monitors nodes in a cluster
• Passes node failures to AMS and the FT Manager via DMS



Other middleware services

• Database management
• Logging services
• Tracing



• Interface to the control computer or ground station
• Communicates with the system via DMS
• Monitors system health with the FT Manager
◦ “Heartbeat”



Fault-Tolerance Manager (FTM)

• Detects and recovers from system faults
• Refers to a set of recovery policies at runtime
• Relies on distributed software agents to gather system and application liveness information (see the sketch below)
◦ Avoids a monitoring bottleneck
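
The presentation describes the liveness agents only at a high level; this is a minimal C sketch of heartbeat-based failure detection under an assumed miss-count policy. NODES, MAX_MISSES, and ftm_tick are invented for illustration and are not from the DM implementation.

#include <stdio.h>

#define NODES 4
#define MAX_MISSES 3 /* assumed policy: declare a node dead after 3 missed beats */

static int misses[NODES];

/* Called once per heartbeat period with a bitmask of nodes heard from. */
static void ftm_tick(unsigned heard) {
    for (int n = 0; n < NODES; n++) {
        if (heard & (1u << n))
            misses[n] = 0;                  /* beat arrived: reset the counter */
        else if (++misses[n] == MAX_MISSES) /* too many consecutive misses */
            printf("FTM: node %d declared failed, starting recovery\n", n);
    }
}

int main(void) {
    ftm_tick(0xF); /* all four nodes beat */
    ftm_tick(0x7); /* node 3 silent */
    ftm_tick(0x7);
    ftm_tick(0x7); /* third consecutive miss: node 3 declared failed */
    return 0;
}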





Job Manager (JM)

• Provides application scheduling and resource allocation
• Opportunistic load-balancing scheduler (see the sketch below)
• Jobs are registered and tracked by the JM via tables
• Checkpointing allows seamless recovery of the JM
• Heartbeats to the FTM via middleware
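
As a rough illustration of opportunistic load balancing, this C sketch dispatches each job to the node with the least outstanding work; the load table is a stand-in for the JM's job tables, and all names are assumptions for illustration.

#include <stdio.h>

#define NODES 4

static int load[NODES]; /* outstanding jobs per node (stand-in for the JM's tables) */

/* Opportunistic policy: give the job to whichever node is least busy now. */
static int dispatch(int job_id) {
    int best = 0;
    for (int n = 1; n < NODES; n++)
        if (load[n] < load[best]) best = n;
    load[best]++;
    printf("JM: job %d -> node %d\n", job_id, best);
    return best;
}

int main(void) {
    for (int job = 0; job < 6; job++)
        dispatch(job); /* jobs spread evenly across the idle cluster */
    return 0;
}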

FEMPI

• Fault-Tolerant Embedded Message Passing Interface
◦ Application-independent FT middleware
◦ Implements the Message Passing Interface (MPI) standard (see the sketch below)
◦ Built on top of the HA middleware
• Recovery from failure should be automatic, with minimal impact
• Needs to maintain global awareness of the processes in parallel applications
• Three stages:
◦ Fault detection
◦ Notification
◦ Recovery
• Process failures vs. network failures
• Survives the crash of n-1 processes in an n-process job
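
Because FEMPI implements the MPI standard, application code can look like ordinary MPI; the C sketch below uses only standard MPI calls and no FEMPI-specific APIs (none are given in the presentation). The detection, notification, and recovery stages happen beneath this interface.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one value; the reduction is a stand-in for
     * a parallel science kernel. Per the stated design intent, a crashed
     * peer would be detected and reported by the middleware rather than
     * silently hanging the job. */
    int local = rank + 1, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}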


USURP

• Proprietary nature of the FPGA industry motivates a standard
• USURP - USURP’s Standard for Unified Reconfigurable Platforms
◦ Standard for interacting with hardware
◦ Provides middleware for portability
◦ Black-box IP cores
◦ Wrappers mask FPGA board differences



• Not a universal tool for mapping high-level code to hardware designs
• OpenFPGA
• Adaptive Computing System (ACS) vs. USURP
◦ Object-oriented models vs. software APIs



• IGOL
• BLAST
• CARMA

• Responsible for:
 Unifying vendor APIs
 Standardizing the HW interface
 Organization of data for the user application core
 Exposing the developer to common FPGA resources


• User-level protocol for system recovery
• Consists of:
◦ Server process that runs on the Mass Data Store
 Uses DMS
◦ API for applications (see the sketch below)
 C-type interfaces
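
The presentation specifies only a checkpoint server on the mass data store and a C-style application API, so the interface below (ckpt_save/ckpt_restore) is invented for illustration; an in-memory struct stands in for the remote server, which in the real system would be reached via DMS.

#include <stdio.h>
#include <string.h>

typedef struct { char name[32]; char data[256]; int len; } ckpt_t;

static ckpt_t store; /* stand-in for the server on the mass data store */

/* Save a named application state blob (hypothetical API). */
static int ckpt_save(const char *name, const void *buf, int len) {
    snprintf(store.name, sizeof store.name, "%s", name);
    memcpy(store.data, buf, (store.len = len));
    return 0;
}

/* Restore the blob after a restart; fails if the name or size mismatch. */
static int ckpt_restore(const char *name, void *buf, int len) {
    if (strcmp(store.name, name) != 0 || store.len != len) return -1;
    memcpy(buf, store.data, len);
    return 0;
}

int main(void) {
    double state[4] = {1, 2, 3, 4}, recovered[4];
    ckpt_save("kernel.state", state, sizeof state); /* before the next step */
    if (ckpt_restore("kernel.state", recovered, sizeof recovered) == 0)
        printf("restarted from checkpoint: %.0f %.0f ...\n",
               recovered[0], recovered[1]);
    return 0;
}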



ABFT

• Algorithm-Based Fault Tolerance library
• Collection of mathematical routines that can detect and correct faults
• BLAS-3 library
◦ Matrix multiply, LU decomposition, QR decomposition, singular-value decomposition (SVD), and fast Fourier transform (FFT)
• Uses checksums (see the sketch below)
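
A minimal C sketch of the checksum technique commonly used for ABFT matrix multiply (the Huang/Abraham style is assumed here, since the slides say only "checksums"): an extra row of column sums carried through the multiplication turns a corrupted product element into a detectable checksum mismatch. Sizes and tolerance are illustrative.

#include <stdio.h>
#include <math.h>

#define N 3

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double Ac[N+1][N], C[N+1][N]; /* A with a checksum row appended */

    /* Build the column-checksum version of A: last row = column sums. */
    for (int j = 0; j < N; j++) {
        Ac[N][j] = 0;
        for (int i = 0; i < N; i++) { Ac[i][j] = A[i][j]; Ac[N][j] += A[i][j]; }
    }

    /* C = Ac * B: the last row of C is now a checksum of the rows above. */
    for (int i = 0; i <= N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++) C[i][j] += Ac[i][k] * B[k][j];
        }

    C[1][2] += 100.0; /* inject a fault into the product */

    /* Verify: each column of C must sum to its checksum-row entry. */
    for (int j = 0; j < N; j++) {
        double s = 0;
        for (int i = 0; i < N; i++) s += C[i][j];
        if (fabs(s - C[N][j]) > 1e-9)
            printf("ABFT: fault detected in column %d\n", j);
    }
    return 0;
}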


• Triple Modular Redundancy (TMR)
• Process-level replication (see the sketch below)
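
A minimal C sketch of the majority vote at the heart of triple modular redundancy; with process-level replication the three inputs would be the outputs of three replicas of the same computation. Bit-exact agreement between correct replicas is assumed.

#include <stdio.h>

/* Return the majority value of three redundant results.
 * (-1 as a miscompare sentinel is a simplification; a real voter would
 * signal the failure out of band.) */
static int vote3(int a, int b, int c) {
    if (a == b || a == c) return a; /* a agrees with at least one copy */
    if (b == c) return b;
    return -1;                      /* no majority: unrecoverable miscompare */
}

int main(void) {
    /* One replica returns a corrupted result; the vote masks it. */
    int r = vote3(42, 42, 7);
    printf("voted result: %d\n", r); /* prints 42 */
    return 0;
}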
Conclusion



• System architecture has been defined
• Testbench has been assembled
• Improvements:
◦ More aggressively address power-consumption issues
◦ Add support for other scientific computing platforms, such as Fortran