High Performance Cluster Computing
Architectures and Systems
Book Editor: Rajkumar Buyya
Slides Prepared by: Hai Jin
Internet and Cluster Computing Center
Introduction

- Need more computing power
  - Improve the operating speed of processors & other components
    - constrained by the speed of light, thermodynamic laws, & the high financial cost of processor fabrication
  - Connect multiple processors together & coordinate their computational efforts
    - parallel computers
    - allow the sharing of a computational task among multiple processors
Era of Computing

- Rapid technical advances
  - the recent advances in VLSI technology
  - software technology: OS, PL, development methodologies, & tools
  - grand challenge applications have become the main driving force
- Parallel computing
  - one of the best ways to overcome the speed bottleneck of a single processor
  - good price/performance ratio of a small cluster-based parallel computer
Need for More Computing Power:
Grand Challenge Applications

Solving technology problems using computer modeling, simulation, and analysis:

- Geographic Information Systems
- Life Sciences
- CAD/CAM
- Aerospace
- Digital Biology
- Military Applications
Parallel Computer Architectures

- Taxonomy, based on how processors, memory & interconnect are laid out
  - Massively Parallel Processors (MPP)
  - Symmetric Multiprocessors (SMP)
  - Cache-Coherent Nonuniform Memory Access (CC-NUMA)
  - Distributed Systems
  - Clusters
  - Grids
Parallel Computer Architectures

- MPP
  - A large parallel processing system with a shared-nothing architecture
  - Consists of several hundred nodes with a high-speed interconnection network/switch
  - Each node consists of a main memory & one or more processors
    - Runs a separate copy of the OS
- SMP
  - 2-64 processors today
  - Shared-everything architecture
  - All processors share all the global resources available
  - A single copy of the OS runs on these systems
Parallel Computer Architectures

- CC-NUMA
  - a scalable multiprocessor system having a cache-coherent nonuniform memory access architecture
  - every processor has a global view of all of the memory
- Distributed systems
  - considered conventional networks of independent computers
  - have multiple system images, as each node runs its own OS
  - the individual machines could be combinations of MPPs, SMPs, clusters, & individual computers
- Clusters
  - a collection of workstations or PCs that are interconnected by a high-speed network
  - work as an integrated collection of resources
  - have a single system image spanning all their nodes
Towards Low Cost Parallel Computing

- Parallel processing
  - linking together 2 or more computers to jointly solve some computational problem
  - since the early 1990s, an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards networks of workstations
  - driven by the rapid improvement in the availability of commodity high-performance components for workstations and networks
- Low-cost commodity supercomputing
  - from specialized traditional supercomputing platforms to cheaper, general-purpose systems consisting of loosely coupled components built up from single or multiprocessor PCs or workstations
  - needs standardization of many of the tools and utilities used by parallel applications, e.g. MPI, HPF
Windows of Opportunities

- Parallel Processing
  - use multiple processors to build MPP/DSM-like systems for parallel computing
- Network RAM
  - use the memory associated with each workstation as an aggregate DRAM cache
- Software RAID (redundant array of inexpensive disks)
  - use arrays of workstation disks to provide cheap, highly available, & scalable file storage
  - possible to provide parallel I/O support to applications
- Multipath Communication
  - use multiple networks for parallel data transfer between nodes
Cluster Computer and its
Architecture


A cluster is a type of parallel or distributed processing
system, which consists of a collection of interconnected
stand-alone computers cooperatively working together as a
single, integrated computing resource
A node





10
a single or multiprocessor system with memory, I/O facilities,
& OS
generally 2 or more computers (nodes) connected together
in a single cabinet, or physically separated & connected via a
LAN
appear as a single system to users and applications
provide a cost-effective way to gain features and benefits
Cluster Computer Architecture

(figure: architecture of a cluster computer)
Prominent Components of Cluster Computers (I)

- Multiple High Performance Computers
  - PCs
  - Workstations
  - SMPs (CLUMPS)
  - Distributed HPC Systems, leading to Metacomputing
Prominent Components of Cluster Computers (III)

- High Performance Networks/Switches
  - Ethernet (10 Mbps)
  - Fast Ethernet (100 Mbps)
  - Gigabit Ethernet (1 Gbps)
  - SCI (Dolphin - MPI - 12-microsecond latency)
  - ATM
  - Myrinet (1.2 Gbps)
  - Digital Memory Channel
  - FDDI
Prominent Components of Cluster Computers (V)

- Fast Communication Protocols and Services
  - Active Messages (Berkeley)
  - Fast Messages (Illinois)
  - U-net (Cornell)
  - XTP (Virginia)
Prominent Components of Cluster Computers (VI)

- Cluster Middleware
  - Single System Image (SSI)
  - System Availability (SA) Infrastructure
- Hardware
  - DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
- Operating System Kernel/Gluing Layers
  - Solaris MC, Unixware, GLUnix
- Applications and Subsystems
  - Applications (system management and electronic forms)
  - Runtime systems (software DSM, PFS, etc.)
  - Resource management and scheduling software (RMS)
    - CODINE, LSF, PBS, NQS, etc.
Prominent Components of Cluster Computers (VII)

- Parallel Programming Environments and Tools
  - Threads (PCs, SMPs, NOW..)
    - POSIX Threads
    - Java Threads
  - MPI (see the sketch after this list)
    - Linux, NT, on many Supercomputers
  - PVM
  - Software DSMs (TreadMarks)
  - Compilers
    - C/C++/Java
    - Parallel programming with C++ (MIT Press book)
  - RAD (rapid application development tools)
    - GUI-based tools for PP modeling
  - Debuggers
  - Performance Analysis Tools
  - Visualization Tools
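As a concrete illustration of the message-passing style MPI supports (a minimal sketch, not from the slides; it uses only standard MPI-1 calls and builds with any MPI implementation):

    /* hello_mpi.c -- each process reports its rank; rank 0 then
     * gathers the sum of all ranks with MPI_Reduce.
     * Build: mpicc hello_mpi.c -o hello_mpi
     * Run:   mpirun -np 4 ./hello_mpi
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, sum = 0;

        MPI_Init(&argc, &argv);               /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total process count */

        printf("Hello from rank %d of %d\n", rank, size);

        /* Combine values from all processes onto rank 0 */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("Sum of ranks = %d\n", sum);

        MPI_Finalize();
        return 0;
    }

The same SPMD pattern (initialize, query rank/size, communicate, finalize) underlies cluster applications of any scale.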
Prominent Components of Cluster Computers (VIII)

- Applications
  - Sequential
  - Parallel / Distributed (cluster-aware applications)
    - Grand Challenge applications
      - Weather Forecasting
      - Quantum Chemistry
      - Molecular Biology Modeling
      - Engineering Analysis (CAD/CAM)
      - ...
    - PDBs, web servers, data-mining
Key Operational Benefits of Clustering

- High Performance
- Expandability and Scalability
- High Throughput
- High Availability
Clusters Classification (III)

- Node Hardware
  - Clusters of PCs (CoPs)
    - Piles of PCs (PoPs)
  - Clusters of Workstations (COWs)
  - Clusters of SMPs (CLUMPs)
Clusters Classification (V)

- Node Configuration
  - Homogeneous Clusters
    - all nodes have similar architectures and run the same OS
  - Heterogeneous Clusters
    - nodes have different architectures and run different OSs
Clusters Classification (VI)

- Levels of Clustering
  - Group Clusters (#nodes: 2-99)
    - nodes are connected by a SAN like Myrinet
  - Departmental Clusters (#nodes: 10s to 100s)
  - Organizational Clusters (#nodes: many 100s)
  - National Metacomputers (WAN/Internet-based)
  - International Metacomputers (Internet-based, #nodes: 1000s to many millions)
    - Metacomputing
    - Web-based Computing
    - Agent-Based Computing
    - Java plays a major role in web- and agent-based computing
Commodity Components for Clusters (III)

- Disk and I/O
  - Overall improvement in disk access time has been less than 10% per year
  - Amdahl's law
    - speed-up obtained from faster processors is limited by the slowest system component (see the formula after this list)
  - Parallel I/O
    - carry out I/O operations in parallel, supported by a parallel file system based on hardware or software RAID
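For reference, the standard form of Amdahl's law (the formula itself is not on the slide): if a fraction $p$ of the total work is sped up by a factor $s$ while the remainder (e.g., disk I/O) is not, the overall speedup is

$$ S = \frac{1}{(1-p) + p/s} < \frac{1}{1-p} $$

For example, with unaccelerated I/O taking 10% of the run time ($p = 0.9$), the overall speedup can never exceed 10 no matter how fast the processors become, which is why slowly improving disk access times matter so much.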
Cluster Middleware & SSI

- SSI is supported by a middleware layer that resides between the OS and the user-level environment
- Middleware consists of essentially 2 sublayers of SW infrastructure
  - SSI infrastructure
    - glues together the OSs on all nodes to offer unified access to system resources
  - System availability infrastructure
    - enables cluster services such as checkpointing, automatic failover, recovery from failure, & fault-tolerant support among all nodes of the cluster
What is Single System Image (SSI)?

- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
- SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
- A cluster without an SSI is not a cluster.
SSI Boundaries -- an application's SSI boundary

(figure: a batch system's SSI boundary; source: In Search of Clusters)
Single System Image Benefits

- Provides a simple, straightforward view of all system resources and activities from any node of the cluster
- Frees the end user from having to know where an application will run
- Frees the operator from having to know where a resource is located
- Lets the user work with familiar interfaces and commands, and allows administrators to manage the entire cluster as a single entity
- Reduces the risk of operator errors, with the result that end users see improved reliability and higher availability of the system
Single System Image Benefits (Cont'd)

- Allows centralized/decentralized system management and control, reducing the need for skilled administrators at every node
- Presents multiple, cooperating components of an application to the administrator as a single application
- Greatly simplifies system management
- Provides location-independent message communication
- Helps track the locations of all resources so that system operators no longer need to be concerned with their physical location
- Provides transparent process migration and load balancing across nodes
- Improves system response time and performance
Resource Management and Scheduling (RMS)

- RMS is the act of distributing applications among computers to maximize their throughput
- Enables the effective and efficient utilization of the resources available
- Software components
  - Resource manager
    - locating and allocating computational resources, authentication, process creation and migration
  - Resource scheduler
    - queueing applications, resource location and assignment
- Reasons for using RMS
  - provide an increased, and reliable, throughput of user applications on the systems
  - load balancing
  - utilizing spare CPU cycles
  - providing fault-tolerant systems
  - manage access to powerful systems, etc.
- Basic architecture of RMS: a client-server system (a sample batch-job script follows)
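To make the client-server flow concrete, here is a minimal batch-job script for PBS, one of the RMS packages named above (a sketch: the job name, resource amounts, and program are illustrative assumptions; LSF, NQS, etc. use analogous directives):

    #!/bin/sh
    #PBS -N sample_job            # job name (illustrative)
    #PBS -l nodes=4               # request four nodes
    #PBS -l walltime=00:10:00     # wall-clock limit: 10 minutes
    #PBS -o sample_job.out        # file for standard output

    # The scheduler starts this script on an allocated node;
    # $PBS_NODEFILE lists the hosts granted to the job.
    cd "$PBS_O_WORKDIR"
    mpirun -np 4 -machinefile "$PBS_NODEFILE" ./hello_mpi

The client submits the script with "qsub"; the server-side scheduler queues the job until the requested resources become free, then places and runs it.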
Services provided by RMS

- Process Migration
  - when a computational resource has become too heavily loaded
  - for fault-tolerance reasons
- Checkpointing
- Scavenging Idle Cycles
  - 70% to 90% of the time most workstations are idle
- Fault Tolerance
- Minimization of Impact on Users
- Load Balancing
- Multiple Application Queues
Computing Platforms Evolution:
Breaking Administrative Barriers

(figure: performance grows as administrative barriers are broken -- from the desktop (single processor), to SMPs or supercomputers, to a local cluster, to an enterprise cluster/grid, to a global cluster/grid, and perhaps on to an inter-planet cluster/grid; the administrative scopes crossed are individual, group, department, campus, state, national, globe, inter-planet, universe)
Why Do We Need Metacomputing?

- Our computational needs are infinite, whereas our financial resources are finite
- Users will always want more & more powerful computers
- Try & utilize the potentially hundreds of thousands of computers that are interconnected in some unified way
- Need seamless access to remote resources
Towards Grid Computing….
What is a Grid?

- An infrastructure that couples
  - Computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on)
  - Software (e.g., renting expensive special-purpose applications on demand)
  - Databases (e.g., transparent access to the human genome database)
  - Special Instruments (e.g., a radio telescope -- SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars)
  - People (maybe even animals, who knows?)
- ... across the Internet, and presents them as a unified, integrated (single) resource
- http://www.csse.monash.edu.au/~rajkumar/ecogrid/
Conceptual View of the Grid

(figure: conceptual view of the Grid, leading to portal (super)computing)
Grid Application-Drivers

- Old and new applications are being enabled by the coupling of computers, databases, instruments, people, etc.
  - (distributed) Supercomputing
  - Collaborative engineering
  - High-throughput computing
    - large-scale simulation & parameter studies
  - Remote software access / renting software
  - Data-intensive computing
  - On-demand computing
The Grid Impact

"The global computational grid is expected to drive the economy of the 21st century similar to the way the electric power grid drove the economy of the 20th century."
Metacomputer Design Objectives and Issues (II)

- Underlying Hardware and Software Infrastructure
  - A metacomputing environment must be able to operate on top of the whole spectrum of current and emerging HW & SW technology
  - An ideal environment will provide access to the available resources in a seamless manner, such that physical discontinuities such as differences between platforms, network protocols, and administrative boundaries become completely transparent
Metacomputer Design Objectives and Issues (III)

- Middleware -- The Metacomputing Environment
  - Communication services
    - need to support the protocols used for bulk-data transport, streaming data, group communications, and those used by distributed objects
  - Directory/registration services
    - provide the mechanism for registering and obtaining information about the metacomputer structure, resources, services, and status
  - Processes, threads, and concurrency control
    - share data and maintain consistency when multiple processes or threads have concurrent access to it
Metacomputer Design Objectives and Issues (V)

- Middleware -- The Metacomputing Environment
  - Security and authorization
    - confidentiality: preventing disclosure of data
    - integrity: preventing tampering with data
    - authentication: verifying identity
    - accountability: knowing whom to blame
  - System status and fault tolerance
  - Resource management and scheduling
    - efficiently and effectively schedule the applications that need to utilize the available resources in the metacomputing environment
Metacomputer Design Objectives and Issues (VI)

- Middleware -- The Metacomputing Environment
  - Programming tools and paradigms
    - include interfaces, APIs, and conversion tools so as to provide a rich development environment
    - support a range of programming paradigms
    - a suite of numerical and other commonly used libraries should be available
  - User and administrative GUI
    - an intuitive and easy-to-use interface to the services and resources available
  - Availability
    - ports easily onto a range of commonly used platforms, or uses technologies that make it platform neutral
Metacomputing Projects

- Globus (from Argonne National Laboratory)
  - provides a toolkit on a set of existing components to build metacomputing environments
- Legion (from the University of Virginia)
  - provides a high-level unified object model out of new and existing components to build a metasystem
- Webflow (from Syracuse University)
  - provides a Web-based metacomputing environment
Globus (I)

- A computational grid
  - a hardware and software infrastructure providing dependable, consistent, and pervasive access to high-end computational capabilities, despite the geographical distribution of both resources and users
- A layered architecture
  - high-level global services are built upon essential low-level core local services
- Globus Toolkit (GT)
  - a central element of the Globus system
  - defines the basic services and capabilities required to construct a computational grid
  - consists of a set of components that implement basic services
  - provides a bag of services
    - only possible when the services are distinct and have well-defined interfaces (APIs)
Globus (II)

- Globus Alliance
  - http://www.globus.org
- GT 3.0
  - Resource management (GRAM)
  - Information Service (MDS)
  - Data management (GridFTP) (see the example after this list)
  - Security (GSI)
- GT 4.0 (2005)
  - Execution management
  - Information Services
  - Data management
  - Security
  - Common runtime (WS)
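As a small illustration of the data-management component (an assumed example, not from the slides): the toolkit's globus-url-copy client moves files over GridFTP; the hostname and paths below are hypothetical:

    # Copy a remote file to local disk over GridFTP
    # (host and paths are hypothetical)
    globus-url-copy gsiftp://grid.example.org/data/results.dat \
                    file:///home/user/results.dat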
The Impact of Metacomputing

- Metacomputing is an infrastructure that can bond and unify globally remote and diverse resources
- At some stage in the future, our computing needs will be satisfied in the same pervasive and ubiquitous manner in which we use the electric power grid