Grid Research at The University of Hong Kong

Grid Computing in Hong Kong
Dr. Cho-Li Wang
Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
1
Agenda
• Grid computing – a simple picture
• The Hong Kong Grid
• SRG Projects
    • SLIM, ODGPC
    • G-JavaMPI
    • JESSICA2
    • LOTS DSM for Grid
• Summary and Conclusion
2
Grid Computing: A Simple Picture
• Much like “utilities” in our daily lives – electricity, water, etc.
• Advantages:
    • Cost-effectiveness
    • Platform extensibility
    • Convenience (P&P)
[Figure: resource providers (CPU power, memory, network, storage, data, services) and end users, linked by grid computing: access to remote resources via standard protocols for cross-domain collaboration.]
3
Grid Computing in Hong Kong – The Hong Kong Grid
The experimental grid in HK
Supported under HKU Foundation Seed Grant
http://www.hkgrid.org/
4
The Hong Kong Grid (HKGrid)
• Goals:
    • to construct and make available a grid test bed
    • to facilitate the development of grid middleware and applications by local industry and institutions in Hong Kong and their partners in the region
    • to demonstrate the benefits of adopting grid technologies and to showcase any outstanding results of development or application
• HKGrid provides a platform for its members to experiment with various research prototypes and pilot applications
5
HKGrid – Current constituents

Institution                              Computing facilities
City University of HK                    Service gateway (2-way Xeon SMP)
HK Baptist University                    2-way Xeon SMP x 64 (#300 in TOP500, 6/2003)
HK University of Science and Technology  4-way SMP cluster
The HK Polytechnic University            Service gateway (2-way Xeon SMP)
The HK Institute of HPC                  Service gateway (2-way Xeon SMP)
HKU – Computer Centre                    2-way Xeon SMP x 128 (#240 in TOP500, 11/2003)
HKU – Department of CSIS                 Pentium 4 x 300 (#340 in TOP500, 6/2003)

Theoretical maximum computing power: 4 Tflop/s
6
Grid Point Monitoring with Ganglia
URL: http://gideon.csis.hku.hk/status/
7
HKU Grid Point: Grid and Cluster Software
• Grid middleware (remote job submission)
    • Globus Toolkit (GT) 2.0, 2.4, 3.0.1
    • Gatekeeper: gideon.csis.hku.hk
• Job scheduling (local job scheduler)
    • OpenPBS 2.3.16
    • Maui 3.2.5
• Programming
    • HPF, Fortran 90
    • C, C++, Java with MPI
    • JESSICA2 (HKU)
• Communication library (IPC / network communication)
    • MPICH-G2 1.2.3
• Machines shown in the diagram: Gideon, Ostrich, Srgdell, Real
8
Main Computing Facilities:
HKU-CSIS Gideon 300 Cluster
9
Research Projects in HKGrid
• HKBU: Knowledge Grid (autonomous grid service composition)
• HKPU: Peer-to-peer (P2P) grid, meta scheduler, fault tolerance
• HKUST: Development of sensor Grid infrastructure
• HKU
    • ETI: Modelling of Air Quality in Hong Kong (E-Business Technology Institute with the Environmental Protection Department, HKSAR)
    • Computer Centre: HKU campus grid; scientific applications running across the ApGrid
    • CSIS: Robust Speech Recognition (J. Wu and Dr. Q. Huo)
    • CSIS: Simulation for the DNA Shuffling Experiment (W.H. Hon and Dr. T.W. Lam)
    • CSIS: Approximate String Matching on DNA Sequences (L.L. Cheng)
    • CSIS: Whole Genome Alignment via Mutation-Sensitive Sequence Similarity (H.L. Chan, N. Lu, and Dr. T.W. Lam)
    • ME: Parallel Simulation of Turbulent Flow Model (Dr. C.H. Liu, Dept. of Mechanical Engineering)
    • CSIS: HKU Grid Point (863 Project: China National Grid)
    • CSIS: Asia-Pacific Grid
    • …..
10
HKGrid – Connections
• Links to China National Grid (CNGrid) and Asia-Pacific Grid (ApGrid) via CERNET and APAN
• Internet2 connection to the Abilene backbone at Chicago, USA
• Plays the role of a gateway for the other, bigger grids
11
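China National Grid (CNGrid): 863 Project
China National Grid Participants
• Shanghai Supercomputer Center
• Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS)
• The University of Hong Kong (CSIS)
• Xi'an Jiaotong University
• University of Science and Technology of China
• National University of Defense Technology
• Institute of Applied Physics, Chinese Academy of Sciences
• Tsinghua University
Note: the grid system software developed by ICT, CAS has connected the grid nodes of ICT, Huazhong University of Science and Technology, and The University of Hong Kong, through VEGA_GOS …
Supporting software:
VEGA (织女星) grid management system: dynamic service deployment, single sign-on, data replication, and performance monitoring. Developed by the Institute of Computing Technology, Chinese Academy of Sciences
V.1.0 released 8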
12
ApGrid / PRAGMA Testbed
10 countries
21 organizations
22 clusters
853 CPUs
13
ApGrid Demo at the HKU School Open Day (Oct. 2003)
14
Grid Research at HKU-CSIS
SRG Projects
• SLIM + ODGPC
• G-JavaMPI
• JESSICA2
• LOTS DSM
15
Our Goal
To construct an advanced grid computing platform to accommodate utility-like computing via traditional and “pervasive” means
• Utility computing: to aggregate and make use of distributed computing resources transparently
• Traditional means: to utilize the dedicated HPC facilities distributed across institutions
    • Performance and reliability are key
• Pervasive means: any user can be a resource provider (e.g., idle PCs), a consumer, or both
    • Convenience and security are key
16
Research at HKU – An Advanced Grid Computing Platform (Programming Environment)
[Figure: the advanced grid computing platform (AGP), relating its objectives (user's convenience, convenient system administration, performance and reliability) to research issues (load balancing, single-system image, on-demand grid point construction) and to the projects G-JavaMPI, JESSICA, LOTS, SLIM, and ODGPC.]
17
SLIM
Single Linux Image Management
18
SLIM
• Utility computing decouples computing platforms (resources) and computing logic (applications)
    • I.e., a single platform can run completely different applications
• Problem: different applications demand different execution environments (OS, shared libraries, supporting apps, etc.)
• Hassles associated with managing execution environments (EEs) on the resource provider side offset the benefits of resource sharing
• SLIM is a network service for managing and constructing EEs, and disseminating them to remote computing platforms
19
SLIM – System design
How does it work?
• A node sends an EE specification across the network to find the Boot server
• The Boot server delivers the requested Linux kernel
• The Image server constructs an EE by collecting shared libraries, user data, etc.
• The Linux kernel boots and contacts the Image server to “mount” the EE via a file-sharing protocol such as NFS
• Aggressive caching techniques are deployed to optimize performance
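The flow above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class names (`EESpec`, `BootServer`, `ImageServer`), the shape of the EE specification, the component names, and the idea of caching a built EE per specification are all hypothetical and are not taken from the SLIM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EESpec:
    """Hypothetical EE specification: a kernel plus the components it needs."""
    kernel: str
    libraries: tuple
    apps: tuple = ()


class BootServer:
    """Delivers the kernel named in an EE specification."""

    def __init__(self, kernels):
        self.kernels = kernels  # kernel name -> kernel image (bytes)

    def deliver_kernel(self, spec):
        return self.kernels[spec.kernel]


class ImageServer:
    """Builds an EE from shared components and caches the result."""

    def __init__(self, components):
        self.components = components  # component name -> list of files
        self.cache = {}               # EESpec -> assembled file tuple
        self.builds = 0               # how many times an EE was assembled

    def construct_ee(self, spec):
        if spec not in self.cache:    # assemble only on a cache miss
            files = []
            for name in spec.libraries + spec.apps:
                files.extend(self.components[name])
            self.cache[spec] = tuple(files)
            self.builds += 1
        return self.cache[spec]


def boot_node(spec, boot_server, image_server):
    """A node obtains its kernel, then 'mounts' the EE built by the image server."""
    kernel = boot_server.deliver_kernel(spec)
    ee_files = image_server.construct_ee(spec)
    return {"kernel": kernel, "mounted": ee_files}


# Illustrative data only; none of these names come from the slides.
boot = BootServer({"linux-2.4": b"kernel-image"})
images = ImageServer({"glibc": ["libc.so"], "gt3": ["globus-gatekeeper"]})
spec = EESpec(kernel="linux-2.4", libraries=("glibc",), apps=("gt3",))

nodes = [boot_node(spec, boot, images) for _ in range(3)]
print(images.builds, nodes[0]["mounted"])
```

Three nodes that request the same specification trigger a single build in this sketch, which is the kind of saving a cache on the image server could provide.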
20
SLIM – Ongoing and future work
• SLIM has been managing:
    • the HKU-CSIS grid point (350 nodes) for various grid research projects
    • an additional 300+ lab machines for teaching purposes (different courses have different requirements)
• Future work
    • To overcome the challenges in deploying SLIM over broadband links
    • Realizing “pervasive utility computing”
21
On-Demand Grid Point Construction (ODGPC)
1. Software installation at SLIM server (/usr/local/gt3.2)
2. Client boots and obtains kernel (DHCP, TFTP)
3. OS image/App disseminated
4. Process to generate certificates (CA server)
[Figure: four panels, one per step, showing the SLIM server, the clients, and the CA server.]
22
SLIM and ODGPC Performance Evaluation
• 256 PCs: under 5 minutes (OS only)
• Boot up 100 machines (Linux + GT3): 6 minutes
• Generate certificates for 100 machines (Step 4): 30 minutes
• Total time: 6 + 30 = 36 minutes
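A short calculation makes the breakdown of these timings explicit. The input figures come from the slide; the per-machine figure assumes certificates are generated one after another, which the slide does not state.

```python
machines = 100
boot_minutes = 6     # boot Linux + GT3 on 100 machines (from the slide)
cert_minutes = 30    # generate certificates for 100 machines (from the slide)

total_minutes = boot_minutes + cert_minutes

# Share of the total time spent on certificate generation.
cert_share = cert_minutes / total_minutes

# Assumption: certificates are issued sequentially, one machine at a time.
cert_seconds_per_machine = cert_minutes * 60 / machines

print(f"total: {total_minutes} min")
print(f"certificate share: {cert_share:.0%}")
print(f"per-machine certificate time: {cert_seconds_per_machine:.0f} s")
```

Under that assumption, certificate generation rather than booting dominates the time to bring up a grid point.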
23
SLIM – Key references
• http://www.csis.hku.hk/~cmlee/slim/
• C.M. Lee, R.S.C. Ho, D.H.F. Hung, C.L. Wang, and F.C.M. Lau, “Managing Execution Environments for Utility Computing,” Network Research Workshop 2004 (with APAN 2004), March 2004.
• (LinuxPilot 2004/04)
24
G-JavaMPI
A grid-enabled Java-MPI system with
dynamic load-balancing via process
migration
25
G-JavaMPI
• A grid-enabled implementation of the Java binding of MPI, supporting efficient MPI communication among distributed Java processes
• Supports transparent Java process migration (through JVMDI) within and across grid points for balancing CPU and network loads
• Communication-aware process migration policies based on:
    • the application’s communication pattern
    • available network bandwidth on grid overlays
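One way to picture a communication-aware policy is as a cost estimate per candidate grid point that combines the two inputs listed above. The sketch below is an assumption-laden illustration, not the G-JavaMPI policy: the function names, the cost model (traffic divided by bandwidth), and all numbers are hypothetical.

```python
def comm_cost(candidate, traffic, location, bandwidth):
    """Estimated seconds to exchange `traffic` (MB per peer) from `candidate`."""
    return sum(
        mbytes / bandwidth[candidate][location[peer]]
        for peer, mbytes in traffic.items()
    )


def best_site(candidates, traffic, location, bandwidth):
    """Pick the grid point with the lowest estimated communication cost."""
    return min(candidates, key=lambda s: comm_cost(s, traffic, location, bandwidth))


# Hypothetical communication pattern of one process: MB exchanged with each peer.
traffic = {"rank1": 400, "rank2": 50}
# Grid point where each peer currently runs.
location = {"rank1": "siteA", "rank2": "siteB"}
# Hypothetical available bandwidth in MB/s; same-site entries model the local network.
bandwidth = {
    "siteA": {"siteA": 100.0, "siteB": 5.0},
    "siteB": {"siteA": 5.0, "siteB": 100.0},
}

choice = best_site(["siteA", "siteB"], traffic, location, bandwidth)
print(choice)
```

With these numbers the process talks mostly to `rank1`, so placing it at the same grid point as `rank1` keeps the heavy traffic off the slow wide-area link.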
26
G-JavaMPI – System design
[Figure: grid points, each with a Gatekeeper and LS, hosting JVMs that exchange Java-MPI communication over the WAN; steps are labelled (1)–(3) and (1*)–(3*).]
• Migrating: restarting a new process through a Globus remote job request with delegated user credentials and Java-MPI job credentials
• (*) Some legacy messages are redirected during migration
• M: migration module, resides in each JVM
27
G-JavaMPI – Ongoing and future work
• The migration mechanism has been implemented
• Future work targets process migration policies
    • Goal: to offset performance pitfalls caused by heterogeneity through dynamic process migration
    • Sources of heterogeneity in grids: CPU, network, runtime environments, etc.
    • CPU and network heterogeneities cause long “blocking” periods in cooperative processes, thus limiting the system throughput
    • G-JavaMPI aims to detect and eliminate “blocking” through process migration (e.g., to migrate a “bottleneck” process to a faster node)
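The idea of detecting “blocking” and migrating the “bottleneck” process can be sketched as follows. The slides describe the policies as future work, so this is purely a hypothetical illustration: the function names, the 25% threshold, and the node speeds are assumptions, not G-JavaMPI behaviour.

```python
def blocked_seconds(compute_seconds):
    """Time each rank idles at a synchronization point, waiting for the slowest."""
    slowest = max(compute_seconds.values())
    return {rank: slowest - t for rank, t in compute_seconds.items()}


def plan_migration(compute_seconds, node_of, node_speed, threshold=0.25):
    """Suggest (rank, node) to migrate, or None if blocking is tolerable.

    The bottleneck rank is moved to the fastest unoccupied node, provided that
    node is faster than the current one. `threshold` is a hypothetical knob:
    the mean idle time as a fraction of the step time.
    """
    slowest = max(compute_seconds.values())
    idle = blocked_seconds(compute_seconds)
    mean_idle = sum(idle.values()) / len(idle)
    if mean_idle / slowest < threshold:
        return None
    rank = max(compute_seconds, key=compute_seconds.get)  # the bottleneck
    occupied = set(node_of.values())
    faster = [
        n for n in node_speed
        if n not in occupied and node_speed[n] > node_speed[node_of[rank]]
    ]
    if not faster:
        return None
    return rank, max(faster, key=node_speed.get)


# Hypothetical measurements: seconds of computation per step for each rank.
compute_seconds = {0: 10.0, 1: 10.0, 2: 30.0}
node_of = {0: "n0", 1: "n1", 2: "n2"}
node_speed = {"n0": 2.0, "n1": 2.0, "n2": 0.7, "n3": 2.4}  # relative speeds

plan = plan_migration(compute_seconds, node_of, node_speed)
balanced = plan_migration({0: 10.0, 1: 10.0, 2: 10.0}, node_of, node_speed)
print(plan, balanced)
```

A real policy would also have to weigh the cost of the migration itself against the expected reduction in blocking time.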
28
G-JavaMPI – Key references
• L. Chen, C.L. Wang, and F.C.M. Lau, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” Journal of Computer Science and Technology (China), Vol. 18, No. 4, July 2003, pp. 505-514.
• L. Chen, C.L. Wang, F.C.M. Lau, and R.K.K. Ma, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” International Workshop on Grid and Cooperative Computing (GCC-2002), December 26-28, 2002, Hainan, China, pp. 640-652.
29
JESSICA2: A Java-Enabled Single-System Image Computing Architecture
• JESSICA2 is a distributed Java Virtual Machine (DJVM) which consists of a group of extended JVMs running in a distributed environment to support true parallel execution of a multithreaded Java application.
• Java threads can freely move across node boundaries and execute in parallel to achieve more scalable high-performance computing using clusters
• The JESSICA2 DJVM provides standard JVM services that are compliant with the Java language specification, as if running on a single machine – Single System Image (SSI).
30
JESSICA2 Architecture
[Figure: a multithreaded Java program runs on one master and several worker JESSICA2 JVMs that share a Global Object Space; threads migrate between JVMs as portable Java frames, in JIT compiler mode.]
31
JESSICA2 Main Features
• Transparent Java thread migration
    • Runtime capturing and restoring of thread execution context
    • No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
    • Enables dynamic load balancing on clusters
• Full-speed computation
    • JITEE: cluster-aware bytecode execution engine
    • Operated in Just-In-Time (JIT) compilation mode
    • Zero cost if no migration
• Transparent remote object access
    • Global Object Space: a shared global heap spanning all cluster nodes
    • Adaptive migrating home protocol for memory consistency + various optimizing schemes
    • I/O redirection
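The general idea behind a migrating-home protocol is that an object's home node follows whichever node keeps writing to it, so later writes become local. The sketch below illustrates only that general idea under assumptions: the class, the consecutive-write rule, and the threshold of 3 are hypothetical and do not describe the actual JESSICA2 protocol.

```python
class SharedObject:
    """Tracks a home node and moves it to a node that keeps writing remotely."""

    def __init__(self, home, threshold=3):
        self.home = home
        self.threshold = threshold   # hypothetical tuning knob
        self.last_writer = None
        self.streak = 0              # consecutive remote writes by last_writer
        self.remote_writes = 0       # writes that had to travel to the home

    def write(self, node):
        if node == self.home:
            # Local write: no network traffic, and any remote streak is broken.
            self.last_writer, self.streak = None, 0
            return
        self.remote_writes += 1
        if node == self.last_writer:
            self.streak += 1
        else:
            self.last_writer, self.streak = node, 1
        if self.streak >= self.threshold:
            self.home = node         # migrate the home to the frequent writer
            self.last_writer, self.streak = None, 0


obj = SharedObject(home="node0")
for _ in range(10):
    obj.write("node3")
print(obj.home, obj.remote_writes)
```

In this run only the first three writes are remote; after the home moves, the remaining seven are local.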
32
Ray Tracing on JESSICA2 (64 PCs)
Linux 2.4.18-3 kernel (Red Hat 7.3)
64 nodes: 108 seconds
1 node: 4402 seconds (about 1.2 hours)
Speedup = 4402 / 108 ≈ 40.76
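The speedup follows directly from the two timings, and dividing it by the node count gives the parallel efficiency, a figure the slide does not state. A minimal calculation, using only the numbers above:

```python
t_serial = 4402.0    # seconds on 1 node (from the slide)
t_parallel = 108.0   # seconds on 64 nodes (from the slide)
nodes = 64

speedup = t_serial / t_parallel
efficiency = speedup / nodes   # fraction of ideal linear speedup

print(f"speedup: {speedup:.2f}")
print(f"parallel efficiency: {efficiency:.1%}")
```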
33
JESSICA – Key references
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers,” The 2003 International Conference on Parallel Processing (ICPP-2003), Taiwan, Oct. 6-10, 2003, pp. 465-472.
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support,” IEEE Fourth International Conference on Cluster Computing (CLUSTER 2002), Chicago, USA, September 23-26, 2002, pp. 381-388.
• M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture,” Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, pp. 1194-1222.
34
LOTS: Large Object Space on Grid
[Figure: grid nodes, each a LOTS / OS / H/W stack, sharing a Large Global Object Space.]
• A large software distributed shared memory (DSM) system for the Grid
• Provides a global object space larger than the process space (4 GB on a 32-bit CPU)
• Uses the local hard disk to store objects that have not been used recently
• Scope Consistency + Home Migration to reduce redundant data traffic
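The disk-backed part of this design can be illustrated with a small single-node sketch: an object store that keeps a bounded number of objects in memory and writes the least recently used ones to local disk. It is an illustration under assumptions only. The class name, the LRU eviction rule, and the use of `pickle` files are hypothetical, and the sketch ignores distribution and consistency entirely.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingObjectSpace:
    """Keeps at most `capacity` objects in memory; older ones are pickled to disk."""

    def __init__(self, capacity, directory):
        self.capacity = capacity
        self.directory = directory
        self.in_memory = OrderedDict()   # object id -> object, least recent first
        self.on_disk = set()             # ids currently stored only on disk

    def _path(self, oid):
        return os.path.join(self.directory, f"{oid}.obj")

    def _evict_if_needed(self):
        while len(self.in_memory) > self.capacity:
            oid, obj = self.in_memory.popitem(last=False)  # least recently used
            with open(self._path(oid), "wb") as f:
                pickle.dump(obj, f)
            self.on_disk.add(oid)

    def put(self, oid, obj):
        self.in_memory[oid] = obj
        self.in_memory.move_to_end(oid)  # mark as most recently used
        self.on_disk.discard(oid)
        self._evict_if_needed()

    def get(self, oid):
        if oid in self.in_memory:
            self.in_memory.move_to_end(oid)
            return self.in_memory[oid]
        if oid not in self.on_disk:
            raise KeyError(oid)
        with open(self._path(oid), "rb") as f:   # bring the object back from disk
            obj = pickle.load(f)
        self.put(oid, obj)
        return obj


with tempfile.TemporaryDirectory() as tmp:
    space = SpillingObjectSpace(capacity=2, directory=tmp)
    for i in range(4):
        space.put(i, [i] * 3)
    resident_before = list(space.in_memory)
    value = space.get(0)                 # object 0 was spilled; it is reloaded
    resident_after = list(space.in_memory)

print(resident_before, value, resident_after)
```

The store holds four objects although only two fit in memory at once, which mirrors how an object space can exceed the addressable process space.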
35
Summary
• Performance
    • G-JavaMPI and JESSICA establish extensible grid platforms (good for computation-intensive applications)
    • Process/thread migration enables performance optimization and load balancing
    • LOTS supports a shared-memory programming environment on a large object space (easier to develop data grid applications)
• Reliability
    • G-JavaMPI migrates processes from failed machines
    • SLIM helps construct platforms for failover
• Convenience
    • G-JavaMPI, JESSICA, and LOTS enable users to harness distributed resources via traditional means
    • SLIM and ODGPC simplify grid point management
36
Conclusion
• Grid and utility computing are relatively new paradigms that deserve further investigation
• We address the performance, reliability, and user convenience issues in grid/utility computing
• Our advanced grid computing platform (consisting of G-JavaMPI, JESSICA2, LOTS, and SLIM/ODGPC) is geared for deployment in the HKGrid to ease the adoption of Grid technologies.
37
Q&A
Thank you!
The SRGers (Photo: 12/2003)
38
Reference
• Hong Kong Grid
• http://www.hkgrid.org/
• Grid Computing Research Portal
• http://grid.csis.hku.hk/
• The HKU Systems Research Group
• http://www.srg.csis.hku.hk
• VEGA Project
• http://vega.ict.ac.cn/
• The HK Supercomputing Directory
• http://www.hkhpc.org/~SuperDir/
39