Abstract
EFFECTIVE OPTIMIZATION TECHNIQUES FOR A PARALLEL FILE SYSTEM
by Raghvendran M
Significant work has been done in evolving parallel IO architectures, IO interfaces and other
programming techniques. However, only a few mechanisms currently exist that bridge the
gap between the IO architectures and the programming abstractions. A Parallel File System
is the prime mechanism for delivering high-performance parallel IO on multiprocessor
machines for a wide class of scientific and engineering applications.
With the evolution of commodity clusters (called High Performance Computation or HPC
clusters) as a cost-effective computing platform for parallel computing, it is necessary to
have an optimized and portable parallel file system to satisfy applications’ IO needs. The
existing parallel IO mechanisms on such clusters, based on NFS, provide dismal IO performance due to an architectural limitation that disallows de-clustering of file data, as well as due to the heavyweight nature of the protocol. Owing to mismatched semantics between the application IO characteristics and parallel IO architectural features, several other IO architectures based on shared or cluster file systems also perform badly in the cluster-based
parallel computing environment. The parallel file system represents an appropriate split in
the semantics in the parallel application IO path, where parallel IO mechanisms and other
optimization techniques could be implemented at the IO platform level and exported
through feature-rich platform-independent interfaces.
In spite of a significant amount of research in parallel IO techniques, portable parallel file systems do not incorporate these findings and are not commonly used. Many of the optimization techniques for parallel IO in the literature, such as prefetching, have had no general-purpose implementations, nor have they been validated for a wide class of
application workloads or access patterns. There are many issues (such as timeliness) that
need investigation for prefetching to be effective. The incorporation of parallel IO optimization techniques in commodity cluster setups has not been satisfactory.
We establish the parallel file system as the right abstraction for parallel IO on a commodity
cluster from the performance and management perspectives. We also evaluate various
optimization techniques for a parallel file system on a commodity cluster, with the objective of providing a fast scratch space on a real cluster-based supercomputer such as the C-DAC PARAM Padma (ranked 171st in the July 2003 edition of the TOP500 list [35]).
We extend a data prefetching technique for the parallel file system architecture and
demonstrate its effectiveness with a policy-based feedback loop. Other optimization techniques for improving a parallel file system's performance are also investigated. This thesis makes contributions in the analysis and design of these optimization techniques for a parallel file system, such as an online predictive prefetching mechanism with adaptive policy control, an adaptive flow control mechanism for supporting collective calls from the architectural perspective, and techniques for managing large data structures and efficient file processing in the file system design.
A parallel file system incorporating the above-stated optimizations has been implemented on C-DAC's PARAM Padma, a one-teraflop 54-node cluster-based parallel processing
system. These optimizations show significant improvement for the targeted application IO
workloads on this cluster.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
GLOSSARY
1. INTRODUCTION
   1 Parallel application IO characteristics
   2 IO interfaces and other abstractions
   3 Evolution of cluster based parallel IO architectures
      3.1 NFS based Parallel IO architecture
      3.2 Evolution of Parallel File System
   4 Performance optimization strategies for parallel file system
      4.1 Adaptive prefetching
      4.2 Adaptive IO pipeline
      4.3 Out of core computation
      4.4 Optimizations in the code design
   5 Contributions of the thesis
   6 Thesis organization
2. BACKGROUND AND RELATED WORK
   1 Qualitative assessment of parallel IO architectures
   2 Parallel file systems
   3 Application IO workload characteristics
   4 Optimizations
      4.1 Discussion on prefetching techniques
3. IO WORKLOAD AND ITS FITNESS TO PARALLEL FILE SYSTEM (PFS) ARCHITECTURE
   1 Workload characteristics
   2 Generating the workload
      2.1 b_eff_io
      2.2 BTIO
   3 Sun PFS
   4 Sun PFS software architecture
   5 Meeting the requirements based on workload characteristics
   6 Scope for improvement in Sun PFS
4. C-DAC PFS: OPTIMIZATIONS
   1 Adaptive predictive prefetching
      1.1 Prefetch mechanism
   2 Adaptive communication buffer resizing
      2.1 The I/O time cost model
   3 Optimizations in the Sun PFS implementation
      3.1 Design of large data structure
      3.2 Unified buffering in the IO server
5. RESULTS
   1 Test infrastructure
   2 Results of the optimizations
      2.1 File System Characterization
      2.2 NFS experiment results
      2.3 Distributed-NFS experiment results
      2.4 Prefetch mechanism and framework results
      2.5 Adaptive communication buffer optimization results
      2.6 Design of large data structures – enhanced R-B trees – results
      2.7 Unified buffering in IO server results
   3 Integrated version results
6. CONCLUSION & FUTURE WORK
   1 Future Work
REFERENCES
LIST OF FIGURES
FIGURE 1: THREE TIER ARCHITECTURE FOR STORAGE
FIGURE 2: POTENTIAL PARALLELISM IN A HPC CLUSTER
FIGURE 3: SUN PFS ARCHITECTURE
FIGURE 4: SOFTWARE COMPONENTS OF SUN PFS
FIGURE 5: THE DATA LAYOUT OF A PFS FILE
FIGURE 6: C-DAC PFS ARCHITECTURE
FIGURE 7: PREFETCHING FRAMEWORK AND PREDICTOR (ARCHITECTURE SCHEMATIC)
FIGURE 8: LZ EXAMPLE
FIGURE 9: PPM EXAMPLE
FIGURE 10: PREFETCH POLICY
FIGURE 11: SYSTEM MODEL FOR COLLECTIVE IO
FIGURE 12: NODE OF THE MODIFIED RED-BLACK TREE
FIGURE 13: INTERNAL STRUCTURE OF IO SERVER
FIGURE 14: COMPARISON OF NFS AND PFS FOR B_EFF_IO. READ/WRITE WORKLOADS FOR TYPE 0 (75% OF TARGETED WORKLOAD) SHOWN. FOR OTHERS, ONLY THE READ WORKLOAD IS SHOWN. PFS HAS 4 IO SERVERS
FIGURE 15: HIT RATE FOR B_EFF_IO IN PURE PREFETCH MODE
FIGURE 16: HIT RATE IN BTIO IN PURE PREFETCH MODE
FIGURE 17: HIT RATE IN PARKBENCH (MATRIX 3D) IN PURE PREFETCH MODE
FIGURE 18: PERFORMANCE READINGS OF B_EFF_IO ON PFS WITH PREFETCH MECHANISM. AS WITH PREVIOUS READINGS, READ & WRITE WORKLOAD FOR ONLY TYPE 0 IS SHOWN. PFS HAS 2 IO SERVERS
FIGURE 19: IMPACT OF ADAPTIVE BUFFER OPTIMIZATION ON B_EFF_IO BENCHMARKS THAT HAVE COLLECTIVE IO CALLS
FIGURE 20: IOTEST BENCHMARK TO SHOW IMPACT OF R-B TREE ORGANIZATION OF BLOCK TRANSLATION ENTRIES IN PFS
FIGURE 21: PERFORMANCE FIGURE OF INTEGRATED C-DAC PFS AND BASE PFS FOR B_EFF_IO
LIST OF TABLES
TABLE 1: DESIGN TRADEOFFS IN CLIENT FILE SYSTEM
TABLE 2: BTIO RESULTS ON 16 NODES; PFS HAS 4 IO SERVERS
TABLE 3: BTIO RESULTS ON 16 NODES; PFS CONFIGURATIONS WITH 4 IO SERVERS
TABLE 4: BTIO RESULTS ON 32 NODES; PFS CONFIGURATIONS WITH 4 IO SERVERS
ACKNOWLEDGMENTS
Firstly, my heartfelt gratitude to my advisor, Prof. K Gopinath, who through these years of association has made me a better person. His caring nature, enthusiasm and plentiful encouragement have made this thesis see the light of day.
Handling work at the office as well as the research has never been easy. I thank Mr. Mohan Ram, my supervisor at C-DAC, and Mr. G. L. Ganga Prasad, my advisor at C-DAC, for providing the 'space' and the resources.
Much of the engineering that has gone into making C-PFS work on a large supercomputing system at C-DAC has been due to my 'Storage' teammates at C-DAC. Thanks to Rolland & Rashmi for the hard work and thoughtful discussions. Thanks to Viraj, Sundar and Bala for the unstinted support.
Lastly, my parents and in-laws have been pillars of support, putting up patiently with my excuses for not finishing the thesis. I dedicate this thesis to my wife – Shyam.
GLOSSARY
Sequential: A sequential request is one that begins at a higher offset than the point where the previous request from that compute node ended.
Consecutive: A consecutive request is a sequential request that begins precisely where the previous request ended.
Access interval: The number of bytes between the end of one request and the beginning of the next.
Concurrent sharing: A file is concurrently shared if two or more processes have it open at the same time. Depending on the mode of the open, it could be read sharing or write sharing; when the modes are not the same, the file is both read- and write-shared.
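To make these definitions concrete, the following minimal C sketch (illustrative only, not taken from the C-PFS sources) classifies each incoming request against the previous request from the same compute node; the offsets in main are made up.

#include <stdio.h>

/* Per-compute-node state: where the previous request ended. */
typedef struct {
    long prev_end;   /* file offset just past the last byte of the previous request */
} node_state_t;

/* Classify a request [offset, offset+len) against the node's previous request. */
static const char *classify(node_state_t *st, long offset, long len)
{
    const char *kind;
    if (offset == st->prev_end)
        kind = "consecutive";           /* begins exactly where the last one ended */
    else if (offset > st->prev_end)
        kind = "sequential";            /* begins at a higher offset               */
    else
        kind = "non-sequential";
    /* Access interval: bytes between the end of one request and the start of the next. */
    printf("%s request, access interval = %ld bytes\n", kind, offset - st->prev_end);
    st->prev_end = offset + len;
    return kind;
}

int main(void)
{
    node_state_t st = { 0 };
    classify(&st, 0,     4096);   /* consecutive (trivially, from offset 0) */
    classify(&st, 4096,  4096);   /* consecutive                            */
    classify(&st, 16384, 4096);   /* sequential, access interval 8192       */
    classify(&st, 8192,  4096);   /* non-sequential                         */
    return 0;
}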
Chapter 1
1. INTRODUCTION
The evolution of various abstractions in parallel IO has been a result of tension between
ease of programming and machine architecture efficiency. Many abstractions have evolved
to exploit the progress in the machine architectures while some others provide more
expressive programming models to organize data as per the application’s computing logic.
These have been in the form of low-level IO interfaces such as HPF IO [1], Scalable IO
Initiative Low-level API [2], MPI-IO [3] or Data libraries/models such as NetCDF [4],
HDF5 [5], River [6] apart from techniques such as out-of-core computations and other data
management tools. Various mechanisms and techniques have evolved that bridge the two requirements without much compromise in program organization semantics or the underlying architectural efficiency. But techniques requiring serious user intervention, such as adapting an application to a new API or paradigm, have not been very successful, as they typically require complex rework.
From the architecture perspective, the fundamental technique for providing high
performance IO has been de-clustering of file data across multiple disks; this leads to
complex data layouts and organization. Even on a particular physical organization of the
disks, the parametric design space for providing high performance parallel IO is often so
large that generally a compromise is made to efficiently support only a few workloads of
interest. So, to get performance, the application programmer is forced to tune the application access pattern to match the system's optimized workload pattern. Often the implementation of the IO software stack on such a setup is ad hoc (being too tightly integrated with the application, etc.), making it difficult to export the software features and functionality at the general-purpose cluster platform level (e.g. multitasking, multi-user support).
As many of the interfaces and abstractions for parallel IO have evolved in the application
domain, their implementations tend to be too tightly coupled with that application domain
and may not efficiently support general application-class workload access patterns.
We discuss the usefulness of the parallel file system in decoupling the parallel IO
optimization techniques from the abstractions. We incorporate new optimization techniques
to improve the PFS performance for common workload access patterns. The results of
these optimizations have been experimentally verified with the representative scientific
workload on the target platforms – C-DAC’s cluster-based PARAM machines [15] (latest
being PARAM Padma, ranked 171st in the July 2003 edition of the TOP500 list). The parallel file
system with the optimizations has been deployed on these systems.
PARAM machines are cluster-based supercomputers having PARAMNET-II [16] as well as
Gigabit Ethernet as the cluster interconnects. The communication substrate for these
clusters is based on C-DAC’s user-level lightweight protocol implementation – KSHIPRA
[33]. KSHIPRA provides MPI-2 [17] as the distributed programming abstraction.
1 Parallel application IO characteristics
Application level characterization studies [7] indicate that though wide variability exists in
the access patterns among the applications, they are typically ‘patterned’. They exhibit a
certain structure and regularity in the access patterns – these are generated when the
application threads use high-level IO interfaces. These interfaces commonly provide
complex data abstractions such as multidimensional structures and arrays. Previous work in
this parallel application IO workload analysis has leveraged the data access pattern
information towards the design of the various parallel file system optimization techniques
[8] as well as refinement of interfaces and abstractions [2] [3].
2 IO interfaces and other abstractions
Many optimization techniques require extra information from the application for their effective operation. Information on non-contiguous IO, collective IO, non-blocking IO and
other hints could be used by the underlying system to improve the IO performance by
matching the appropriate lower level system primitive to the IO request. Many such
interfaces for parallel IO management, at various levels, have been proposed – HPF, SIO
LLAPI, MPI-IO.
HPF IO is a set of extensions to Fortran through which application developers can provide
hints to the compiler on the data distribution and loop parallelization. This interface
supports data parallel applications by providing a notion of global data structures with
facilities to perform data decomposition on them.
SIO (Scalable IO Initiative) provides a low-level API (LLAPI) with minimal features for
parallelism. Targeted at IO subsystem developers, it provides primitives for various IO
access patterns and consistency models. No major parallel IO software has been developed
using SIO LLAPI.
MPI-IO has been the most popular interface in the distributed memory parallel machines
for program development. Integration with MPI [17] has provided it with a rich set of
abstractions such as groups, communicators, send/receive model that could be used in the
IO programming also. Some of the useful features are – construction of non-contiguous
structure types using ‘data types’, support for both independent and collective modes of
operations, non-blocking and split-collective IO calls.
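As an illustration of these MPI-IO features (a generic sketch, not the thesis's own benchmark code), the following C program sets a strided file view using a derived datatype and performs a collective write; the file name, element counts and stride are arbitrary choices.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;                   /* doubles written per process */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    /* Derived datatype: each process sees an interleaved (strided) view of the file. */
    MPI_Datatype filetype;
    MPI_Type_vector(count, 1, nprocs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "pfs_demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The 'view' is computed at run time; the displacement staggers the processes. */
    MPI_Offset disp = rank * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all processes participate, letting the library optimize the IO. */
    MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}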
In all the above approaches, the ‘view’ or the application data layout of the file is not stored
along with the data but computed during the execution, used and subsequently discarded.
Hence, data access optimizations to the files cannot be performed by any other application
or library, unless the application is running. High-level libraries evolved as the need for
3
complex file structures was felt as compared to plain byte-sequential UNIX file model.
Some of the high level libraries in use are NetCDF, HDF.
Data models represent the object-oriented approach to data management. Non-trivial data models are typically tied to their application domains, though the file metadata can be accessed and interpreted. Data is accessed by a name (rather than a pathname), can be annotated and has a standard layout.
NetCDF provides annotated rectangular arrays of basic types and is widely used in
atmospheric science applications. HDF5 gives a notion of ‘dataspace’ (dataset structure
without a type) with an abstraction for groups, facilitating namespace management.
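For instance, a minimal NetCDF-C fragment of the kind such libraries support looks as follows; the file, dimension and variable names here are invented for illustration and are not tied to any application discussed in this thesis.

#include <netcdf.h>
#include <stdio.h>

int main(void)
{
    int ncid, dimid, varid;
    float temp[4] = { 21.5f, 22.0f, 22.4f, 21.9f };

    /* Data is addressed by name, not by byte offset, and can carry annotations. */
    nc_create("sample.nc", NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "x", 4, &dimid);
    nc_def_var(ncid, "temperature", NC_FLOAT, 1, &dimid, &varid);
    nc_put_att_text(ncid, varid, "units", 7, "celsius");   /* attribute / annotation */
    nc_enddef(ncid);

    nc_put_var_float(ncid, varid, temp);
    nc_close(ncid);

    printf("wrote sample.nc\n");
    return 0;
}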
3 Evolution of cluster based parallel IO architectures
Given the commodity nature of the cluster, the de-facto storage access mechanism in
practice in such an environment is NFS.
3.1 NFS based Parallel IO architecture
NFS provides stateless file sharing and hence needs protection mechanisms for correct
operation during concurrent operations where certain consistency guarantees are needed.
This is not provided by the standard NFS client.
The fundamental architectural problem with using standard NFS components, in delivering
the parallel IO performance, is the lack of NFS’ ability to aggregate the distributed storage.
Lack of control on caching & buffering, as well as lack of scalable & integrated locking mechanisms in NFS, coupled with a typically heavyweight implementation, makes it inefficient for providing parallel IO. This is shown in subsequent sections. Performance-enhancing
techniques [9] notwithstanding, the NFS based system continues to be network limited for
parallel IO workload. Some work has been done using NFS protocol but it requires
customized NFS clients to provide parallel IO [10] to avoid the network bandwidth
limitation.
A commodity setup that provides logical de-clustering of files has also been tried for providing parallel IO with NFS. In this setup, called ‘distributed NFS’, multiple NFS servers
on different nodes serve the same storage presented by a shared or a cluster file system such
as SUN QFS, VERITAS CFS etc. It has been shown [34] that such commodity based
solutions fail due to semantic mismatch between application expressed concurrency and the
strong consistency provided by the shared file systems.
This brings out the motivation for exploring other mechanisms for providing parallel IO in
commodity cluster environment.
3.2 Evolution of Parallel File System
A parallel file system represents a system level abstraction that provides key benefits of the
parallel IO techniques at the multiprocessor platform level, typically optimized for a class of
workload. It provides structure for various smart algorithms as well as other techniques such
as caching, buffering and prefetching to work together and provide performance for the
targeted application workload. These techniques have been shown to be very effective in
accelerating specific application workloads [11]. If these techniques are not adaptive with
respect to the changing access patterns and architectural behavior, heavy performance
penalty is incurred for non-conforming workloads. But a file system structure often
provides opportunity for supporting various workloads by providing flexible mechanisms to
change the file system behavior (e.g. Intel PFS [37]). A file system view can also leverage
common storage management tools and practices for managing parallel IO storage space.
The PFS implementations, by design, distribute the data and aggregate the concurrent IO
data paths to the distributed storage to provide higher bandwidth. The cluster based PFS
implementations have typically been provided by the vendors but are non-portable. The
commercial implementations are tightly integrated with other cluster components and
deliver very high IO throughput to the scientific applications. But these implementations are
proprietary and vertically integrated and hence are not portable across platforms. Given the
trend of Beowulf-like clusters that are built out of commodity components, in particular for scientific computing, there is a compelling need for portable parallel IO implementation mechanisms. Implementations such as the Parallel Virtual File System (PVFS) [12] and Sun PFS [13] have been a step in that direction.
4 Performance optimization strategies for parallel file system
Because of the large design space, parallel file systems typically target a specific workload.
This provides a lot of opportunity for optimizations to support different workloads.
The current work provides a parallel file system for C-DAC’s tera-scale supercomputer
system (the PARAM Padma system, ranked 171st in the July 2003 edition of the TOP500 list). Due to
the time required and complexity of an ab initio project on a parallel file system, an open
source parallel file system from Sun Microsystems has been used as a starting point. But the
open source version had stability problems in terms of memory leaks and race conditions
and hence was unusable as such. The effort in debugging this is significant as the parallel file
system is part of a real operational system.
Many optimizations for the parallel file system have become possible both due to the
characteristics of the targeted workload as well as opportunities for increasing the
sophistication of the base code implementation.
4.1 Adaptive prefetching
Prefetching has been used in multitude of environments to tolerate the latency of data
arrival – CPU-memory complexes, file and storage systems, applications etc. As indicated
above, the IO access patterns in the parallel applications and in the targeted workloads in
particular, though complex, tend to be structured and regular providing an opportunity for
prefetching of data. Usefulness of this technique by discerning the application IO access
pattern has been demonstrated to some extent in the PFS community [11] [14]. But the
efficacy of this technique to actually tolerate disk latency (by eliminating disk accesses) has
not been demonstrated in a production environment. Factors such as the timeliness of the prefetch and the provision of a feedback loop in the prefetching mechanism, which significantly impact performance in practice, have not been studied in the literature. We have developed a novel adaptive predictive prefetching technique based on text compression methods for
this purpose.
4.2 Adaptive IO pipeline
In a cluster with distributed disks, it is also important to maintain an adaptive pipeline
between the clients and the disks so that IO channels from the clients to the disks are
efficiently utilized. This technique, similar to the memory interleaving in computer
architectures, is effective in tolerating the disk latency particularly in the ‘de-clustered file’
environment for certain ‘collective’ workloads. The current work demonstrates the
performance improvement in parallel IO due to this mechanism.
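A rough sizing intuition for such a pipeline, analogous to memory interleaving, is sketched below; the latency and transfer figures are invented, and the calculation is only a back-of-the-envelope illustration, not the adaptive flow-control algorithm developed later in the thesis.

#include <stdio.h>
#include <math.h>

/* Back-of-the-envelope pipeline sizing, analogous to memory interleaving:
 * to keep an IO channel busy, the client should keep roughly
 *   ceil(per-request latency / per-request transfer time)
 * chunk requests outstanding per IO server.  All numbers below are assumed. */
int main(void)
{
    double latency_ms  = 8.0;   /* network + disk positioning per chunk (assumed) */
    double transfer_ms = 2.0;   /* time to move one chunk's data (assumed)        */
    int    io_servers  = 4;

    int per_server = (int)ceil(latency_ms / transfer_ms);
    printf("outstanding requests per server: %d\n", per_server);
    printf("total pipeline depth across %d servers: %d\n",
           io_servers, per_server * io_servers);
    return 0;
}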
4.3 Out of core computation
This is typically used when data size is larger than the main memory and the ‘paging’ or
‘staging’ is performed by the application itself depending on its computation locality. Since
this is an application level technique, we do not consider this technique further.
4.4 Optimizations in the code design
These optimizations assume significance in a real system dealing with large data and
metadata sets in a file system scenario. Inadequate attention to these aspects can hinder
the benefits proffered by other architectural optimizations. These aspects have been handled
inefficiently in the Sun PFS – the base PFS code.
4.4.1 Handling large data structures efficiently
In a file system scenario, data structures storing block descriptions are typically large in number (>10,000) and require frequent access. The data structure organization needs to handle lookup and extent-building operations efficiently.
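As one possible shape for such a structure (the actual node of the enhanced red-black tree used in C-PFS appears only in Figure 12 and may differ), a hypothetical node keyed by logical block number that also records a run length, so that adjacent translations can be coalesced into extents, could look like this:

#include <stdio.h>

enum rb_color { RB_RED, RB_BLACK };

/* Hypothetical block-translation node: keying the tree by logical block number
 * gives O(log n) lookup over >10,000 entries, and the run length lets adjacent
 * translations be merged into extents. */
struct block_xlate_node {
    long logical_block;        /* key: block number within the subfile      */
    long physical_block;       /* where the run starts on the local disk    */
    long run_length;           /* number of contiguous blocks (extent size) */

    enum rb_color color;       /* red-black balancing metadata              */
    struct block_xlate_node *left, *right, *parent;
};

/* Extent building: a new translation that starts right after an existing run,
 * and is physically contiguous with it, simply extends run_length instead of
 * inserting a new node. */
static int try_extend(struct block_xlate_node *n, long lblk, long pblk)
{
    if (lblk == n->logical_block + n->run_length &&
        pblk == n->physical_block + n->run_length) {
        n->run_length++;
        return 1;              /* coalesced into the existing extent */
    }
    return 0;                  /* caller must insert a fresh node    */
}

int main(void)
{
    struct block_xlate_node n = { .logical_block = 100, .physical_block = 5000,
                                  .run_length = 3, .color = RB_BLACK };
    printf("extend 103->5003: %s\n", try_extend(&n, 103, 5003) ? "coalesced" : "new node");
    printf("extend 200->9000: %s\n", try_extend(&n, 200, 9000) ? "coalesced" : "new node");
    return 0;
}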
4.4.2 Buffer handling in the file system
In the parallel file system environment, concurrent accesses to the same block (not
necessarily the same byte range) could generate multiple IOs on the same block, leading to multiple copies of the same block in the file-processing module and wasting memory as well as
processing time. Consistency is generally not an issue as the IO issuers typically impose the
IO order.
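The idea behind keeping one copy per block can be sketched as follows: buffered blocks are looked up by (file id, block number), so a second request for the same block reuses the single copy. This toy hash table is purely illustrative and is not the C-PFS IO server implementation.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define TABLE_SIZE 64           /* toy hash table; a real server would size this properly */

struct block_buf {
    int  in_use;
    long file_id, block_no;
    int  refcount;              /* how many outstanding requests share this copy */
    char data[BLOCK_SIZE];
};

static struct block_buf table[TABLE_SIZE];

static struct block_buf *get_block(long file_id, long block_no)
{
    unsigned h = (unsigned)(file_id * 31 + block_no) % TABLE_SIZE;
    /* Linear probing for brevity. */
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        struct block_buf *b = &table[(h + i) % TABLE_SIZE];
        if (b->in_use && b->file_id == file_id && b->block_no == block_no) {
            b->refcount++;                    /* reuse the existing copy */
            return b;
        }
        if (!b->in_use) {
            b->in_use = 1; b->file_id = file_id; b->block_no = block_no;
            b->refcount = 1;
            memset(b->data, 0, BLOCK_SIZE);   /* real code would read it from disk */
            return b;
        }
    }
    return NULL;                              /* table full; real code would evict */
}

int main(void)
{
    struct block_buf *a = get_block(7, 42);   /* first request creates the block buffer */
    struct block_buf *b = get_block(7, 42);   /* second request shares the same buffer  */
    printf("same buffer: %s, refcount=%d\n", a == b ? "yes" : "no", b->refcount);
    return 0;
}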
5 Contributions of the thesis
The contributions of the thesis are as follows:
• Demonstrating the performance benefits of a user-level parallel file system1 as compared to traditional IO architectures for parallel IO, qualitatively and quantitatively.
• Development of novel optimization techniques for cluster based parallel IO – adaptive online predictive prefetching and buffer control.
• Newer designs for the handling of large data structures and the control flow for processing file/subfile requests in parallel file systems, in general.
6 Thesis organization
Chapter 2 provides the background material and related work on the parallel IO
architectures, application workload as well as PFS optimizations.
Chapter 3 discusses the targeted workload in detail.
1 The parallel file system is implemented as user-level daemons with a VFS module to maintain binary compatibility with UNIX IO applications. IO through the MPI-IO interface, for high performance, uses MPI for IPC, whose implementation on PARAM machines is provided by KSHIPRA, the user-level protocol suite comprising user-level implementations of Active Messages and the Virtual Interface Architecture.
Chapter 4 provides the details about the base parallel file system (the Sun PFS) – its
architecture and organization, data layout, high-level control flow of the processing of
file/subfile requests.
Chapter 5 provides the details of the new optimizations developed for the base PFS system. C-DAC PFS (C-PFS) is the base parallel file system enhanced with these optimizations.
Chapter 6 discusses the results of the base PFS, optimized PFS as well as the NFS based
parallel IO architecture.
The thesis ends with discussion on future work.
Chapter 2
2. BACKGROUND AND RELATED WORK
We now present a qualitative performance assessment of cluster based parallel IO
architectures that provide MPI-2 messaging infrastructure (similar to PARAM machines).
1 Qualitative assessment of parallel IO architectures
This section discusses the various software related factors that affect the performance in the
storage architecture and the need to consider the performance aspects at every software
layer while architecting the storage solution.
Figure 1: Three tier architecture for storage (cluster compute or application nodes – LAN – storage servers – SAN – storage)
The architecture we consider for discussion is given in Figure 1.
Parallelism while performing single-application IO is critical to HPC clusters, while system throughput is important for a home-directory workload. A production facility should support both these workloads, among many others, to be useful. Figure 2 depicts the various software layers.
Figure 2: Potential parallelism in a HPC cluster (software layers: application layer – MPI/UNIX messaging layer – client file system – storage server layer (NFS server, UFS server or shared file system) – IO through various paths)
To achieve end-to-end parallelism in IO, all layers – from the application to the storage boxes – must provide sufficient primitives to express concurrent IO operations. The same applies to fault tolerance and load-balancing features: the issue needs to be addressed in all the layers. The description of Figure 2 is given below:
Application layer. The application needs to perform IO in parallel using a parallel IO
interface, say a messaging cum IO layer such as MPI-IO interface or distributed application
using UNIX-IO interface. We consider MPI as the messaging layer for further discussion.
MPI/UNIX layer. The MPI implementation such as MPICH [18] needs to translate
application’s parallel IO expressed in the programming interface layer into underlying file
systems’ parallel operations. There could be a loss in parallelism in this layer if the
underlying file system does not provide sufficient correctness guarantees for concurrent operations. MPI needs to hold locks on the appropriate resources to ensure application correctness, thus serializing some of the parallel accesses. The most conservative approach is to collect the IO into a single process and perform the IO there.
Various implementations of MPI provide different degrees of parallelism, some of them
dictated by the features of the underlying file system.
• The MPI implementation serializes the IO by letting a master process (on one of the client nodes) collect the data and perform IO on the underlying client file system – say, an NFS file system. So, only one client process performs IO on the underlying file system.
• An MPI implementation such as MPICH with ROMIO [9] tries to retain the application-expressed parallelism by using parallel IO techniques. ROMIO is optimized for performing parallel IO on NFS [19]. For a single application-level IO request consisting of widely distributed non-contiguous byte ranges, ROMIO lets multiple client nodes perform IO in large sequential chunks. The data is then exchanged between client nodes. Some overhead could be incurred, as most of the data read could go unused in such scenarios.
• An MPI implementation that performs true parallel IO on a parallel file system, e.g. the IBM MPI implementation on IBM GPFS [20].
Client file system layer. The client file system provides the file system namespace and is the
visible layer of the storage. The ‘directory-tree’ view of the file system is provided at this
layer. In the cluster environment, the client file system provides a single file system
namespace across all the clients, provides a ‘specified’ (could be UNIX-like, Andrew File
System [21] -like session, causal, relaxed etc) consistency model integrated with locking and
error model. In short, the behavior of the file system is provided by the client-access file
system. The key deployment aspects of the client file system are performance & file system
availability. The performance can be characterized into system throughput (ST) and single
application performance (SAP). A key aspect of the system flexibility is the fraction of ST
achieved by SAP. File system availability implies maintaining the file system view on the
client nodes in the face of server failure(s)2. Client file system designs fundamentally differ in
these terms:
• Data layout of the file system metadata & data on the multiple servers (block-level or file-level striping).
• Method of access (exclusive access of a data portion from a server, or load balanced).
• Visible storage portion on servers. If the entire storage portion is visible on all the servers, this could be used as a fail-over method when a server fails. Depending on the design of the client file system and the server overhead for maintaining the shared state, the 'shared storage visibility' could be used for adaptive load balancing of client file system requests on the servers even when there are no server failures. This visibility could be achieved through a Storage Area Network (SAN) configuration or by replication methods.
The following table lists various tradeoffs in the client file system. These representative
application classes are assumed – Meta Data Intensive (MDI) and Content Data Intensive
(CDI). Both the applications contain multiple processes. MDI uses large amount of small
files and mainly performs metadata operations – such as a compilation job. CDI uses a few large files and performs large amounts of file IO.
2 Client node failure, per se, is not included in this scope because, as per the specification, the client file system should be available on all client nodes, and restarting or migrating the application computation is the responsibility of the cluster manager or the application itself. The error model of the client file system would, however, govern the state of the file system data and metadata when a client node crashes.
Table 1 contrasts block-level and file-level layout & access under exclusive and shared server control of storage portions.

Exclusive server control on storage portions – block level layout & access:
* Blocks of a single file are striped across servers; all the servers serve accesses on a single file.
* Best layout and access for CDI-SAP and also CDI-ST.
* Better for MDI-SAP and MDI-ST.

Exclusive server control on storage portions – file level layout & access:
* Different servers serve different portions of the global file system tree, or a single server serves all the files for a set of client nodes.
* Best layout and access for MDI-ST. For better results for MDI-SAP, the IO configuration needs to be integrated with the cluster manager.
* For better results in CDI-ST, the IO configuration needs to be integrated with the cluster manager. Not good for CDI-SAP.

Shared server control on storage portions (both layouts):
* Same advantage as above in the respective cases.
* Shared server control could be used to provide storage availability by a fail-over technique in the respective cases. Re-mapping of storage portions to a different server needs to be done by the fail-over mechanism and communicated to the cluster manager. If the shared control is used adaptively, not just for the fail-over case, it could be used for load balancing of the client workload on the server layer. This adaptive mechanism will be the best for ST & SAP with CDI or MDI. It is made possible by the SAN shared storage.

Table 1: Design tradeoffs in client file system
In special cases, even a single node can saturate the IO bandwidth of the server layer, such as when a large compute server runs an IO-intensive application and has a trunked Gigabit Ethernet interface to the LAN.
The various possible deployments are discussed next.
• NFS as client access file system. This belongs to the 'exclusive server control on storage portions' & 'file level layout and access' category. Either the global file system is grafted by mounting various NFS file systems (case 1) or a set of client nodes mounts the entire file system from a single node (case 2).
  o Case 1: MDI ST performs better. SAP performs better provided files are distributed across all the servers. Typically this setup is not backed by a SAN and hence fail-over is not possible; SAP performance is bounded by single-server performance.
  o Case 2: Provided the client compute nodes are selected such that all the servers are 'covered', CDI-SAP is as good as CDI or MDI ST, where multiple servers can serve the same file. The NFS server layer typically sits on cluster or shared file systems that provide the global file system view. This scenario arises as the NFS protocol is IP based and there is no virtual IP; hence a set of client nodes mounts from a server.
• PFS as the client access file system. This belongs to the 'exclusive server control on storage portions' & 'block level layout and access' category. Each file IO uses scatter-gather operations on the various servers to complete the IO operation. A server failure renders the whole file system inaccessible, unless another server takes over the portion served by the failed server.
• Shared-control file system. The client file system does not mount from a single server. The storage is accessed as entire files or blocks. For the sake of management, certain portions of storage are accessed from a certain server. Since storage can be accessed from any server, alternate servers could serve storage for load balancing or fail-over.
Storage server layer: These servers provide the access to the storage. Depending on the
configuration of the storage, they could export local disks or shared storage volumes.
Storage servers could in turn stripe the data across the Host Bus Adapters (HBAs) to
multiple storage boxes. Storage boxes could implement RAID or other forms of data
organization for performance/redundancy mechanisms. Various configurations of the
server layer are:
• Each of the servers in the server layer exports its local file system. The clients construct the global view by mounting them appropriately. The visibility of the local file systems to other servers depends on the SAN configuration.
• The storage server layer exports shared storage space as a cluster or shared or SAN file system. Such file systems typically extend the single-node file system behavior semantics to provide UNIX semantics on the shared file system at the server layer. NFS servers run on the individual servers and export the shared file system to the client nodes. In other words, NFS runs as an application of the shared file system. But there is no coordination among the NFS servers to maintain consistency/coherency of data.
To summarize, while the parallel IO architectures provide for multiple data paths to
aggregate bandwidth, the commodity components fail to integrate efficiently to provide the
required benefit. So, a portable mechanism for PFS implementation with the following
architectural characteristics is desired:
• Provide distributed and concurrent data paths from multiple client nodes to the distributed storage space so that parallel IO can exploit the aggregate bandwidth and provide low latency.
• Applications may not demand UNIX-like consistency. So, a framework is needed that provides a 'flexible consistency' policy through scalable serializing mechanisms – such as locking – so that client nodes can attain a desirable consistency level.
2 Parallel file systems
The key characteristic of a parallel file system is its ability to handle concurrent access to, and fine-grain partitioning of, large files. Concurrent access is provided as a feature by means of a shared file pointer for a set of processes. A PFS is optimized for handling IO to large files and may not be optimized for metadata-intensive workloads.
Some of the commercial PFSes are IBM GPFS [20], Intel PFS and SGI XFS. IBM GPFS is the latest commercial incarnation of the IBM Vesta parallel file system [22]. The Vesta file system supports
two-dimensional files and provides primitives to control the data layout, striping factor on
an individual file basis. File views can be set on a per open basis which enables the file
system to handle concurrency efficiently. Unlike Vesta, GPFS is a general-purpose file system with enterprise features – like reliability and availability – and is optimized for streaming applications. It provides a cache-coherent, UNIX-semantics file model on all the cluster nodes (including nodes without attached disks). Intel PFS manages concurrency and file pointers through six IO modes. Each of the modes has a different set of file access semantics and performance characteristics. Structurally and semantically, SGI XFS is similar to IBM GPFS.
Apart from Sun PFS, the Parallel Virtual File System (PVFS) – an open source PFS implementation – is available. But like Sun PFS, it has minimal performance-enhancing features. Additionally, PVFS has no kernel module (VFS) interface. Though this feature is not critical to application performance, it is important from the system administration viewpoint, as existing storage management software could then be directly used.
3 Application IO workload characteristics
Nieuwejaar [7] has studied the IO workload characteristics of scientific and engineering
applications. The characterization was done by tracing the scientific application executions
on different parallel processing machines in multiple production facilities for extended durations of time. For the current work, we use the workload from the above study as the
general representative workload.
4 Optimizations
Much of the work in parallel IO research concentrates on reducing disk accesses [19] by
aggregating smaller IO accesses or optimizing collective calls. Two-phase IO suggested in [19] optimizes collective IO by accessing all the data in the extent bounded by the first and the last byte range. This extent is contiguously partitioned among the IO nodes, which then perform IO on a single IO chunk each. In the next phase, the data is exchanged between the nodes to get the appropriate data – data shuffling. This approach has limited effectiveness – it may perform more IO than necessary for small and widely distributed data ranges; hence the benefit is highly workload dependent.
4.1 Discussion on prefetching techniques
Prefetching for the PFS has been studied to some extent in the community [14] [11]. Kotz identifies a fixed set of common access patterns based on the workload study and uses predictive prefetching for future data access. However, this approach performs very badly for new access patterns and is also not extensible or flexible enough to accommodate them.
The approach used by [11] for prefetching is fairly comprehensive. Hierarchical predictors at the local (per-thread) level and at the global (application) level coordinate to classify the IO patterns. The local predictor follows the same model as [14] but uses an analytical model based on an Artificial Neural Network (ANN) to perform the classification, making the implementation easier; it therefore has the same drawback as [14]. The global predictor uses a dynamic probabilistic function to determine future accesses, using a Hidden Markov Model (HMM) [23] for prediction. While an HMM-based model can recognize and probabilistically classify new access patterns, both ANN and HMM need a learning duration before they can be effective. In this approach, the predictor adaptively changes the underlying parallel file system's caching, consistency and prefetching policies based on its internal state. Rather than prefetching blocks, the 'predictor' performs intelligent file system policy control. Hence, this requires explicit identification and classification of access patterns for controlling the file system policies. The drawback is that new patterns which may require a different policy will be handled like the 'probabilistically nearest' access pattern, as the access pattern classification is done a priori. Also, this approach requires comprehensive and deep support from the underlying file system, which must expose the policy control mechanisms of its file system structures.
However, none of the above approaches incorporates a feedback mechanism – either on the effectiveness of the predictor or on the timeliness of the prefetch requests – and this can have a negative impact in practice, as prefetching is pure overhead when it is not useful.
Vitter [24] suggests a prefetching technique based on text compression methods. As compression methods use dynamic symbol probabilities to encode a text sequence, the same intuition can be used to predict future accesses from the probability state. Kroeger [25] uses this technique effectively to prefetch files in a single-node system.
We use the technique stated in [24] as it does provide probability states for new access
patterns as they occur in the sequence and does not rely on explicit classification of the
access patterns of interest. However, as in previous cases, we are not aware of any
framework using this mechanism that incorporates a feedback loop.
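To make the intuition concrete, the toy C sketch below uses an order-1 context model in the spirit of such compression-based predictors (far simpler than the multi-order LZ/PPM predictors discussed later): it counts which block tends to follow which, predicts the most frequent successor, and tracks a hit rate that a feedback loop could act on. The access stream is invented.

#include <stdio.h>

#define NBLOCKS 8

static int follows[NBLOCKS][NBLOCKS];   /* follows[i][j]: times block j came after block i */

/* Predict the most frequent successor of the last block seen; -1 means "no prediction yet". */
static int predict_next(int last)
{
    int best = -1, best_count = 0;
    for (int j = 0; j < NBLOCKS; j++)
        if (follows[last][j] > best_count) { best = j; best_count = follows[last][j]; }
    return best;
}

int main(void)
{
    /* A structured, repetitive access stream, as the workload studies describe. */
    int stream[] = { 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3 };
    int n = (int)(sizeof(stream) / sizeof(stream[0]));
    int hits = 0, predictions = 0;

    for (int i = 1; i < n; i++) {
        int guess = predict_next(stream[i - 1]);
        if (guess >= 0) {
            predictions++;
            if (guess == stream[i]) hits++;   /* feedback: was the prefetch useful? */
        }
        follows[stream[i - 1]][stream[i]]++;  /* update the model with the real access */
    }
    printf("hit rate: %d/%d\n", hits, predictions);
    return 0;
}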
Chapter 3
3. IO WORKLOAD AND ITS FITNESS TO PARALLEL FILE SYSTEM (PFS)
ARCHITECTURE
In the previous chapter, the performance impact of several IO architectural alternatives was
explored. The workload 'sweet spot' of each of these architectural alternatives has been identified and it is shown, in brief, that for the scientific and engineering workload the PFS architecture provides the best match to the targeted IO workload. In this chapter, we analyze the IO workload characteristics in detail and also use an example Parallel File System – Sun PFS – to illustrate the 'good' design points of PFS from the workload perspective.
1 Workload characteristics
The studies on IO characterization of parallel scientific and engineering applications have been few and far between. Nevertheless, the studies have been comprehensive and capture the access patterns at the node as well as the application level. The studies analyze the application IO signatures – temporal and spatial access patterns, request sizes, and sequential and highly irregular access patterns. Since the IO workload has deep performance implications for the IO architectures, the workload characterization helps develop more effective parallel file system policies.
Nieuwejaar [7] has identified common parallel application IO workload characteristics by
tracing the production workloads over a variety of widely used platforms – iPSC/860 and CM-5. This study has identified strong common trends in the workload characteristics that could be used as workload generalizations for parallel IO architecture design. Reed [38] traces three different applications on a particular parallel computing platform, at the individual file access level coupled with an analysis of source code, and essentially reinforces the IO characterization conclusions in [7]. Since the work at NCSA [38] has been published very recently, we take the adaptive file system policies based on these IO characterizations to be very relevant.
The summary of common parallel application workload characteristics from the above
studies is given below. The workload characteristics are also restated as a 'design wish'.
• Files between jobs are not shared. The application typically proceeds in a pipelined, functionally phased manner with files shared (no concurrent sharing) between phases. So, a parallel IO architecture need not focus on metadata-intensive workloads such as a web-server workload.
• Files are rarely read-write. The IO operations on a particular file are mostly read-only or write-only. So, the read and the write data paths can be optimized separately.
• Write traffic is higher than read traffic. So, there will be a greater gain in optimizing the write IO data path.
• Written files are rarely concurrently shared, but read files are mostly concurrently shared. So, expensive concurrency control mechanisms can be avoided for the same data region. It should be noted that files are only rarely written concurrently to overlapping data regions.
• Most request sizes are small (<4000 bytes), but most of the data is transferred through large requests. The IO architecture should optimize small IO accesses.
• Within a phase, parallel reads occur in a structured manner repetitively over different portions of the file. Since data regions are rarely re-accessed, a caching technique will not provide performance. However, this structured access should be used to improve access performance.
• Most of the IO happens in synchronous-sequential mode. So, the design can avoid keeping expensive data copies.
In C-DAC, the target application that has been available for evaluation is a seismic
application whose access pattern characteristic – simple synchronous-sequential – is already
captured above in the common workload characteristics.
The studies conclude that parallel file system designs that rely on a single, system-imposed file system policy are unlikely to be successful, and the exploitation of the IO access pattern
knowledge is crucial in obtaining a substantial fraction of the peak IO performance. The
thesis presents some of the adaptive techniques based on the access pattern characteristics.
2 Generating the workload
2.1 b_eff_io
b_eff_io [26] is a parameterized test suite that can be used to simulate most of the parallel
application IO patterns. Hence, we have used it for evaluating our parallel file system
implementation as one target workload. b_eff_io measures the system performance of file
accesses in “first write”, “rewrite”, and “read” modes. The access patterns generated are
strided (with individual and shared file pointers) and segmented collective accesses to one
shared file per application; as well as non-collective access to one file per process. The
number of parallel processes accessing the files is also varied. To normalize the effects of
the optimizations that are dependent on peculiar buffer alignments, the performance is
measured both for ‘well-formed’ (in terms of the IO size being a power-of-two) as well as
non-well formed buffers. But primarily, the accesses in this application are structured – not
random – and hence mimic many typical workloads of scientific applications. These patterns
are discussed in detail in the results section.
2.2 BTIO
Another application kernel used to evaluate the performance of the parallel IO architecture
is BTIO [27], which is representative of the Computational Fluid Dynamics (CFD)
applications on the PARAM machines. The file generated by BTIO is a write-only file. The
write workload is patterned at two levels. A segment is appended at every time step, and within a time step, the client writes happen in a strided and sequential fashion. From each of
the clients’ perspective, the writes are sequential in a file segment before the writes start in the
next segment. The segments are arranged consecutively in the output file. The write intervals are
fine grain, regular and the write request size is constant. There is no read sharing or write sharing
at the byte-level but due to the fine grain nature of the workload, there could be block level
sharing. The workload generated, in abstract, is structured write or allocate-write with no
reuse of application data (except as false shared disk blocks). The definition of the terms in
italic is given in the Glossary section.
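The style of offsets such a segmented, strided pattern produces can be written down directly; the sketch below assumes an invented record size, client count and records-per-client figure, and is meant only to illustrate the pattern, not the exact BTIO layout (which depends on the problem class).

#include <stdio.h>

int main(void)
{
    const long record_size  = 1024;   /* constant write request size (assumed)      */
    const int  nclients     = 4;
    const int  recs_per_cli = 3;      /* records per client per time step (assumed) */
    const long segment_size = (long)nclients * recs_per_cli * record_size;

    /* Each time step appends one segment; within it, clients write in a strided,
     * per-client-sequential fashion. */
    for (int t = 0; t < 2; t++)                       /* two time steps            */
        for (int k = 0; k < recs_per_cli; k++)        /* records within a segment  */
            for (int c = 0; c < nclients; c++) {      /* strided across clients    */
                long off = t * segment_size + ((long)k * nclients + c) * record_size;
                printf("t=%d client=%d record=%d -> offset %ld\n", t, c, k, off);
            }
    return 0;
}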
The remainder of this chapter examines an example Parallel File System – Sun PFS – and shows that the above application workload characteristics are best matched by the PFS architecture. The architecture of the Sun PFS and the MPI implementation on top of it is described in detail in [28]. However, even with the PFS architecture providing the best match from the performance perspective, many optimizations are still possible given the IO access pattern characteristics. The following section provides background so that the optimizations can subsequently be described.
3 SUN PFS
The SUN PFS system follows the client-server paradigm with multiple file servers. The file
data is striped on all the file servers with each file server managing the ‘striped’ data of the
files on the locally attached disks. The clients access the data by issuing requests to the
servers and reconstruct the file data. A file server can handle multiple disks belonging to
multiple file systems. Simply put, an aggregation of disks (along with associated file servers)
constitutes a PFS. This is depicted in Figure 3.
Figure 3: Sun PFS architecture
Figure 4: Software components of Sun PFS
The notion of the file system is constructed at the client side, which provides the required consistency behavior of the file system. The servers simply provide access to the managed storage objects and maintain the locks on the storage objects (but do not enforce access control as in the OBSD architecture [29]). The servers fully manage the data objects as far as the allocation, creation, de-allocation and destroy operations of the storage objects are concerned. The file system operations are implemented as communication operations between the clients and
the file servers. The client-server model of Sun PFS does not involve any inter-client or
inter-server communication, thus simplifying the state maintenance in the file system design.
Both MPI-IO and UNIX interfaces are provided on the client nodes. UNIX interface is
implemented through a VFS module. MPI-IO interface uses MPI implementation for
communication. The inodes, or the metadata of the parallel file system, are again maintained on the servers in a distributed fashion. So, there is no requirement for a persistent store on the client nodes.
4 SUN PFS software architecture
The software components of Sun PFS are depicted in Figure 4.
The runtime library provides the MPI-IO interface to the application program on the client
node. The MPI-IO interface is implemented on the Sun PFS. The VFS layer provides the
UNIX interface with binary compatibility. Both interfaces share the file system
namespace. To ease debugging, the interaction between the VFS component and the server
or IO daemon is handled by a 'proxy' daemon. The IO daemon runs on the server node
and provides access to the locally attached storage. The file system and file metadata are
accessed and manipulated only through the VFS interface; the runtime library is used only to
accelerate data access. The IO daemon is a multi-threaded process that provides an OBSD-like
interface performing block and directory management, and it provides minimal buffering
capabilities.
The data layout of a PFS file is depicted in Figure 5.
Figure 5: The data layout of a PFS file
A logical PFS file is a collection of subfiles, with each subfile residing on an IO daemon
configured in the cluster. The PFS file is de-clustered across the file servers and hence each
chunk on the IO server can be independently accessed.
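As a concrete illustration of this de-clustering, the sketch below maps a logical byte offset to an IO server and a block within its subfile under round-robin striping. It is not the actual Sun PFS code; the function and field names and the use of the 32 Kbyte file system block size are assumptions for illustration.

```c
/* A minimal sketch of round-robin de-clustering: logical block i goes to
 * server (i mod num_servers) as block (i / num_servers) of that subfile. */
#include <stdio.h>

#define BLOCK_SIZE (32 * 1024)   /* assumed file system block size */

struct chunk_location {
    int  server;          /* index of the IO server holding the chunk  */
    long subfile_block;   /* block offset inside that server's subfile */
};

static struct chunk_location locate(long offset, int num_servers)
{
    long logical_block = offset / BLOCK_SIZE;
    struct chunk_location loc = {
        .server        = (int)(logical_block % num_servers),
        .subfile_block = logical_block / num_servers,
    };
    return loc;
}

int main(void)
{
    /* With 4 IO servers, logical block 5 maps to server 1, subfile block 1. */
    struct chunk_location loc = locate(5L * BLOCK_SIZE, 4);
    printf("server %d, subfile block %ld\n", loc.server, loc.subfile_block);
    return 0;
}
```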
5
Meeting the requirements based on workload characteristics
Avoiding the meta-data cache on the client increases the cost of meta-data operations on the
client, but it also avoids expensive concurrency mechanisms among the clients for the metadata
and provides a simpler data path from application to storage by avoiding the VFS layer during
normal IO access. De-clustering of data improves IO operation performance, as seen in Chapter 2.
The server-client protocol is synchronous, as the workload is also mostly synchronous-sequential.
Data caches are also avoided, as data reuse is very minimal in the applications: data is read once,
processed, and new data is generated and written to a new file.
As seen, the SUN PFS provides the right architecture for parallel IO, but there is still a lot of
scope for improvement, described in the next section.
6
Scope for improvement in Sun PFS
While the Sun PFS provides the parallel IO architecture, performance enhancement
opportunities exist, such as new techniques to hide disk latency as well as new designs to
handle large data structures and the processing of files/subfiles. New techniques could leverage
the applications' regular access patterns to provide better performance. By learning the IO
patterns, data could be accessed in advance to hide the disk latency. One such technique
investigated is predictive prefetching, which uses effective predictors to prefetch data. Flow
control mechanisms also ensure an efficient parallel IO pipeline between the clients and the
storage. Apart from these techniques, module design level optimizations such as unified
buffering in the IO server and the design of large data structures have also been investigated;
these are a manifestation of the code implementation but provide significant performance
improvement for the overall parallel IO system.
We describe and analyze the above stated performance enhancements to the base SUN PFS
along with the results in the subsequent chapters.
Chapter 5
4. C-DAC PFS: OPTIMIZATIONS
C-PFS essentially retains the same architecture, file system layout and structure, and access
methods as the Sun PFS, but adds optimized parallel IO techniques. The foci of the parallel
IO optimizations are:

A disk being the slowest component in the IO pipeline, minimize accesses to the
disk wherever possible. Techniques could be buffering to merge accesses or caching
whenever data is reused.
o As seen in the previous section, the targeted workload is either read-once or
write-once-no-read (except internally due to false sharing), so caching
may not help. Buffering may help reduce disk accesses provided
locality information is available. Also, caching could be very complex in a
distributed environment, hence it has not been considered for optimization.

Hide the disk latency from the computation. Possible techniques are prefetching or
asynchronous IO access, if applications appropriately use asynchronous MPI-IO calls.
o Prefetching could be beneficial in the targeted workload due to its structured
access patterns. Support for asynchronous IO is provided in the base Sun PFS.

Adaptive communication buffer resizing.
o An important consideration in a striped access system such as PFS during a
collective MPI-IO call is to maintain a ‘distributed pipeline’ in such a way
that network and disk bandwidths are efficiently utilized. This may involve
adaptively packetizing the IO buffer.

The above techniques should use bounded memory resources and avoid copies and
should be computationally efficient to avoid any overheads.
The modified software architecture of the Sun PFS after incorporating the above changes is
shown in Figure 6.
Figure 6: C-DAC PFS architecture
1
Adaptive predictive prefetching
In the current work, we show that in the parallel file system environment with de-clustered
files, the technique of predictive prefetching shown in [24] can be effectively used for
online prefetching of data. Another significant heuristic we use (based on the workload
characteristic that accesses are sequential or consecutive) is to maintain the first-order
differences of the logical block numbers, instead of the block numbers themselves, in the
predictor state to capture the access pattern. Furthermore, a framework has been devised that
provides feedback on the effectiveness of the predictors – determined both by the goodness of
the prefetch algorithm and by the arrival of prefetch requests relative to the data requests
(that is, the IO pipeline behavior). The number of blocks prefetched as well as the frequency
of prefetch requests is adaptively changed based on the feedback, so that prefetching is most
effective.
The mechanism and the framework have been integrated with the Sun PFS architecture.
Figure 7: Prefetching framework and predictor. Architecture schematic
1.1
Prefetch mechanism
The predictors are local predictors and are part of each process of the parallel application, as
shown in Figure 7. Each of the predictors, hence, executes concurrently. Prefetch buffers
are maintained in the IO servers and managed in LRU fashion.
The two predictors that we implement are based on the text compression techniques,
namely the Lempel-Ziv (LZ) algorithm and Prediction by Partial Matching (PPM) [30]. The first-order
difference of the block numbers is used in the algorithms. Our experience shows that
the first-order difference suffices to identify an access pattern in most cases.
APIs for integrating new predictors have been provided so that users can specify different
predictors without recompiling the application.
1.1.1
LZ predictor
The LZ method breaks the input string into substrings and builds a parse tree as it
progresses through the string of block differences. Each path from the root to a leaf node
represents a substring. When a new block arrives, the difference between the last block
number and the new block number is determined. It is then checked whether the current node
of the parse tree, which may be the root node in the case of a new substring or any other
node if an existing substring is being revisited, has a child with this difference. If yes, then its
count is incremented and it is made the current node. Otherwise, a new node is added as a child
of the current node, in which case a new substring has been formed and hence the current
node is set back to the root. If an existing substring is revisited, then the visit counts of the
nodes along that path are incremented. Predictions are made by following the most
probable path from the current node. Adding the differences in the nodes along the most
probable path to the current block number yields the predicted block numbers.
Figure 8 shows this using an example. Suppose we were to predict the next 4 blocks after
43. Since the current node is the root and the current block number is 43, the next four
blocks after following the most probable path (root-> 1(5) -> 2(4) -> 1(2) -> 1(1)) would be
44, 46, 47, 48.
The number of nodes in the tree is proportional to the number of substrings, so the tree
may keep growing for a large file. It is therefore necessary to put an upper bound on the
number of nodes in the tree and to update the tree by deleting the least probable paths.
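A minimal sketch of such an LZ-style predictor over first-order block differences is given below. It is not the PFS source; the bounded child count and all identifiers are illustrative assumptions, but it follows the parse-tree construction and most-probable-path prediction described above.

```c
#include <stdlib.h>

#define MAX_CHILDREN 16

struct lz_node {
    long diff;                           /* first-order block difference at this node */
    long count;                          /* visit count                               */
    int  nchildren;
    struct lz_node *child[MAX_CHILDREN];
};

static struct lz_node root;              /* root of the parse tree                    */
static struct lz_node *current = &root;  /* node reached by the substring in progress */
static long last_block = -1;

/* Update the parse tree with a newly observed logical block number. */
void lz_observe(long block)
{
    if (last_block >= 0) {
        long diff = block - last_block;
        int i;
        for (i = 0; i < current->nchildren; i++) {
            if (current->child[i]->diff == diff) {
                current->child[i]->count++;      /* existing substring revisited */
                current = current->child[i];
                last_block = block;
                return;
            }
        }
        /* New substring: add a child (growth is bounded) and restart at the root. */
        if (current->nchildren < MAX_CHILDREN) {
            struct lz_node *n = calloc(1, sizeof(*n));
            if (n) {
                n->diff  = diff;
                n->count = 1;
                current->child[current->nchildren++] = n;
            }
        }
        current = &root;
    }
    last_block = block;
}

/* Predict up to 'want' future blocks by following the most probable path downward. */
int lz_predict(long *out, int want)
{
    struct lz_node *node = current;
    long block = last_block;
    int n = 0;

    while (n < want && node->nchildren > 0) {
        struct lz_node *best = node->child[0];
        int i;
        for (i = 1; i < node->nchildren; i++)
            if (node->child[i]->count > best->count)
                best = node->child[i];
        block += best->diff;        /* add the difference to get the next block number */
        out[n++] = block;
        node = best;
    }
    return n;
}
```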
Figure 8: LZ example
1.1.2
PPM predictor
The PPM technique is based on Finite Multi-Order Context Modeling, where the models
condition their predictions on a few immediately preceding symbols that form the context. The
length of the context is called the order.
The block differences are placed in a tree-based data structure with a visit count associated
with each node. The height of the tree is limited by the maximum order. A path from the
root to a node represents a context that has been seen previously. An array of k pointers
tracks the current contexts of order 0 to k-1, which sit at levels 0 to k-1 in the tree. When a new
block arrives, the difference between the last block number and the new block number is
determined. Then the children of each of the current contexts, C(i) where i = 0 to k-1, are
checked to see if they have this difference. If such a child exists then this sequence has
occurred before, so the context C(i+1) is set to point to this child and its count is
incremented. Otherwise this sequence has occurred for the first time and so a child denoting
this difference is created and the context C(i+1) is set to point to the new node. All the
contexts are updated in this manner for every new block. Predictions are made either by
taking the most probable path of all the contexts or the most probable path at the highest
context.
Figure 9 shows how the PPM algorithm works using an example. The tree grows in breadth
while the height remains constant. The tree should be updated regularly to keep the number
of nodes below a threshold by removing the less probable subtrees at the root
node.
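The following is a similar hedged sketch of an order-k PPM predictor over block differences. The maximum order, the bounded child count and the identifiers are assumptions, and only a single-block prediction from the highest available context is shown.

```c
#include <stdlib.h>

#define MAX_ORDER    3
#define MAX_CHILDREN 16

struct ppm_node {
    long diff;
    long count;
    int  nchildren;
    struct ppm_node *child[MAX_CHILDREN];
};

static struct ppm_node root;
static struct ppm_node *ctx[MAX_ORDER + 1];   /* ctx[i] = current order-i context */
static long last_block = -1;

static struct ppm_node *find_or_add(struct ppm_node *p, long diff)
{
    int i;
    for (i = 0; i < p->nchildren; i++)
        if (p->child[i]->diff == diff)
            return p->child[i];
    if (p->nchildren == MAX_CHILDREN)
        return NULL;                           /* bucket full: bounded growth */
    struct ppm_node *n = calloc(1, sizeof(*n));
    if (n) {
        n->diff = diff;
        p->child[p->nchildren++] = n;
    }
    return n;
}

void ppm_observe(long block)
{
    if (last_block < 0) {                      /* first access: all contexts at the root */
        int i;
        for (i = 0; i <= MAX_ORDER; i++)
            ctx[i] = &root;
        last_block = block;
        return;
    }
    long diff = block - last_block;
    int i;
    /* Extend every context with the new difference; walk from the highest order
     * down so that the lower-order contexts read below are still the old ones. */
    for (i = MAX_ORDER; i >= 1; i--) {
        struct ppm_node *n = find_or_add(ctx[i - 1], diff);
        if (n) n->count++;
        ctx[i] = n ? n : &root;
    }
    ctx[0] = &root;
    last_block = block;
}

/* Predict the next block from the highest-order context that has children. */
long ppm_predict(void)
{
    int i;
    for (i = MAX_ORDER; i >= 0; i--) {
        struct ppm_node *c = ctx[i];
        if (c && c->nchildren > 0) {
            struct ppm_node *best = c->child[0];
            int j;
            for (j = 1; j < c->nchildren; j++)
                if (c->child[j]->count > best->count)
                    best = c->child[j];
            return last_block + best->diff;
        }
    }
    return -1;                                 /* no prediction available */
}
```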
Figure 9: PPM example
1.1.3
Prefetch mechanism integration with Sun PFS
The predictor is part of the client process runtime library, which issues prefetch commands;
prefetch buffers are maintained in the IO servers to avoid consistency problems. As the files
are de-clustered, if the predictor were in the IO server, it would be difficult to ascertain the
application access pattern, as the server would see only the translated portion of the logical
file access pattern. A further advantage of the predictor being in the runtime library is that the
application can plug in its own predictor in the framework. The prefetch block sequence is
obtained before the translation to the IO servers is done, and a prefetch request is issued to
each of the IO servers after translating the logical block numbers. Prefetch buffers interpose
between user requests and the disk on the IO server and are maintained in LRU order.
1.1.4
Prefetch framework
This framework controls the behavior of the prefetch request processing on the IO server.
The framework ascertains the efficacy of the predictor algorithm and, even when the predictions
are accurate, whether the prefetch request is processed in time to satisfy the subsequent data
request(s). The feedback is sent to the prefetch command issuer (usually the client).
The prefetch request fetches a window of blocks intended to satisfy more than one subsequent
data request. The terms we use are: the prefetch window, which is the set of blocks that are
prefetched to benefit the next few data requests, and the time window, which is the gap
between two subsequent prefetch requests measured in terms of the number of
intermediate data requests. The values of these two windows may vary across the
application. We adopt a heuristic approach.
We start with some predefined prefetch window on the client side. A prefetch request may
be followed by many data requests but we do not know how many data requests the blocks
of a particular prefetch request may cover. So we set the time between two prefetch requests
to some heuristic value, say the next prefetch request is sent after 'n' data requests. The
prefetch window and time window sizes are adjusted depending on the following factors on
the server side.
o Prefetch Window Fully/Partially Satisfied (WFS/WPS).
o High/Low Hit Rate (HHR/LHR).
WFS means that the next prefetch request arrives only after the previous one is completed.
WPS means that the next prefetch request arrives too early. This may occur because the
previous predictions were not very accurate, so some data requests also had to be
scheduled along with the prefetch requests, requiring the prefetch requests to take
longer to complete. Figure 10 shows the policy decision tree incorporated in the prefetch
framework to tune the prefetch issuer behavior. The hit rate is measured as the usage, by data
requests, of the prefetch buffers requested between two prefetch requests.
Figure 10: Prefetch policy
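The exact decision tree is the one in Figure 10; the sketch below shows one plausible way the issuer could react to the WFS/WPS and HHR/LHR feedback by resizing the two windows. The concrete adjustments are illustrative assumptions, not the implemented policy.

```c
/* Tunables adjusted by the feedback loop. */
struct prefetch_policy {
    int prefetch_window;   /* blocks fetched per prefetch request         */
    int time_window;       /* data requests between two prefetch requests */
};

enum completion { WFS, WPS };   /* prefetch window fully / partially satisfied  */
enum hitrate    { HHR, LHR };   /* high / low hit rate on the prefetched blocks */

void adjust_windows(struct prefetch_policy *p, enum completion c, enum hitrate h)
{
    if (c == WFS && h == HHR) {
        p->prefetch_window++;          /* accurate and timely: prefetch more          */
    } else if (c == WFS && h == LHR) {
        p->prefetch_window--;          /* timely but poorly predicted: prefetch less  */
    } else if (c == WPS && h == HHR) {
        p->time_window++;              /* useful but issued too often: space them out */
    } else {                           /* WPS and LHR */
        p->prefetch_window--;          /* neither timely nor useful: back off on both */
        p->time_window++;
    }
    if (p->prefetch_window < 1) p->prefetch_window = 1;
    if (p->time_window < 1)     p->time_window = 1;
}
```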
2
Adaptive communication buffer resizing
In PFS, the clients access data across the network through the IO servers. IO servers
maintain staging buffers for these accesses. These buffers are populated by the disk and
consumed by the network for read requests and vice-versa for write requests. It is observed
that if there is a mismatch between network and disk bandwidths, either the network or the
disk will idle waiting for the other sub-system to finish using the buffer. The idling period
depends on the size of the staging buffer as well as the relative speeds of network and disk.
Disk and network processing can be pipelined by choosing a disk buffer size that matches
the relative disk and network bandwidths and the amount of data requested by the various
clients.
We model this situation and arrive at the packetization strategy. Figure 11 depicts the model
for the collective IO.
Figure 11: System model for collective IO
2.1
The I/O time cost model
Let us assume a parallel file system with m servers and n clients that are part of an application
accessing a common amount of data, such that at each server d bytes of data are to be accessed
from disk (assuming a single disk is configured) and transferred to the n clients.
So at each server, the time T spent for the entire IO is

T = (d bytes accessed from disk) + (d/n bytes transferred to each of the n clients)
T = d/d_bw + Od + ((d/n)/n_bw + On) * n

where
  d_bw - disk bandwidth
  Od   - disk latency (seek time + access time)
  n_bw - network bandwidth
  On   - network latency (round trip)

Now if we split the data d at each server into k parts such that the disk and network accesses
can be pipelined,

T =   d/(k*d_bw) + Od                                      // first disk access
    + (k-1) * max(Od + d/(k*d_bw), d/(k*n_bw) + On*n)      // pipelined access
    + d/(k*n_bw) + On*n                                    // last network access

If network limited (d_bw > n_bw), differentiating to find the optimal value of k gives
  k = sqrt(d/(n*On*d_bw))
otherwise
  k = sqrt(d/(Od*n_bw))
2.1.1
Scenario with m IO servers
As the parallel file system has multiple I/O servers, the correct I/O time cost model can be
built only when we consider all the servers; that is, an application I/O is complete only when it
receives the requested data from all the I/O servers, as the data is distributed equally among
all the I/O servers in a round-robin manner. So in any access, the I/O cost is the sum of the d
bytes of disk access at each server (assuming that the n clients' accesses result in d bytes at each
server) and of accessing the respective parts (d/n) at the client from all the servers.
T = d/d_bw + Od + ((d/n)/n_bw + On) * m

Now if we split the d bytes into k parts such that disk and network operations can be
pipelined,

T =   d/(k*d_bw) + Od                                            // first disk access
    + (k-1) * max(d/(k*d_bw) + Od, (d*m)/(n*k*n_bw) + On*m)      // pipelined access
    + (d*m)/(n*k*n_bw) + On*m                                    // last network access

If network limited (d_bw > n_bw*(n/m)), differentiating to find the optimal value of k gives
  k = sqrt(d/(m*d_bw*On))
otherwise
  k = sqrt((d*m)/(n*n_bw*Od))
So, based on the relative speeds of the disk and network and number of participants in a
collective IO operation, the individual messages from the clients will be packetized to
maintain the IO pipeline.
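As a hedged illustration of how the packetization factor could be computed from this model, the sketch below evaluates the two closed-form expressions for k; the function name and the clamping to a minimum of one part are assumptions.

```c
#include <math.h>

/* Number of parts each server's d bytes should be split into so that disk
 * and network stay pipelined during a collective call over m IO servers. */
int choose_k(double d, int n, int m,
             double d_bw, double n_bw,   /* disk / network bandwidth (bytes/s) */
             double Od, double On)       /* disk / network latency   (seconds) */
{
    double k;
    if (d_bw > n_bw * ((double)n / m))   /* network limited */
        k = sqrt(d / (m * d_bw * On));
    else                                 /* disk limited    */
        k = sqrt((d * m) / (n * n_bw * Od));
    if (k < 1.0)
        k = 1.0;                         /* never split below one part */
    return (int)(k + 0.5);
}
```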
3
Optimizations in the Sun PFS implementation
The optimizations above have been motivated by the parallel file system
architecture. As mentioned earlier, in practice there are many factors that affect the
performance of the PFS. We focus next on the following optimizations that are based on the
implementation of Sun PFS.

Design of large data structures.
o Scientific applications typically tend to operate on large files. The metadata
information needs to be kept in highly efficient data structures.

Unified buffering in the IO server.
3.1
Design of large data structures
The block list mentioned in Figure 13, in the IO server, maintains the logical-to-physical
block number translations. For a large file, this data structure could contain a large number of
entries, with each entry maintaining the translation for a logical block. The operations on
this data structure are: insert translation, extent build, change translation, and purge
translation. In the original implementation, purge translation was not implemented and the
block list was implemented as a linear list.
A pluggable framework has been implemented where the other modules in the IO server
access the block list through abstract data structure operations. For ease of
implementation while retaining the performance benefit, the principle of red-black trees [31], as
opposed to B-trees, AVL trees, etc., is used to implement the block list. A modified version of
R-B trees is used:

Each node in the R-B tree is a fixed-size bucket containing the block translations.
An upper and a lower bound (on logical block numbers), both dynamic, describe the bucket.
All the logical block translations of the blocks between the bounds,
whose translations exist and are required, are maintained in this bucket in sorted
order. Figure 12 depicts a node of this tree.

For efficient linear search, each bucket node will have at least half the entries filled.

A depth-first search of this data structure is used for extent building.
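A minimal sketch of such a bucketed node is shown below; the bucket size, field names and the linear lookup are illustrative assumptions consistent with the description above, not the actual C-PFS structures.

```c
#define BUCKET_SIZE 64

struct translation {
    long logical_block;
    long physical_block;
};

/* One node of the modified red-black tree: a bounded bucket of translations. */
struct rb_bucket_node {
    long lower_bound;                       /* smallest logical block in the bucket */
    long upper_bound;                       /* largest logical block in the bucket  */
    int  nentries;                          /* kept at least half full              */
    struct translation entry[BUCKET_SIZE];  /* sorted by logical_block              */

    enum { RED, BLACK } colour;             /* usual red-black bookkeeping          */
    struct rb_bucket_node *left, *right, *parent;
};

/* Linear search inside a bucket once the tree walk has located it. */
static long lookup_in_bucket(const struct rb_bucket_node *b, long logical)
{
    int i;
    for (i = 0; i < b->nentries; i++)
        if (b->entry[i].logical_block == logical)
            return b->entry[i].physical_block;
    return -1;   /* no translation cached for this block */
}
```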
Figure 12: Node of the modified red-black tree
3.2
Unified buffering in the IO server
The IO server does not attempt to perform concurrency control on the storage it manages.
Since the clients perform locking at the byte-range level, mutually exclusive ranges could fall in
the same block. In the original design, the IO accesses on such byte ranges would be performed as
multiple separate read-modify-write requests, and the same block could be present in two different
staging buffers in the IO server. The internal design has been changed so that a unified
buffer is maintained with the accesses as an attached list.
The various lists shown in Figure 13 – data requests, disk requests, ready buffers – could
have copies of the same block as per the original design. The modified version has at most a
single buffer copy for each data block managed by the IO server.
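The following sketch illustrates the idea of a single buffer copy per block with the byte-range accesses attached as a list; the structure and field names are assumptions for illustration.

```c
/* One pending byte-range access on a block. */
struct access {
    long offset;             /* byte offset within the block      */
    long length;
    int  is_write;
    struct access *next;     /* next access pending on this block */
};

/* At most one in-memory copy per data block, shared by all its accesses. */
struct unified_buffer {
    long block_number;           /* block managed by this buffer            */
    char *data;                  /* single copy used by every attached access */
    struct access *pending;      /* read-modify-write requests queued here  */
    struct unified_buffer *next; /* chain of buffers held by the IO server  */
};
```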
Figure 13: Internal structure of the IO server
Chapter 6
5. RESULTS
1
Test Infrastructure
The C-PFS has been developed and tested on these platforms.

PARAM Padma: Cluster of 40 nodes of IBM POWER4 1.0 GHz 4-way SMP
machines interconnected by Gigabit Ethernet – with 8 being IO nodes with locally
attached disk; disk attachment is across Ultra SCSI rated at 80 MBps.

PARAM 10000: Cluster of 20 nodes of Sun UltraSPARC-II 300 MHz 4-way SMP
machines interconnected by Fast Ethernet – with 4 being IO nodes with locally
attached disk; disk attachment is across Ultra SCSI rated at 40 MBps.
The tests are conducted on smaller clusters also. Generally, in the test configuration, the
client nodes to IO server ratio is maintained at 4:1, unless specified otherwise. The file
system block size is 32 Kbytes.
For the NFS configuration, the NFS server is one of the cluster nodes specified above.
The test applications are b_eff_io and BTIO. b_eff_io is a synthetic benchmark where the
problem size increases with the number of processes, while BTIO has a constant
problem size irrespective of the number of processes.
2
Results of the optimizations
In this section the results of the optimizations done are discussed. We primarily use
b_eff_io and BTIO for benchmarking.
2.1
File System Characterization
As mentioned earlier, we will characterize the file system against the commonly used access
patterns in parallel applications. The b_eff_io benchmark will be used to generate the workload.
We first correlate the b_eff_io access patterns with the target workload as mentioned in [7]. The
benchmark generates accesses in first-write, rewrite and read modes for varying buffer
sizes. The patterns are:

Type 0 - strided collective access, scattering large chunks in memory to/from disk.
o
Synchronous–sequential mode of access with equal amount of data from the
nodes. There is no byte-sharing. Operations from the nodes could execute
in any order.

Type1 - strided collective access, but one read or write call per disk chunk.
o Same as type 0, but operation execution needs to be in order.
o Types 0 and 1 constitute 78% of accesses in scientific workloads.

Type2 - noncollective access to one file per MPI process, i.e., to separate files.
o Local-independent mode of access.
o Type 2 constitutes 0.88% of accesses in scientific workloads.

Type 3 - same as (2), but the individual files are assembled to one segmented file.
o Global independent mode of access. Each node accesses a different
segment in the concurrently shared file.

Type 4 - same as (3), but the access to the segmented file is done collectively.
o Same as type 3, but the accesses are collective – in step among all the nodes.
o Types 3 and 4 constitute 11.9% of accesses in scientific workloads.
These access types characterize the file system's ability to handle multiple exclusive streams of
data access.
2.1.1
B_eff_io execution characteristics
B_eff_io is a timed benchmark. Roughly, the benchmark spends an equal amount of time on
each type of access (0, 1, 2, 3, 4). Within a type, the IO operation is repeated until the allotted
time elapses. In general, about 60% of the traffic is generated by type 0 alone, and the rest of the
traffic is divided equally among the other types.
2.2
NFS experiment results
We compare the results of the NFS-based parallel IO setup with those of PFS.
b_eff_io is run on an 8-node client setup of PARAM Padma. For PFS, we use a 4-node
IO server setup, and a single node serves NFS.
Figure 14: Comparison of NFS and PFS for b_eff_io. Read/Write workloads for
type 0 (75 % of targeted workload) shown. For others, only the read workload is
shown. PFS has 4 IO servers
Figure 14 depicts the performance comparison between PFS and NFS for b_eff_io.
Read and write behavior is shown for type 0. As the behavior of reads and writes
is roughly the same for the other types, only the read graphs are shown for them.
NFS behavior: For all the types, NFS shows the same behavior – there is a 'plateau' effect
beyond 1 MB and no scaling with respect to the request size. Also, the performance does
not degrade for non-well-formed request sizes. This means the bottleneck could
fundamentally be in the NFS architecture rather than in the NFS protocol. A possible NFS
architectural change to give better performance could be de-clustering of files. NFS shows
negative scaling when the number of client processes is increased.
PFS behavior: PFS shows good scaling with increasing request size. However, for types
1, 2 and 3, there is negative scaling when the number of client processes is increased.
Performance degrades slightly for non-well-formed request sizes. For type 0, the
performance and scalability for both read and write are very good, showing the architectural
benefits of PFS.
Analysis: PFS performs better than NFS for all the types and shows scalability with respect to
both request size and number of processes, in both read and write modes, except
in three types (which constitute nearly 20% of the workload seen in a typical scientific
application workload [7]). Type 1 requires writes from the processes to happen in a specific
order, and the PFS graphs show that this ordering mechanism is not scalable. Type 2 is a local
independent workload; the performance figures for type 2 indicate that handling multiple
files is not scalable in PFS. This is by design, as PFS is optimized for large file access.
Type 3 performs uncoordinated segmented file access from multiple processes. The
reason for the negative scalability with the number of processes in type 3 is that PFS does not
perform buffering at the IO server level; the uncoordinated accesses from different processes,
though to the same file, fall in different file regions. Multiple small IO requests thus result in
decreased scalability as the number of processes, and hence the number of file segments, grows.
For types other than 0, the performance is initially poor for both NFS and PFS, as data is
written in single small chunks; since PFS is optimized for handling large requests, it performs
poorly for small requests.
Num Procs    Class B (9 procs)    Class B (64 procs)
PFS          1940 secs            3725 secs
NFS          10552 secs           Not measured

Table 2: BTIO results on 16 nodes; PFS has 4 IO servers
BTIO has an even more fine-grained access pattern and incurs a heavy penalty with the NFS
kind of architecture, as can be seen in Table 2.
On the whole, the PFS performs better than NFS on all types. As per the targeted workload
(more than 80 % of the workload types) – type 0, type 3 & type 4 – PFS performs far better
than NFS.
2.3
Distributed-NFS experiment results
Since the parallel IO architecture based on NFS performs very poorly for the scientific
workload, we experimented with a commodity setup (based on NFS) that provides a logical
de-clustering of files.
A 6-node IO server tier (each node a 4-way Sun UltraSPARC 900 MHz with 2 Gigabit interfaces
and 2 host bus adapters to a Storage Area Network) running Sun QFS exports the same
storage on all the IO nodes. Each of the IO server nodes runs an NFS server, effectively
exporting the same storage. Client nodes in PARAM Padma run standard NFS clients and
each client mounts from one IO server (statically fixed policy). So, an application running on
multiple client nodes can perform IO through different IO servers, thus in a logically
de-clustered fashion.
The b_eff_io test is conducted with 4 client nodes in the PARAM Padma cluster, all
mounting the same storage from different servers. The performance is abysmally poor and
data integrity is not maintained. The data integrity problem arises due to the lack of
synchronization between the NFS servers. The poor performance is attributed to the strong
consistency semantics offered by QFS. In PFS, even though there is no coordination
between the IO servers, client nodes synchronize and decide the desired consistency level.
In the QFS-based scenario with NFS access, by contrast, there is no coordination between either
the clients or the servers. The results have not been shown, as data integrity is not
maintained.
The results of the various optimizations are described now.
2.4
Prefetch mechanism and framework results
The core component of the prefetch framework is the predictor module that captures the
data block access history and predicts the future block accesses.
2.4.1
Predictor module
To characterize the LZ-based predictor module behavior without any external influence –
pure prefetching – we simulate the application execution using the block trace. The block
trace is fed to the predictor module, and the module output – the predicted block
list – is compared against the subsequent data block requests (as mentioned earlier, the first-order
differences are stored in the predictor module, not the actual block numbers).
The hit rate for a prefetch request is the percentage of blocks from the current
prefetch request that are used by the data requests falling between the current prefetch request
and the subsequent prefetch request. This approach does not measure the coverage of the
subsequent data block accesses by the current prefetch request. This is because, in practice, the
prefetch requests could be serviced along with the data requests and are useful only if they
arrive before the data requests. So, the focus of the optimization is to fetch enough blocks,
constrained by the availability of buffers and by how soon the data requests will follow; the
hit rate measures how usefully the prefetched blocks satisfy the subsequent data requests (or
subsets of them).
The prefetch policy discussed subsequently adjusts the prefetch window size depending on
its effectiveness – correct prediction and timely service of the prefetch request. Figures 15, 16
and 17 depict the performance of the LZ predictor for the b_eff_io, BTIO and Parkbench [36]
benchmarks, with a prefetch window size of 8.
Figure 15: Hit rate for b_eff_io in pure prefetch mode
Figure 16: Hit rate in BTIO in pure prefetch mode
Figure 17: Hit rate in Parkbench (matrix 3D) in pure prefetch mode
It can be seen that 65% of the blocks prefetched during the BTIO application run, 78% of the
blocks prefetched during the b_eff_io benchmark and 87% of the blocks prefetched during the
Parkbench run satisfy the subsequent data accesses. The effectiveness of the predictor in
practice will, however, depend on the runtime behavior of the disks and the timing of client
request issues.
2.4.2
Prefetch mechanism
The prefetch mechanism with the LZ predictor is tested with the b_eff_io benchmark. The
results are provided for an 8-client-node setup with 2 IO servers in the PARAM Padma
cluster. The application run configuration is varied over 4-node and 8-node runs with 1
process/node and 2 processes/node. Only the read graphs with 1 process/node are shown,
as the write/rewrite graph characteristics match the read.
A careful observation of the result graphs shown in Figure 18 suggests that the prefetching
mechanism does not change the b_eff_io behavior or the characteristics of the parallel IO
but just enhances the performance for some patterns.
Figure 18: Performance readings of b_eff_io on PFS with prefetch mechanism. As
with previous readings, read & write workload for only type 0 is shown. PFS has
2 IO servers
Type 0 gives the best increase in performance among all the access types. This is because in
type 0, every process performs IO in terms of an IO segment in every IO call. The IO
segment has multiple non-contiguous chunks from a process in strided fashion. The IO
pattern formed by the IO segments from all the processes globally forms a consecutive sequence
on the logical file. This helps, firstly, in faster building of the IO pattern in the predictor state
and, secondly, in coalesced IO access. Almost 75% of the traffic generated in the benchmark
comes from this type.
In all the other types, the process' IO segment contains exactly one IO chunk and the global
IO segment is a much smaller contiguous segment than in type 0.
For the prefetching optimization, types 1 & 2 do not show any performance improvement,
type 3 actually shows performance degradation and type 4 shows slight improvement in
performance.
B_eff_io, as mentioned earlier, generates five types of traffic patterns (types 0, 1, 2, 3, 4).
The traffic handled during type 1 is one-tenth of the total traffic generated
by the benchmark; its IO is synchronous, performed in single chunks, and the IO requests are
spread across many chunk sizes. So, the building of predictor state for useful
prefetching is not effective, as seen from the performance figures.
Types 2 and 3 perform uncoordinated independent accesses. The prefetch mechanism at this
time has a global LRU policy for buffer replacement and does not recognize individual
prefetch streams from the clients (though hit rates are measured on a per-client basis). In this
scenario, there could be contention for buffers, and a fast client could replace some
useful buffers of another client. The buffer contention has been observed and its impact
can be seen in the degradation of performance for type 2 and 3 accesses.
Type 4 also performs independent accesses, but they are coordinated, so the contention for
the buffers is limited, unlike the previous case.
Also, as the number of processes per node increases, the efficacy of the prefetch mechanism
comes down due to buffer contention.
An important consideration in practice is the memory management of the prefetch buffers,
which significantly impacts the effectiveness of the prefetch mechanism. Currently, the
prefetch buffers are statically allocated a certain fraction of available memory. Since fast
allocation and deallocation of prefetch buffers is essential for effectiveness, having a large
number of prefetch buffers could have a negative impact unless they are managed with a fast
buffer-management mechanism. At this time, we are using a simple buffer management policy
to manage the prefetch buffers that is not scalable to a large number of prefetch buffers.
Results for BTIO are shown in Table 3.
Num Procs             Class B (9 procs)    Class B (64 procs)
PFS (with prefetch)   2084 secs            3006 secs
PFS                   1940 secs            3725 secs

Table 3: BTIO results on 16 nodes; PFS configurations with 4 IO servers
The BTIO workload, as discussed above, has fine-grained partitioned regions. BTIO is a
'constant output file size' application written over a fixed set of iterations. This means
writes from the processes are finer grained and small and, given the block size of 32K,
more writes will hit the same block, so the predictor will pick up this pattern faster. As the
workload otherwise consists of extending writes only, prefetching may not help, as prefetching
currently does not do preallocation of blocks.
In conclusion, given the access pattern distribution, the prefetch mechanism performs better for
more than 80% of the common access patterns in scientific and engineering
applications.
2.5
Adaptive communication buffer optimization results
This optimization targets large collective calls. We show the impact of this optimization on
the b_eff_io types that use collective calls – 0, 1 and 4 (Figure 19). The readings have been
taken on 32 nodes of PARAM Padma (1 process/node) configured with 4 IO servers.
Only type 1 shows some improvement, while there is minimal impact on the other patterns.
Figure 19: Impact of the adaptive buffer optimization on the b_eff_io access types
that have collective IO calls
The impact of this optimization on BTIO is shown in Table 4.

Num Procs                      Class A (25 procs)
PFS                            1255 secs
PFS with buffer optimization   1215 secs

Table 4: BTIO results on 32 nodes; PFS configurations with 4 IO servers
2.6
Design of large data structures – enhanced R-B trees – results
The design of data structures with a large number of entries has a significant impact on the
performance. This is demonstrated by the iotest benchmark. In the benchmark, a large file is
consecutively constructed in 64 iterations. The performance results and the final file size are
shown in Figure 20. In this configuration, PFS contains 2 IO servers and there is 1 client
node. In the base PFS, appending successive IO blocks takes incrementally more
time due to the linear block translation list. The PFS with the R-B tree implementation shows
that each successive append to the file takes the same time irrespective of the position of the append.
Figure 20: Iotest benchmark to show impact of R-B tree organization of block
translation entries in PFS
2.7
Unified buffering in IO server results
To implement the prefetching mechanism, unified buffering in the IO server is required to
maintain the correctness of the file system. Even though it does not give a significant
performance gain, it results in better memory resource utilization than in the base
implementation.
3
Integrated version results
The integrated version of C-DAC PFS with all the optimizations has been run on a large
cluster configuration. While the individual optimizations on the base PFS show
performance improvement, the integrated version does not show significant improvement
for types other than 0. Upon investigation, it is observed that the optimizations are not
totally independent and can have side effects on one another.

The adaptive buffer optimization models the IO access as though all the data comes
from the disk, while some of the data could have been prefetched. This can
introduce inaccuracies in the models for maintaining an efficient IO pipeline.
Similarly, prefetching also introduces inaccuracies in the IO pipeline.

Similarly, coalescing can introduce inaccuracies in the models of the other optimizations.³
Integrating multiple optimizations so that there are no negative interactions between them is
not a well-understood subject and needs further investigation. The performance results are
given in Figure 21. The runs have been performed on PARAM Padma. The client nodes
run 1 process/node and the ratio of client to IO nodes is maintained at 4:1. In the following
figures, AxB denotes A client nodes and B IO nodes.
³ Even in the compiler area there has been some work to represent optimizations formally and to represent
their interactions; in practice, the many optimizations are ordered based on experience without any significant
understanding of the interactions between them.

Figure 21: Performance figures of the integrated C-DAC PFS and the base PFS
for b_eff_io (one panel each for types 0, 1, 2, 3 and 4)
Chapter 7
6. CONCLUSION & FUTURE WORK
In this work, we motivate the need for a parallel file system in commodity clusters
targeted at parallel scientific applications. We also demonstrate a practical approach to
workload-driven optimizations of a parallel file system. The optimizations have been
motivated by the opportunities present in the architecture to support the targeted
workload, as well as by the base code we took for the parallel file system implementation.
We report on the design and implementation of newer optimizations for PFS, such as a novel
prefetching scheme. The multiple optimizations made to the base PFS show better
performance, establishing the usefulness of the techniques used. Furthermore, the work also
establishes that the PFS architecture performs better than the NFS architecture. The
thesis has also touched upon the practical aspects of implementation that need attention.
The optimized PFS now provides high performance parallel IO for applications written
using the MPI-IO interface by delivering good aggregate performance to the client nodes. The
system is currently deployed on a tera-scale cluster, PARAM Padma (ranked 171st in the July
2003 edition of the TOP 500 list), running scientific and engineering applications and providing
fast scratch space.
1
Future Work
The current C-DAC PFS has better performance and stability than its base implementation.
But it can still be used only as a fast scratch space, as it lacks enterprise features such as
online capacity expansion, backup, snapshots, etc. While it may have been justified to maintain
this separation so far, there is a compelling need to merge the capabilities. One such system
evolution has been that of IBM's GPFS, which evolved from PIOFS to Vesta to GPFS.
The current work could evolve in such a direction. This is enabled by the emergence of flexible,
modular file system architectures such as Lustre [32], which provide the necessary
infrastructural support for this activity.
The prefetching mechanism could also evolve to become more scalable, with better memory
management of the prefetch buffers and a local LRU replacement policy for the buffers.
Currently, the predictions are done at the client tier of the three-tier storage architecture.
Due to the de-clustering technique and the workload patterns, it may happen that small
strided access patterns at the client level actually form large sequential IO at the IO
server tier. A global access pattern analysis at the IO server could coalesce the IO as well as
the prefetch requests to reduce disk accesses.
REFERENCES
[1] High Performance Fortran. The official HPF-1 standard. Scientific Programming, 2(1-2):1-170, Spring-Summer 1993.
[2] The official SIO Low-Level API standard. Proposal for a common file system programming interface, version 1.0. http://www.pdl.cs.cmu/SIO/SIO.html, 1996.
[3] The MPI-IO Committee. MPI-IO: A Parallel File I/O Interface for MPI, Version 0.5. http://lovelace.nas.nasa.gov/MPI-IO, April 1996.
[4] Rew, R., and G. Davis. NetCDF: An interface for scientific data access. IEEE Computer Graphics and Applications, 10(4):76-82, July 1990.
[5] HDF5 – A New Generation of HDF. http://hdf.ncsa.uiuc.edu/HDF5
[6] Arpaci-Dusseau, et al. Cluster I/O with River: Making the fast case common. In Proceedings of IOPADS '99, May 1999.
[7] Nieuwejaar, Nils, et al. File access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1088, October 1996.
[8] Nieuwejaar, Nils, et al. The Galley Parallel File System. Parallel Computing, 23(4-5):447-476, June 1997.
[9] Thakur, Rajeev, et al. A case for using MPI's derived datatypes to improve IO performance. In Proceedings of SC98, November 1998.
[10] Garcia, Felix, et al. An Expandable Parallel File System Using NFS Servers. In Proceedings of VECPAR 2002, 2002.
[11] Madhyastha, Tara, et al. Exploiting Input/Output Access Pattern Classification. In Proceedings of SC97, 1997.
[12] Ibrahim F. Haddad. PVFS: A Parallel Virtual File System for Linux Clusters. Linux Journal, Volume 2000, Issue 80, November 2000.
[13] Sun Microsystems white paper. Sun Parallel File System. February 1998.
[14] Kotz, David, et al. Practical Prefetching Techniques for Parallel File Systems. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, 1991.
[15] PARAM Padma and PARAM 10000. http://www.cdacindia.com
[16] C-DAC. ParamNet-II product brochure. http://www.cdacindia.com/html/htdg/products.asp
[17] The MPI-2 specification. http://www.mpi-forum.org/docs/docs.html
[18] MPICH. A Portable MPI Implementation. http://www-unix.mcs.anl.gov/mpi/mpich/
[19] Thakur, Rajeev, et al. An extended two-phase method for accessing sections of out-of-core arrays. Scientific Programming, 5(4), Winter 1996.
[20] Prost, Jean-Pierre, et al. MPI-IO/GPFS, an optimized implementation of MPI-IO on top of GPFS. In Proceedings of Supercomputing 2001, 2001.
[21] Silberschatz, Avi, et al. Operating System Concepts. John Wiley & Sons, Inc., 2003.
[22] Corbett, Peter, et al. The Vesta File System. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
[23] Rabiner, Lawrence. A Tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.
[24] Vitter, Jeffrey, et al. Optimal Prefetching via Data Compression. Journal of the ACM, 43(5):771-793, September 1996.
[25] Kroeger, Tom, et al. Design and Implementation of a Predictive File Prefetching Algorithm. In USENIX Annual Technical Conference, 2001.
[26] Rabenseifner, Rolf, et al. Effective File-I/O Bandwidth Benchmark. In Proceedings of EUROPAR 2000, 2000.
[27] NAS Application I/O benchmark – BTIO. http://parallel.nas.nasa.gov/MPI-IO/btio/btio-download.html
[28] Wisniewski, Len, et al. Sun MPI I/O: Efficient I/O for parallel applications. In Proceedings of SC99, 1999.
[29] Anderson, D. Object Based Storage: A Vision. http://www.t10.org
[30] Bell, Timothy, et al. Text Compression. Pearson Education, February 1990.
[31] Cormen, T., et al. Introduction to Algorithms, Second Edition. MIT Press, 2001.
[32] Lustre home page. The Lustre book. http://www.lustre.org/docs.html
[33] C-DAC KSHIPRA product brochure. http://www.cdacindia.com/html/ssdgblr/hpccbsw.asp
[34] Raghvendran M. Internal position paper on C-DAC Terascale Computing Facility (CTSF) storage, 2004.
[35] TOP 500 website. http://www.top500.org
[36] PARKBENCH. http://www.performance.ecs.soton.ac.uk/index.html
[37] Intel Corporation. Paragon System User's Guide. April 1996.
[38] Reed, Daniel (editor). Scalable Input/Output – Achieving System Balance. MIT Press, 2004.