Storage Challenges for Petascale Systems
Dilip D. Kandlur
Director, Storage Systems Research
IBM Research Division
Outline
- Storage technology trends
- Implications for high performance computing
- Achieving petascale storage performance
- Manageability of petascale systems
- Organizing and finding information
Extreme Scaling
- There have been recent inflection points in the CAGR of processing and storage – in the wrong direction!
- Programs like HPCS are aimed at maintaining throughput at or above the CAGR of Moore's Law in spite of these technology trends.
[Figure: Processor frequency (GHz) vs. initial ship date, 2000-2007 – the 2002 roadmap projected ~35% yr/yr growth, revised in 2003 to 10-15% yr/yr; data points include Pentium 4 (180 nm), Pentium 4 (130 nm), and Prescott (90 nm).]
[Figure: Disk areal density trend 2000-2010 – areal density (Gb/sq.in.) growth slowing from ~100% CAGR to 25-35% CAGR, plotted with maximum internal bandwidth (MB/s), by year of production.]
Peta-scale systems: DARPA HPCS, NSF Track 1
- HPCS goal: "Double value every 18 months" in the face of flattening technology curves
- NSF Track 1 goal: at least a sustained petaflop for actual science applications
- New technologies like multi-core will keep processing power on the rise, but will make storage relatively more expensive
- Maintaining "balanced system" scaling constants for storage will be expensive (see the sketch after the table below)
  – Storage bandwidth: 0.001 bytes/second/flop; capacity: 20 bytes/flop
  – Cost per drive will stay the same order of magnitude, so proportionally storage will become a higher fraction of total system cost
- How do we make a system with 10x today's number of moving parts reliable?
System                   Year   TF     GB/s   Nodes   Cores    Storage    Disks
Blue P                   1998   3      3      1464    5856     43 TB      5040
White                    2000   12     9      512     8192     147 TB     8064
Purple/C                 2005   100    122    1536    12288    2000 TB    11000
NSF Track 1 (possible)   2011   2000   2000   10000   300000   40000 TB   50000
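As a quick back-of-the-envelope check, the table's storage columns follow from the "balanced system" constants above. A minimal sketch in Python, using only the 0.001 B/s per flop/s and 20 B per flop/s figures quoted on this slide:

    # Balanced-system storage sizing from the constants quoted above:
    #   bandwidth ~ 0.001 bytes/s per flop/s, capacity ~ 20 bytes per flop/s.
    BW_PER_FLOP = 0.001    # bytes/s of storage bandwidth per flop/s
    CAP_PER_FLOP = 20.0    # bytes of storage capacity per flop/s

    def balanced_storage(peak_flops):
        """Return (bandwidth in GB/s, capacity in TB) for a given peak flop rate."""
        return peak_flops * BW_PER_FLOP / 1e9, peak_flops * CAP_PER_FLOP / 1e12

    for name, tflops in [("Purple/C", 100), ("NSF Track 1", 2000), ("HPCS (6 PF)", 6000)]:
        bw, cap = balanced_storage(tflops * 1e12)
        print(f"{name:12s}: ~{bw:,.0f} GB/s bandwidth, ~{cap:,.0f} TB capacity")

For Purple/C this gives roughly 100 GB/s and 2000 TB, and for a 6 PF HPCS system roughly 6 TB/s and 120 PB, consistent with the targets on the following slides.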
HPCS Storage
[Figure: Projected growth, 1995-2015 (log scale) – CPU performance (4 TF to 100 TF to 6 PF), number of disk drives (5,000 to 11,000 to 165,000), file system capacity, and file system throughput (3.6 GB/s to 120 GB/s to 6 TB/s).]
Fast
- 300,000 processors
- 150,000 disk drives
- 5 TB/sec sequential bandwidth
- 30,000 file creates/sec on one node
- Capable of running fsck on 1 trillion files
Robust
- Fix 3 or more concurrent errors
- Detect "undetected" errors
- Only minor slowing during disk rebuild
- Detect and manage slow disks
Manageable
- Unified manager for files, storage
- End-to-end discovery, metrics, events
- Managing system changes, problem fixes
- GUI scaled to large clusters
GPFS Parallel File System
- Cluster: thousands of nodes, fast reliable communication, common admin domain.
- Shared disk: all data and metadata on disk accessible from any node, coordinated by a distributed lock service.
- Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks.
[Diagram: GPFS file system nodes connected over a data/control IP network to GPFS disk server nodes (VSD on AIX, NSD on Linux – an RPC interface to raw disks), with a separate control IP network and a disk FC network.]
Scaling GPFS
- HPCS file system performance and scaling targets
  – "Balanced system" DOE metrics (0.001 B/s/F, 20 B/F)
    • This means 2-6 TB/s throughput, 40-120 PB storage!!
  – Other performance goals
    • 30 GB/s single node to single file for data ingest
    • 30K file opens per second on a single node
    • 1 trillion files in a single file system
    • Scaling to 32K nodes (OS images)
Extreme Scaling: Metadata
- Metadata: the on-disk data structures that represent hierarchical directories, storage allocation maps, …
- Why is it a problem? Structural integrity requires proper synchronization, and performance is sensitive to the latency of these (small) I/Os.
- Techniques for scaling metadata
  – Scaling synchronization (distributing the lock manager) – see the sketch after this list
  – Segregating metadata from data to reduce queuing delays
    • Separate disks
    • Separate fabric ports
  – Different RAID levels for metadata to reduce latency, or solid-state memory
  – Adaptive metadata management (centralized vs. distributed)
  – GPFS provides for all of these to some degree; work is always ongoing
  – Sensible application design can make a big difference!
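To make the "distributing the lock manager" idea concrete, here is a minimal sketch of one common approach: partitioning lock ownership across manager nodes by hashing the locked object's id. This illustrates the general technique only; the node names and hashing scheme are assumptions, and it is not GPFS's actual token protocol.

    # Sketch: spread lock ownership over several manager nodes by hashing the
    # locked object's id, so no single node serializes all lock traffic.
    # Illustrative only -- not the GPFS token manager protocol.
    import hashlib

    class DistributedLockManager:
        def __init__(self, manager_nodes):
            self.managers = list(manager_nodes)              # e.g. hostnames of manager nodes
            self.held = {node: set() for node in self.managers}

        def manager_for(self, object_id):
            """Pick the manager node responsible for this object's lock."""
            h = int(hashlib.sha1(str(object_id).encode()).hexdigest(), 16)
            return self.managers[h % len(self.managers)]

        def acquire(self, object_id):
            node = self.manager_for(object_id)
            if object_id in self.held[node]:
                return False                                 # already held; caller waits or retries
            self.held[node].add(object_id)
            return True

        def release(self, object_id):
            self.held[self.manager_for(object_id)].discard(object_id)

    dlm = DistributedLockManager(["mgr0", "mgr1", "mgr2", "mgr3"])
    inode = 123456
    print(dlm.manager_for(inode), dlm.acquire(inode))        # all requests for this inode go to one node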
Data Loss in Petascale Systems
- Petaflop systems require tens to hundreds of petabytes of storage
- Evidence exists that manufacturer MTBF specs may be optimistic (Schroeder & Gibson)
- Evidence exists that failure statistics may not be as favorable as a simple exponential distribution
  – A hard error rate of 1 in 10^15 bits means roughly one rebuild in 30 will hit an error
- RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
  – Rebuild of an 8+P array of 500 GB drives reads 4 TB, or 3.2×10^13 bits (see the check below)
  – Simulations over file system size, drive MTBF, and failure probability distribution show a 4%-28% chance of data loss over a five-year lifetime for an 8+2P code
  – Stronger RAID (8+3P) increases MTTDL by 3-4 orders of magnitude for an extra 10% overhead
  – Stronger RAID is sufficiently reliable even for unreliable (commodity) disk drives
[Figure: MTTDL in years (log scale) for a 20 PB system, by configuration – 8+2P and 8+3P codes, 300K-hour and 600K-hour drive MTBF, exponential and Weibull failure distributions – with the 4%, 16%, and 28% five-year data-loss probabilities marked.]
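A quick sanity check of the "one rebuild in 30" figure, using only the numbers quoted on this slide (a sketch; the Poisson approximation is an assumption):

    # An 8+P rebuild of 500 GB drives reads the 8 surviving drives' data
    # (~4 TB = 3.2e13 bits); with an unrecoverable hard error rate of
    # 1 per 1e15 bits read, how often does a rebuild hit an error?
    import math

    drive_bytes = 500e9                    # 500 GB drives
    bits_read = 8 * drive_bytes * 8        # 8 drives * 500 GB * 8 bits/byte = 3.2e13 bits
    ber = 1e-15                            # 1 unrecoverable error per 1e15 bits read

    expected_errors = bits_read * ber                   # ~0.032 errors per rebuild
    p_at_least_one = 1 - math.exp(-expected_errors)     # Poisson approximation, ~0.031

    print(f"expected errors per rebuild: {expected_errors:.3f}")
    print(f"P(>=1 error per rebuild):    {p_at_least_one:.3f}  (~1 in {1 / p_at_least_one:.0f})")

This reproduces the roughly one-in-30 rebuild error rate cited above.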
GPFS Software RAID
- Implement software RAID in the GPFS NSD server
- Motivations
  – Better fault-tolerance
  – Reduce the performance impact of rebuilds and slow disks
  – Eliminate costly external RAID controllers and storage fabric
  – Use the processing cycles now being wasted in the storage node
  – Improve performance by file-system-aware caching
- Approach
  – Storage node (NSD server) manages disks as JBOD
  – Use stronger RAID codes as appropriate (e.g. triple parity for data and multi-way mirroring for metadata)
  – Always check parity on read (a minimal sketch of this idea follows)
    • Increases reliability and prevents performance degradation from slow drives
  – Checksum everything!
  – Declustered RAID for better load balancing and non-disruptive rebuild
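As a minimal illustration of "always check parity on read", here is a sketch using simple XOR parity over a stripe. The stripe layout is an assumption for illustration only; the actual GPFS software RAID uses stronger erasure codes (e.g. triple parity) and end-to-end checksums.

    # Verify parity on every read, and rebuild a missing strip from the others.
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def write_stripe(data_strips):
        """Return the data strips plus an XOR parity strip."""
        return list(data_strips) + [xor_blocks(data_strips)]

    def read_verified(stripe):
        """On every read, recompute parity and compare before returning data."""
        data, parity = stripe[:-1], stripe[-1]
        if xor_blocks(data) != parity:
            raise IOError("parity mismatch: corrupt or stale strip detected")
        return data

    def reconstruct(stripe, missing_index):
        """Rebuild one missing strip (e.g. on a failed or slow disk) from the rest."""
        others = [s for i, s in enumerate(stripe) if i != missing_index]
        return xor_blocks(others)

    stripe = write_stripe([b"aaaa", b"bbbb", b"cccc"])
    assert read_verified(stripe) == [b"aaaa", b"bbbb", b"cccc"]
    assert reconstruct(stripe, missing_index=1) == b"bbbb"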
Declustered RAID
[Diagram: Partitioned RAID vs. declustered RAID – the same 16 logical tracks placed on 20 physical disks, either confined to fixed arrays (partitioned) or spread across all 20 disks (declustered).]
Rebuild Work Distribution
[Diagram: Relative read and write throughput per disk for rebuilding a failed disk.]
Rebuild (2)
- Upon the first failure, begin rebuilding the tracks that are affected by the failure (large arrows).
- Many disks are involved in performing the rebuild, so the work is balanced, avoiding hot spots (illustrated in the sketch below).
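A toy comparison of how rebuild reads are spread in the two layouts. The 20-disk count echoes the diagrams above; the 4+P stripe width and the random declustered placement are illustrative assumptions, not the actual GPFS declustering algorithm.

    # Toy comparison of rebuild read load: partitioned vs. declustered placement.
    import random

    NUM_DISKS, STRIPE_WIDTH, NUM_STRIPES = 20, 5, 1000
    rng = random.Random(0)

    # Partitioned: four fixed 5-disk arrays; each stripe lives entirely in one array.
    partitioned = [tuple(range(5 * (s % 4), 5 * (s % 4) + 5)) for s in range(NUM_STRIPES)]
    # Declustered (toy): each stripe picks 5 of the 20 disks at random.
    declustered = [tuple(rng.sample(range(NUM_DISKS), STRIPE_WIDTH)) for _ in range(NUM_STRIPES)]

    def rebuild_read_load(placement, failed_disk=0):
        """Blocks each surviving disk must read to rebuild the stripes that touched the failed disk."""
        load = [0] * NUM_DISKS
        for stripe in placement:
            if failed_disk in stripe:
                for d in stripe:
                    if d != failed_disk:
                        load[d] += 1
        return load

    for name, placement in (("partitioned", partitioned), ("declustered", declustered)):
        load = rebuild_read_load(placement)
        busy = [x for x in load if x > 0]
        print(f"{name:12s}: {len(busy):2d} disks share the rebuild reads, max per-disk load {max(busy)}")

In the partitioned case only the failed disk's array mates perform the rebuild reads; in the declustered case essentially every surviving disk contributes a small share, which is why rebuild causes only minor slowing.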
Declustered vs. Partitioned RAID
[Figure: Simulation results – data losses per year per 100 PB (log scale) vs. failure tolerance (1, 2, or 3 concurrent failures) for partitioned and distributed (declustered) layouts.]
Autonomic Storage Management – Making Complex Tasks Simple
IBM TotalStorage Productivity Center Standard Edition: a single application with modular components (Disk, Data, Fabric).
Ease of Use
- Streamlined installation and packaging
- Single user interface
- Single database
- Single set of services for consistent administration and operations
- Console enhancements
- Personalization
- TSM integration
Business Resiliency
- Integrated Replication Manager
- Metro, global, cascaded, and application disaster recovery
Policy-Based Storage Management
- End-to-End Data Path Explorer
- Integrated Storage Planner
- Configuration Change Rover
- Configuration Checker
- SAN best practices and SAN configuration validation
- Storage subsystem planning, fabric security planning, host planning (multi-path)
Integrated Management
Seamlessly integrate systems management across servers, storage, and network, and provide end-to-end problem determination and analytics capabilities.
[Diagram: An integrated Web 2.0 GUI over management functions – best practices, orchestration, systems knowledge, deployment, discovery, analytics, monitoring, reporting, and configuration – spanning the full stack (applications, middleware, operating systems, virtualization software, hardware) across server, network, storage, file system, and database resources.]
PERCS Management
- A unified, standards-based management for GPFS and PERCS storage
- A GUI designed for large-scale clusters, supporting PERCS scale
  – Also enables GPFS to satisfy commercial customers requiring ease of use
- The PERCS UI will support:
  – Information collection: asset tracking, end-to-end discovery, metrics, events
  – Management: system changes, problem fixes, configuration changes
  – Rich visualizations to help administrators maintain situational awareness of system status – essential for large systems
[Diagram: The PERCS GUI uses a CIM client to talk to a CIMOM; CIM data providers retrieve information from GPFS, the PERCS systems and storage CIM agents, a management model simulator, and a repository DB covering server, file system, and storage resources.]
Analytics
- Problem Determination and Impact Analysis
  – Root cause analysis: discover the finest-grain events that indicate the root cause of the problem
  – Symptom suppression: correlate alarms/symptoms caused by a common cause across the integrated infrastructure
- Bottleneck Analysis
  – Post-mortem, live, and predictive analysis
- Workload and Virtualization Management
  – Automatically monitor multi-tiered, distributed, heterogeneous or homogeneous workloads
  – Migrate virtual machines to satisfy performance goals
- Integrated Server, Storage and Network Allocation and Migration
  – Integrated allocation accounting for connectivity, affinity, flows, and ports based on performance workloads
- Disaster Management
  – Provides integrated server/storage disaster recovery support
Visualization
- Integrated management is centered around Topology Viewer capabilities based on Web 2.0 technologies
  – Data Path Viewer for applications, servers, networks, and storage
  – Progressive information disclosure
  – Semantic zooming
  – Information overlays
  – Mixed graphical and tabular views
  – Integrated historical and real-time reporting
The Changing Nature of Archive
Current archive: a data landfill
- Store and forget
- Not easily accessible; typically offline and offsite, with access time measured in days
- Not organized for usage; retained just in case it is needed
Emerging archive: leverage information for business advantage
- Readily accessible, with access time measured in seconds
- Indexed for effective discovery
- Mined for business value
Building Storage Systems Targeted at Archive
Scalability
- Scale to huge capacity
  – Exploit tiered storage with disk and tape
  – Leverage commodity disk storage
- Handle an extremely large number of objects
  – Support high ingest rates
  – Effect data management actions in a scalable fashion
Functionality
- Consistently handle multiple kinds of objects
- Manage and retrieve based on data semantics (e.g. logical groupings of objects)
- Support effective search and discovery
- Provide for compliance with regulations
Reliability
- Ensure data integrity and protection
- Provide media management and rejuvenation
- Support long-term retention
GPFS Information Lifecycle Management (ILM)
- GPFS ILM abstractions
  – Storage pool – a group of LUNs
  – Fileset – a subtree of a file system namespace
  – Policy – a rule for file placement, retention, or movement among pools
- ILM scenarios
  – Tiered storage – fast storage for frequently used files, slower storage for infrequently used files
  – Project storage – separate pools for each project, each with separate policies, quotas, etc.
  – Differentiated storage – e.g. place media files on media-friendly storage (QoS)
[Diagram: Applications on GPFS clients access files through the POSIX interface; GPFS placement policies direct files over the storage network into pools – the system pool plus gold, silver, and pewter data pools – within a GPFS file system (volume group). A GPFS manager node runs the cluster, lock, quota, allocation, and policy managers, connected via the GPFS RPC protocol.]
GPFS 3.1 ILM Policies
- Placement policies, evaluated at file creation
- Migration policies, evaluated periodically
- Deletion policies, evaluated periodically
[Diagram: the same GPFS client / manager node / storage pool architecture as on the previous slide.]
GPFS Policy Engine
- Migrate and delete rules scan the file system to identify candidate files
  – Conventional backup and HSM systems also do this
    • Usually implemented with readdir() and stat()
    • This is slow – random small-record reads, distributed locking
    • Can take hours or days for a large file system
- The GPFS policy engine uses an efficient sort-merge rather than slow readdir()/stat() (sketched below)
  – A directory walk builds the list of path names (readdir(), but no stat()!)
  – The list is sorted by inode number, merged with the inode file, then evaluated
  – Both list building and policy evaluation are done in parallel on all nodes
  – … > 10^5 files/sec per node!
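A conceptual sketch of the sort-merge scan described above, in Python. The record layout, field names, and the example rule are assumptions for illustration; this is not GPFS internals.

    # Instead of stat()-ing every path (random small reads), collect
    # (inode number, path) pairs from a readdir()-only walk, sort them by inode
    # number, and merge them with a sequential pass over the inode records.
    from dataclasses import dataclass

    @dataclass
    class InodeRecord:                 # simplified stand-in for an on-disk inode entry
        inode: int
        size: int

    def policy_scan(path_list, inode_table, rule):
        """path_list: (inode_number, path) pairs from a directory walk.
        inode_table: InodeRecord list; rule: predicate selecting candidate files."""
        paths = sorted(path_list)                              # sort by inode number
        records = sorted(inode_table, key=lambda r: r.inode)   # sequential inode scan order
        candidates, i = [], 0
        for inode_no, path in paths:                           # merge the two sorted streams
            while i < len(records) and records[i].inode < inode_no:
                i += 1
            if i < len(records) and records[i].inode == inode_no and rule(path, records[i]):
                candidates.append(path)
        return candidates

    # Example: select files larger than 1 GiB as migration candidates.
    table = [InodeRecord(7, 2 << 30), InodeRecord(9, 4096)]
    walk = [(9, "/proj/a.txt"), (7, "/proj/big.dat")]
    print(policy_scan(walk, table, lambda path, rec: rec.size > (1 << 30)))   # ['/proj/big.dat']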
Storage Hierarchies – the old way
Normally implemented one of two ways:
- Explicit control
  – an archive command (IBM TSM, Unitree)
  – copy into a special "archive" file system (IBM HPSS)
  – copy to an archive server (HPSS, Unitree)
  – … all of which are troublesome and error-prone for the user
- Implicit control through an interface like DMAPI
  – The file system sends "events" to the HSM system (create/delete, low space)
  – The archive system moves data and punches "holes" in files to manage space
  – An access miss generates an event; the HSM system transparently brings the file back
[Diagram: HPSS 6.2 API architecture – client cluster computers in the client domain use the HPSS API over an IP network; (1) the client issues an HPSS write or put to the HPSS core server, and (2) transfers the file to HPSS disk or tape over a TCP/IP LAN or WAN using an HPSS mover. The HPSS cluster (core server and movers) manages SAN disk, FC SAN tape libraries, metadata disks, and HSM control information.]
[Diagram: GPFS 3.1 and HPSS 6.2 DMAPI architecture – a GPFS session node runs the HSM processes and exchanges HSM control information with the HPSS core server (DB2) over an IP LAN; data transfers flow from the GPFS I/O nodes through the HPSS mover interface (or moverless SAN transfers) between the GPFS disk arrays in the GPFS cluster and the HPSS disk arrays and tape libraries in the HPSS cluster, along with tape-disk transfers.]
DMAPI Problems
- Namespace events (create, delete, rename)
  – Synchronous and recoverable
  – Each is multiple database transactions
  – They slow down the file system
- Directory scans
  – DMAPI low-space events trigger directory scans to determine what to archive
    • Can take hours or days on a large file system
  – Scans have little information upon which to make archiving decisions (what you get from "ls -l")
    • As a result, data movement policies are usually hard-coded and primitive
- Read/write managed regions
  – Block the user program while data is brought back from the HSM system
  – Parallel data movement isn't in the spec, but everyone implements it anyway
  – Data movement is actually the one thing about DMAPI worth saving
[Diagram: the same GPFS 3.1 and HPSS 6.2 DMAPI architecture as on the previous slide.]
GPFS Approach: "External Pools"
- External pools are really interfaces to external storage managers, e.g. HPSS or TSM
  – An external pool "rule" defines the script to call to migrate/recall/etc. files:
    RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [ OPTS 'options' ]
- The GPFS policy engine builds candidate lists and passes them to the external pool scripts
- The external storage manager actually moves the data
  – Using DMAPI managed regions (read/write invisible, punch hole)
  – Or using conventional POSIX APIs
GPFS ILM Demonstration
[Diagram: GPFS at SC'06 in Tampa, FL (1M active files on FC and SATA disks) connected via a 10 Gb link to HPSS at NERSC in Oakland, CA (archive on tapes with disk buffering), with high-bandwidth, parallel data movement across all devices and networks.]
Nearline Information – conceptual view
[Diagram: NFS/CIFS clients (via an NFS/CIFS server), TSM archive clients/APIs, and admin/search front a scale-out archiving engine (a GPFS cluster) with a global index and search capability; data migrates to TSM deep storage via DMAPI and the TSM archive client.]
- Provides the capability to handle extended metadata
  – Metadata may be derived from data content
  – Extended attributes: integrity code, retention period, retention hold status, and any application metadata
- Global index on content and extended-attribute metadata
- Allows for application-specific parsers (e.g., DICOM)
Summary
- Storage environments are moving from petabytes to exabytes
  – Traditional HPC
  – New archive environments
- Significant challenges for reliability, resiliency, and manageability
- Metadata becomes key for information organization and discovery