Large Scale Parallel File System and Cluster Management
ICT, CAS

About ICT, CAS
• Institute of Computing Technology, Chinese Academy of Sciences
• The first (from 1958) and largest national IT research institute in China
• The largest graduate school of Computer Science in China
• Builder of most of the Chinese systems in the HPC TOP 500
• Focus on computing system architecture: CPU, Compiler, Network, Grid, HPC and Storage

Storage Centre of ICT
• Founded in 2001
• Leader: Dr. Xu Lu (from HP Lab)
• Storage for scientific computing
  – BWFS: parallel cluster file system
  – Service on Demand system: storage-based cluster management system
• Storage for business computing
  – VSDS: virtual storage research project
  – Backup / Virtual Computing……

The Storage Bottleneck of Clusters
• NFS (Network File System)
  – Most widely used in clusters to provide shared data access
  – Simple, easy to use and easy to manage
• Scalability Problem
  – Multiple NFS servers mean multiple name spaces
  – Hard to extend in capacity
  – Performance does not increase with capacity
  – Poor performance in I/O-intensive computing
  – Weak MS Windows support
• Parallel Access Problem
[Chart: data throughput (KB) versus number of compute nodes (1, 2, 4, 8, 16, 32); series labelled 4k, 8k, 32k, 64k, 1M, 2M]

What's BWFS
• Parallel network file system
  – Supports multiple storage appliances (8-128) in a single name space (up to 512 TB)
  – Separates data access from meta-data access so that different storage appliances can be accessed in parallel
• Global name space shared by clients on different platforms
  – Fully compatible with NFS (not 100% POSIX)
  – Supports data sharing between Linux and Windows clients
  – Supports IA32, IA64 and x86_64 hardware platforms

What's BWFS
• Centralized Management
  – Web-based management for the storage appliances and the storage sub-system
  – Integrated client management with the Service on Demand system
• Online extension
  – Add storage appliances to increase capacity without stopping the application
  – New data is automatically striped across all the storage appliances for high performance

Data Access on NFS
[Diagram labels: Meta-Data, User Data, Application Server (two), Storage Appliance]

Data Access on BWFS
[Diagram labels: Meta-Data, User-Data, Meta-data controller, Node server (two), Storage device (two)]

Bandwidth of BWFS
[Charts: two panels, "write large files (20G per node, 1MB record size)" and "read large files (20G per node, 1MB record size)"; aggregate bandwidth (MB/s) versus number of client nodes (1, 2, 4, 8, 16); series 1SN, 2SN, 4SN, NFS]

Paradigm Epos3 (China Petrol, Xinjiang)
[Chart: computing capacity (lines/hour) versus number of nodes (32, 64, 96, 128); series BWFS, NAS9500, RackServer+DawningNFS]

Paradigm Disco (China Petrol, Xinjiang)
[Chart: average run time (seconds) versus number of nodes (one job per node, 1 to 15); series RaidsysNFS, BWFS, NAS8500, NAS9500, linear trend (BWFS)]

Management Interface

Service on Demand System
• Initially developed as a subsystem of BWFS to provide cluster management
• Reduces management work, especially during system deployment
• Increases availability when storage components fail
• Enables fast scheduling in large server farms with multiple clusters
• Boots the system directly from the BWFS storage appliance, with no need for local hard disks

Traditional Cluster Deployment System
[Diagram labels: System image, Hard disk (six), 20 mins]

Shortcoming 1: Inefficiency in Scheduling
[Diagram labels: System image, System image 2, Hard disk (six), 20 mins]

Shortcoming 2: Inefficiency in Maintenance
• Hard disk errors account for 30%-50% of all computer system errors
[Diagram labels: System image, System image 2, Hard disk (six)]

Shortcoming 3: Inefficiency in Capacity
• A 5GB system on a 74GB hard disk
• Disks keep getting larger, but system images are kept small to reduce deployment time
[Diagram labels: System image, System image 2, Hard disk (six)]

Service on Demand System
• Diskless OS boot over TCP/IP
  – Virtual SCSI disk to support Windows and Linux
  – Fully compatible with applications
• Provides high-performance snapshots to support fast cloning of system images
  – Copy on Write when the system image is modified
  – Online backup of system images using snapshots
• Automatic takeover of failed clients
• Integrated monitoring engine to support automatic scheduling or adaptive computing (still under research)

Service on Demand System
[Diagram labels: Service 1, Service 2, Service N, Network, Map to Local Disk, User, Storage Appliance, Application Node]

Fast Deployment and Schedule
[Diagram labels: Paradigm Services, CGG Services, Web, Email, System (two), Paradigm Image, Paradigm Snapshot (five), CGG Image, CGG Snapshot]

Easy to maintain
[Diagram labels: Maintenance, System Image, System Snapshot (five)]

Management UI

[Deployment diagram labels:
  – SERVER: 73GB×3 hard disks, 4GB memory, 2 CPU
  – Large-memory node
  – Compute nodes, 17 units
  – 73G hard disk, 2 CPU, 4GB memory
  – 4 CPU, 8GB memory
  – 4T disk array
  – Service network: Gigabit Ethernet
  – Note: DVD burner attached
  – InfiniBand
  – Deployment and management network: 100 Mbit Ethernet
  – Virtual storage management server (the physical machine is the Console)
  – 3T disk array ×3
  – IP SAN, Gigabit
  – Platform management access LAN
  – Deployment system appliance, 1TB
  – Internet
  – Spare node: 4 CPU, 4GB memory
  – Console (Dawning PC)
  – Loongson NC, two units
  – Dawning PC]

Thanks!
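
Appendix sketch: copy-on-write cloning of system images

The Service on Demand slides say that system images are cloned through snapshots, with Copy on Write when an image is modified. The slides do not describe the mechanism, so the following is only a minimal illustration of the general technique in Python. The class names `BaseImage` and `Snapshot`, the block-level layout and the sample data are all assumptions made for this sketch, not the actual Service on Demand implementation.

```python
class BaseImage:
    """Hypothetical read-only system image, stored as a sequence of blocks."""

    def __init__(self, blocks):
        self._blocks = tuple(blocks)  # immutable: clones never change it

    def __len__(self):
        return len(self._blocks)

    def read(self, index):
        return self._blocks[index]


class Snapshot:
    """Hypothetical per-node clone that keeps only the blocks it has modified."""

    def __init__(self, base):
        self._base = base
        self._delta = {}  # block index -> private copy written by this clone

    def read(self, index):
        # Modified blocks come from the private delta, all others from the base.
        if index in self._delta:
            return self._delta[index]
        return self._base.read(index)

    def write(self, index, data):
        # Copy on write: the shared base image is left untouched.
        if not 0 <= index < len(self._base):
            raise IndexError("block index out of range")
        self._delta[index] = data

    def private_blocks(self):
        return len(self._delta)


# Two nodes boot from clones of the same image; creating a clone copies no data.
base = BaseImage([b"boot", b"kernel", b"libs", b"apps"])
node_a = Snapshot(base)
node_b = Snapshot(base)

# Node A modifies one block; node B and the base image are unaffected.
node_a.write(3, b"apps-v2")
```

In this sketch a new clone costs no storage until it writes, which is one way a scheme of this kind could avoid copying a full image to every node's disk.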