High-Performance Storage System for the LHCb Experiment
Sai Suman Cherukuwada, CERN
Niko Neufeld, CERN
IEEE NPSS RealTime 2007, FNAL

LHCb Background
• LHCb: Large Hadron Collider beauty
• One of the 4 major CERN experiments at the LHC
• Single-arm forward spectrometer
• b-physics: CP violation in the interactions of b-hadrons

LHCb Data Acquisition
• Level 0 Trigger: 1 MHz
• High Level Trigger: 2-5 kHz
• Written to disk at the experiment site
• Staged out to the tape centre
[Diagram: L0 Electronics, Readout Boards, High-Level Trigger Farm, Storage at Experiment Site]

Storage Requirements
• Must sustain write operations for 2-5 kHz of event data at ~30 kB/event, with peak loads up to twice as much
• Must sustain matching read operations for staging data out to the tape centre
• Must support reading of data for analysis tasks
• Must be fault tolerant
• Must scale easily to higher storage capacities and/or throughput

Architecture: Choices
• Cluster file system: independent servers with storage are combined over IP into a single namespace; Writer Farm nodes access the cluster file system transparently (unified namespace)
• Fully partitioned independent servers: each server runs a local file system on its local disks
[Diagram: Writer Farm Nodes connect over an IP network to either a Cluster File System or to Independent Servers]

Architecture: Shared-Disk File System Cluster
• HLT Farm nodes write data over IP using a custom protocol (a sketch of such a protocol message follows the Software slide below)
• Fault-tolerant, load-balanced Event Data Writer Service
• Servers connect to shared storage over a Fibre Channel network
• The SAN file system provides all servers with a consistent namespace on the shared storage
• Compute and storage components can be scaled independently
• Locking overhead is restricted to very few servers
• A unified namespace is simpler to manage
• The storage fabric delivers high throughput
[Diagram: Event Data Writer Clients, IP Network, Shared-Disk File System Cluster, Fibre Channel Network, shared storage]

Hardware: Components
• Dell PowerEdge 2950 Intel Xeon quad-core servers (1.6 GHz) with 4 GB FBD RAM
• QLogic QLE2462 4 Gbps Fibre Channel adapters
• DataDirect Networks S2A 8500 storage controllers with 2 Gbps host-side ports
• 50 x Hitachi 500 GB 7200 rpm SATA disks
• Brocade 200E 4 Gbps Fibre Channel switch

Hardware: Storage Controller
• DirectRAID: combines features of RAID3, RAID5, and RAID0 (8 + 1p + 1s)
• Very low performance impact during disk rebuild
• Large sector sizes (up to 8 kB) supported
• Eliminates host-side striping
[Chart: Read, Write, and Read+Write throughput (MB/s), Normal vs. Rebuild]
• IOZone file system benchmark with 8 threads writing 2 GB files each on one server
• Tested first in "Normal" mode with all disks in normal health, then in "Rebuild" mode with one disk in the process of being replaced by a global hot spare

Software
Layered stack, top to bottom:
• Writer Service (Discovery, I/O Threads, Failover Thread)
• GFS File System
• Linux Logical Volume Manager
• Linux Multipath Driver
• SCSI LUNs (Logical Units)
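The architecture slide above notes that HLT Farm nodes send event data to the Event Data Writer Service over IP using a custom protocol, and a later slide adds that each data chunk is entirely self-contained. The slides do not show the wire format, so the C++ sketch below is only an illustration of what such a self-contained chunk message might look like; every field name, the magic value, and the packChunk helper are assumptions made here, not the actual LHCb format.

    // Hypothetical sketch of a message header for a custom event-data protocol.
    // Field names and layout are illustrative only, not the actual LHCb format.
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // One self-contained "data chunk": everything needed to place the payload
    // in the right file at the right position, independent of earlier messages.
    struct ChunkHeader {
        uint32_t magic;        // protocol marker, e.g. 0x4C484342, purely illustrative
        uint32_t headerSize;   // size of this header in bytes
        uint64_t runNumber;    // identifies the run (and hence the output file)
        uint64_t seqNumber;    // per-connection sequence number, echoed in the acknowledgement
        uint64_t fileOffset;   // byte offset of the payload within the output file
        uint32_t payloadSize;  // number of payload bytes that follow the header
        uint32_t checksum;     // checksum over the payload (e.g. a CRC), left 0 here
    };

    // Serialise header + payload into one contiguous buffer ready for send().
    std::vector<char> packChunk(const ChunkHeader& h, const std::string& payload) {
        std::vector<char> buf(sizeof(ChunkHeader) + payload.size());
        std::memcpy(buf.data(), &h, sizeof(h));
        std::memcpy(buf.data() + sizeof(h), payload.data(), payload.size());
        return buf;
    }

    int main() {
        ChunkHeader h{};
        h.magic = 0x4C484342u;
        h.headerSize = sizeof(ChunkHeader);
        h.runNumber = 1234;
        h.seqNumber = 1;
        h.fileOffset = 0;
        std::string payload = "event payload bytes";
        h.payloadSize = static_cast<uint32_t>(payload.size());
        std::vector<char> msg = packChunk(h, payload);
        return msg.empty() ? 1 : 0;  // the buffer would now be handed to send()
    }

Because each message would carry its own run number, file offset, and length, a receiving Writer Service could apply it without per-connection state, which is what makes resending a chunk after a failover harmless (see the Failover slide later).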
Software: Shared-Disk File System
• Runs on RAID volumes exported from the storage arrays (called LUNs, or Logical Units)
• Can be mounted by multiple servers simultaneously
• A lock manager ensures consistency of operations
• Scales almost linearly up to 4 nodes (at least)
[Charts: per-node Read and Write throughput (MB/s) for Nodes 1-4; aggregate Read, Re-Read, Write, and Re-Write throughput (MB/s) for 1-4 servers]
• IOZone test with 8 threads, O_DIRECT I/O
• LUNs striped over 100+ disks
• 2 Gbps Fibre Channel connections to the disk array

Writer Service: Design Goals
• Enable a large number of HLT Farm servers to write to disk
• Write data to the shared-disk file system at close to maximum disk throughput
• Failover + failback with no data loss
• Load balancing between instances
• Write hundreds of concurrent files per server

Writer Service: Discovery
• Discovery and status updates are performed through multicast
• A Service Table maintains the current status of all known hosts
• The Service Table contents are constantly relayed to all connected Gaudi Writer Processes from the HLT Farm
[Diagram: Writer Services 1-3 exchange multicast messages and relay Service Table information to Gaudi Writer Processes 1 and 2]
(A sketch of such a discovery listener follows the throughput slides below.)

Writer Process: Writing
• Cache every event
• Send it to the Writer Service
• Wait for the acknowledgement
• Flush and free
(A sketch of this send-and-acknowledge loop, including failover replay, follows the throughput slides below.)

Writer Service: Failover
• Writer Processes are aware of all instances of the Writer Service
• Each data chunk is entirely self-contained
• Writing a data chunk is idempotent
• If a Writer Service fails, the Writer Process can reconnect and resend unacknowledged chunks
[Diagram: on a failed connection to Writer Service 1, Gaudi Writer Process 1 (1) connects to the next entry in its Service Table, (2) updates to the new Service Table, and (3) replays unacknowledged data chunks to Writer Service 2]

Writer Service: Throughput
• Cached concurrent write performance for large numbers of files is insufficient
• Large CPU and memory load (memory copy)
• O_DIRECT reduces CPU and memory usage
• Data need to be page-aligned for O_DIRECT
• Written event data are not aligned to anything
[Chart: CPU (sys %) and I/O (MB/s), Cached vs. O_DIRECT]
• Custom test writing 32 files per thread x 8 threads
• Write sizes varying from 32 bytes to 1 MB
• LUNs striped over 16 disks
• 2 Gbps Fibre Channel connections to the disk array

Writer Service: Throughput per Server
• Scales up with the number of clients
• Write throughput is within 3% of the maximum achievable through GFS
[Chart: write throughput (MB/s) for 1, 2, 4, 8, and 16 clients]
• Custom test writing event sizes ranging from 1 byte to 2 MB
• LUNs striped over 16 disks
• 2 Gbps Fibre Channel connections to the disk array
• 2 x 1 Gbps Ethernet connections to the server
• CPU utilisation ~7-10%
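The Writer Service: Throughput slide above notes that O_DIRECT reduces CPU and memory usage but requires page-aligned data, while the incoming event data are not aligned to anything. The C++ sketch below shows, under assumptions made here (4 kB alignment, a 1 MiB transfer, an illustrative file name), how a page-aligned buffer can be allocated with posix_memalign and written with O_DIRECT; it is a minimal illustration, not the Writer Service's actual I/O path.

    // Sketch of an O_DIRECT write with a page-aligned buffer.
    // With O_DIRECT the buffer address, file offset, and transfer size all have
    // to be suitably aligned; unaligned event data would first have to be copied
    // or padded into such a buffer. The target file system must support O_DIRECT.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // needed for O_DIRECT on Linux
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const size_t kAlign = 4096;        // page size on most Linux systems
        const size_t kBufSize = 1 << 20;   // 1 MiB transfer, a multiple of kAlign

        // Allocate a page-aligned buffer and fill it with (fake) event data.
        void* buf = nullptr;
        if (posix_memalign(&buf, kAlign, kBufSize) != 0) {
            perror("posix_memalign");
            return 1;
        }
        std::memset(buf, 0xAB, kBufSize);

        // O_DIRECT bypasses the page cache, avoiding the extra memory copy.
        int fd = open("odirect_test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        ssize_t written = write(fd, buf, kBufSize);
        if (written < 0) perror("write");
        else std::printf("wrote %zd bytes\n", written);

        close(fd);
        free(buf);
        return 0;
    }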
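The Discovery slide earlier describes Writer Service instances announcing their status via multicast, with a Service Table tracking all known hosts. As a rough sketch of that idea, the listener below joins an assumed multicast group, receives status datagrams, and records the most recent message per sender; the group address, port, and message contents are placeholders, and relaying the table to the Gaudi Writer Processes is not shown.

    // Minimal sketch of a multicast discovery listener maintaining a service table.
    // Group address, port, and message format are assumptions for illustration.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        const char* kGroup = "239.192.0.1";   // assumed multicast group
        const uint16_t kPort = 45000;         // assumed status port

        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) { perror("socket"); return 1; }

        int reuse = 1;
        setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(kPort);
        if (bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
            perror("bind"); return 1;
        }

        // Join the multicast group on which service instances announce themselves.
        ip_mreq mreq{};
        inet_pton(AF_INET, kGroup, &mreq.imr_multiaddr);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
            perror("IP_ADD_MEMBERSHIP"); return 1;
        }

        // Service table: sender address -> last status message received.
        std::map<std::string, std::string> serviceTable;

        char buf[1500];
        for (;;) {
            sockaddr_in from{};
            socklen_t fromLen = sizeof(from);
            ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0,
                                 reinterpret_cast<sockaddr*>(&from), &fromLen);
            if (n <= 0) continue;
            buf[n] = '\0';
            char host[INET_ADDRSTRLEN];
            inet_ntop(AF_INET, &from.sin_addr, host, sizeof(host));
            serviceTable[host] = buf;   // update status; relaying to writers not shown
            std::cout << "status from " << host << ": " << buf << "\n";
        }
    }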
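The Writing and Failover slides earlier describe the client side: cache each event, send it, wait for the acknowledgement, then flush and free; on a failure, reconnect to the next Writer Service in the Service Table and replay everything unacknowledged. The sketch below captures only that bookkeeping, assuming in-order acknowledgements (which the slides do not state) and hiding the transport behind a callback; it is not the actual Gaudi Writer Process code.

    // Sketch of the client-side send/acknowledge/replay logic.
    // Transport details are omitted; sendChunk stands in for the custom IP protocol.
    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    struct Chunk {
        uint64_t seqNumber;        // echoed back in the acknowledgement
        std::vector<char> bytes;   // self-contained: carries its file id and offset itself
    };

    class ChunkSender {
    public:
        explicit ChunkSender(std::function<void(const Chunk&)> sendChunk)
            : sendChunk_(std::move(sendChunk)) {}

        // Cache the event data and send it; keep it until it is acknowledged.
        void send(Chunk chunk) {
            unacknowledged_.push_back(std::move(chunk));
            sendChunk_(unacknowledged_.back());
        }

        // Acknowledgements arrive in order: flush and free everything up to seq.
        void onAck(uint64_t seq) {
            while (!unacknowledged_.empty() && unacknowledged_.front().seqNumber <= seq)
                unacknowledged_.pop_front();
        }

        // After reconnecting to the next Writer Service in the Service Table,
        // replay every unacknowledged chunk.
        void onReconnect() {
            for (const Chunk& c : unacknowledged_) sendChunk_(c);
        }

    private:
        std::function<void(const Chunk&)> sendChunk_;
        std::deque<Chunk> unacknowledged_;
    };

    int main() {
        ChunkSender sender([](const Chunk& c) {
            std::printf("sending chunk %llu (%zu bytes)\n",
                        static_cast<unsigned long long>(c.seqNumber), c.bytes.size());
        });
        sender.send({1, std::vector<char>(100)});
        sender.send({2, std::vector<char>(200)});
        sender.onAck(1);      // chunk 1 flushed and freed
        sender.onReconnect(); // chunk 2 replayed to the next Writer Service
        return 0;
    }

Idempotent chunk writes are what make the blind replay in onReconnect() safe: a chunk that had already been written before the failure is simply written again to the same place.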
Conclusions & Future Work
• A solution that offers high read and write throughput with minimal overhead
• Can be scaled up easily with more hardware
• Failover with no performance hit
• A more sophisticated "Trickle" load-balancing algorithm is being prototyped
• Maybe worth implementing a full-POSIX file system version someday?

Thank You

Writer Service
• Linux-HA is not suited for load balancing
• Linux Virtual Server is not suited for write workloads
• NFS in sync mode is too slow; async mode can lead to information loss on failure
[Chart: write throughput (MB/s), NFS async vs. NFS sync]