High-Performance Storage System for the LHCb Experiment
Sai Suman Cherukuwada, CERN
Niko Neufeld, CERN
IEEE NPSS RealTime 2007, FNAL

LHCb Background
- Large Hadron Collider beauty: one of the 4 major CERN experiments at the LHC
- Single-arm forward spectrometer
- b-physics: CP violation in the interactions of b-hadrons

LHCb Data Acquisition
- Level 0 readout: 1 MHz
- High Level Trigger: 2-5 kHz
- Data are written to disk at the experiment site, then staged out to the tape centre
- Data path: L0 Electronics -> Readout Boards -> High-Level Trigger Farm -> Storage at Experiment Site

Storage Requirements
- Must sustain write operations for 2-5 kHz of event data at ~30 kB/event, with peak loads up to twice as much
- Must sustain matching read operations for staging data out to the tape centre
- Must support reading of data for analysis tasks
- Must be fault tolerant
- Must scale easily to higher storage capacities and/or throughput

Architecture: Choices
- Cluster File System: combines all servers over IP into a single namespace; HLT Farm nodes access the cluster file system transparently (unified namespace over an IP network)
- Fully Partitioned Independent Servers: independent servers with their own storage; HLT Farm nodes are bound to specific servers; each server runs a local file system on its local disks

Architecture: Shared-Disk File System Cluster
- HLT Farm nodes (Event Data Writer clients) write data over IP to a fault-tolerant, load-balanced Event Data Writer Service using a custom protocol (see the chunk sketch after the Software slides)
- Writer servers connect to the shared storage over a Fibre Channel network
- A SAN file system provides all servers with a consistent namespace on the shared storage
- Compute and storage components can be scaled independently
- Locking overhead is restricted to very few servers
- A unified namespace is simpler to manage
- The storage fabric delivers high throughput

Hardware: Components
- Dell PowerEdge 2950 quad-core servers
- QLogic QLE2462 4 Gbps Fibre Channel adapters
- DataDirect Networks S2A 8500 storage controllers with 2 Gbps host-side ports
- 50 x Hitachi 500 GB 7200 rpm SATA disks
- Brocade 200E 4 Gbps Fibre Channel switch

Hardware: Storage Controller
- DirectRAID: combines features of RAID3, RAID5, and RAID0 (8 + 1p + 1s)
- Very low impact on performance during disk rebuild
- Large sector sizes (up to 8 kB) supported
- Eliminates host-side striping
[Chart: controller throughput in MB/s for read, write, and read+write, normal operation vs. rebuild]

Software (storage stack, top to bottom)
- Writer Service: discovery, I/O threads, failover thread
- GFS file system
- Linux Logical Volume Manager
- Linux multipath driver
- SCSI LUNs (Logical Units)

Software: Shared-Disk File System
- Runs on RAID volumes exported from the storage arrays (called LUNs, or Logical Units)
- Can be mounted by multiple servers simultaneously
- A lock manager ensures consistency of operations
- Scales almost linearly up to (at least) 4 nodes; the figures are for GFS
[Charts: aggregate GFS read/write throughput (kB/s) for 1 to 4 nodes, and per-node read/write throughput for a 2-node group]
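The slides describe the custom IP protocol only at a high level: HLT Farm nodes send self-contained data chunks to the Writer Service, and writing a chunk is idempotent so unacknowledged chunks can simply be replayed after a failover. The actual message layout is not given; the following is a minimal C++ sketch of what such a chunk header could look like. All names and fields (ChunkHeader, runNumber, fileOffset, and so on) are hypothetical illustrations, not the real LHCb protocol.

```cpp
#include <cstdint>

// Hypothetical on-wire header for one self-contained data chunk.
// Because the header carries the destination file and the absolute
// offset within it, writing the same chunk twice (e.g. during a
// failover replay) produces the same result: the write is idempotent.
struct ChunkHeader {
    uint32_t magic;          // protocol marker / version
    uint32_t runNumber;      // run this event data belongs to
    uint64_t chunkId;        // unique id, used to acknowledge the chunk
    char     fileName[256];  // target file on the shared-disk file system
    uint64_t fileOffset;     // absolute byte offset of the payload in that file
    uint32_t payloadSize;    // number of payload bytes following the header
    uint32_t checksum;       // payload checksum, verified before writing
};

// A Gaudi Writer Process would keep each chunk queued locally until the
// Writer Service acknowledges its chunkId; on failover it reconnects to the
// next Writer Service in its service table and resends the queued chunks.
```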
Writer Service: Design Goals
- Enable a large number of HLT Farm servers to write to disk
- Write data to the shared-disk file system at close to maximum disk throughput
- Failover and failback with no data loss
- Load balancing between Writer Service instances
- Write hundreds of concurrent files per server

Writer Service: Discovery
- Discovery and status updates are performed through multicast
- A service table maintains the current status of all known hosts
- Service table contents are constantly relayed to all connected Gaudi Writer Processes on the HLT Farm
- (A minimal multicast membership sketch is given at the end of these notes)

Writer Service: Failover
- Gaudi Writer Processes are aware of all instances of the Writer Service
- Each data chunk is entirely self-contained, and writing a data chunk is idempotent
- If a Writer Service instance fails, the Gaudi Writer Process can reconnect and resend unacknowledged chunks: 1. connect to the next entry in the service table, 2. receive the updated service table, 3. replay the unacknowledged data chunks

Writer Service: Throughput
- Cached concurrent write performance is suboptimal for large numbers of files: large CPU and memory load (memory copy)
- O_DIRECT reduces CPU and memory usage, but data need to be page-aligned for O_DIRECT
- Written event data are not aligned to anything
- (An O_DIRECT alignment sketch is given at the end of these notes)
[Chart: CPU load (sys %) and I/O throughput (MB/s) for cached vs. O_DIRECT writes]

Writer Service: Throughput per Server
- Scales up with the number of clients
- Write throughput is within 3% of the maximum achievable through GFS
[Chart: write throughput (MB/s) for 1, 2, 4, 8, and 16 clients]

Thank You

Writer Service (backup)
- Linux-HA is not suited for load balancing
- Linux Virtual Server is not suited for write workloads
- NFS in sync mode is too slow; async mode can lead to information loss on failure
- Cached operations do not …
[Chart: throughput (MB/s) for NFS async vs. NFS sync]
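The discovery slide states only that Writer Service instances announce themselves and exchange status over multicast, and that the resulting service table is relayed to the Gaudi Writer Processes. The concrete group address, port, and message format are not specified; the sketch below shows the generic POSIX mechanics of joining a multicast group and receiving such announcements, with all constants assumed for illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

// Minimal receiver for multicast service announcements. Illustrative only:
// the group address, port, and message contents are not from the slides.
int main() {
    const char* kGroup = "239.0.0.1";  // assumed multicast group
    const int   kPort  = 30001;        // assumed announcement port

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    int reuse = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(kPort);
    if (bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    // Join the group so that announcements from all Writer Service
    // instances are delivered to this socket.
    ip_mreq mreq{};
    mreq.imr_multiaddr.s_addr = inet_addr(kGroup);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); return 1;
    }

    // Each datagram would carry one status update used to refresh the
    // local service table (host, load, alive/dead, ...).
    char buf[1500];
    for (;;) {
        ssize_t n = recv(sock, buf, sizeof(buf) - 1, 0);
        if (n <= 0) break;
        buf[n] = '\0';
        std::printf("announcement: %s\n", buf);
    }
    close(sock);
    return 0;
}
```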
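The throughput slide notes that O_DIRECT avoids the page-cache memory copy but requires page-aligned buffers, while the incoming event data are not aligned to anything. One common way to reconcile the two (not necessarily what the LHCb writer does) is to stage each chunk in a page-aligned buffer and pad the transfer up to the alignment boundary. The sketch below shows only the alignment mechanics with posix_memalign and O_DIRECT; the file path and alignment value are assumptions.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // O_DIRECT is a GNU/Linux extension
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Write a (possibly unaligned) event-data chunk with O_DIRECT by staging it
// in a page-aligned buffer. Illustrative sketch only; error handling and the
// treatment of the padded tail when the file is finalised are simplified.
int main() {
    const size_t kAlign = 4096;                  // typical page/sector alignment
    const char*  kPath  = "/gfs/run_00001.dat";  // assumed path on the GFS mount

    int fd = open(kPath, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char event[] = "unaligned event data ...";  // stands in for one chunk
    size_t len = sizeof(event);

    // O_DIRECT requires the buffer address, the file offset, and the transfer
    // length to be aligned, so round the length up to the alignment boundary.
    size_t padded = ((len + kAlign - 1) / kAlign) * kAlign;

    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, padded) != 0) { perror("posix_memalign"); return 1; }
    std::memset(buf, 0, padded);
    std::memcpy(buf, event, len);   // one staging copy, but no page-cache copy

    if (write(fd, buf, padded) < 0) perror("write");

    free(buf);
    close(fd);
    return 0;
}
```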