Data Recording Model at XFEL
CRISP 2nd Annual Meeting, March 18-19, 2013
Djelloul Boukhelef - XFEL

Outline
• Purpose and scope
• Hardware setup
• Software architecture
• Experiments & results
  – Network
  – Storage
• Summary & outlook

Purpose and present scope
• Build a prototype of a fully featured DAQ/DM/SC system
  – Select and install adequate hardware, develop the software, and test all of the system's properties: control, DAQ, DM, and SC
• The current prototype focuses on:
  – Data acquisition, pre-processing, formatting, and storage
  – Assessing the performance and stability of the hardware and software
    • Network: bandwidth (10 Gbps), UDP packet loss, TCP behavior, …
    • Processing: concurrent read, processing, and write operations, …
    • Storage: disk (write) performance, concurrent IO operations, …
• Software development
  – Application architecture: processing pipeline, communication, …
  – Design for performance, robustness, scalability, and flexibility

Hardware setup
• The 2D detector generates ~10 GB of image data per second
• Data is multiplexed onto 16 channels (10GbE): 1 GB / 1.6 sec = 640 MB/s per channel
• Lots of other slow data streams
• Details were presented at the IT&DM meeting in October

SOFTWARE ARCHITECTURE

Overview
• The current prototype consists of three software components:
  – Data feeder
    • Without the TB board: a train builder emulator feeds the PC layer with train data over UDP
    • With the TB board: feeds the TB with detector data
  – PC layer software: acquires, pre-processes, reduces, monitors, and formats data, then sends it to storage and SC
  – Storage service
• Device/DeviceServer model to build a distributed control system
  – PSR, flexibility, and configurability
• Data flow: Data Feeders 1…M → (UDP) → PC Layer nodes 1…N → (TCP) → Storage nodes 1…S, with a network timer driving the train builder layer (groups: id, rate, ...)
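A quick arithmetic check of the per-channel figures above (a sketch; the constants are the figures quoted on the slides, reading GB as GiB):

```python
# Per-channel rate check for the 2D detector readout described above.
GIB = 2**30
CHANNELS = 16          # 10GbE links
TRAIN_PERIOD = 1.6     # seconds between trains
TRAIN_SIZE = 1 * GIB   # ~1 GB of image data per train per channel

per_channel = TRAIN_SIZE / TRAIN_PERIOD   # bytes/s on one link
total = per_channel * CHANNELS            # aggregate detector rate

print(round(per_channel / 2**20))  # 640 MB/s per channel
print(round(total / GIB))          # 10 GB/s aggregate
```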
Architecture overview
• Data-driven model; components are configured from XML files
• Timer server (net-timer) drives the Train Builder Emulator (master) and Data Feeders 1…M over TCP; monitored: CPU, network
• TB emulator (master) holds train metadata and data files
• PC Layer nodes 1…N receive over UDP; monitored: CPU, queues, network
• Storage Layer nodes receive over TCP, each with a folder and naming convention for train metadata; monitored: CPU, queues, network

Processing pipeline
• Distributed and parallel processing
• Pipelining and multithreading on multi-core machines
• Trains T1, T2, … flow through the stages:
  – Train builder: Generate → Send
  – PC Layer: Receive → Process → Format → Send (with online analysis forwarded to SC)
  – Storage: Receive → Write

Train builder emulator
• Image/detector data generators: generate & store images (CImg), or load data files from disk (offline) into the image data queue (pointers)
• Train builder: assembles data from the detector data queue (pointers) into the trains buffer, on DAQ requests from the detector emulator
• Packetizer: consumes the train data queue (tokens) and sends packets (UDP) to the PCL nodes, driven by clock ticks (TCP) from the timer server

PC-Layer node
• De-packetizer: receives packets (UDP) into the trains buffer and feeds the process queue (train id)
• Processing pipeline: monitoring, reduction, statistics, …
  – Currently only simple processing (checksum); real algorithms are needed
• Formatter: consumes the format queue (train id) and builds in-memory files
• Writer: consumes the write queue (file name) and streams the files over TCP to online storage

Storage server
• Memory ring buffer
  – Buffer size: 16 GB
  – Slot (chunk) size: 1 GB
• Thread pool
  – Reader: reads the data stream from the TCP network and fills free slots of the ring buffer
  – Writer: writes filled buffer slots to disk (cache) and frees them
  – Sync: flushes the cache to the disk array (physical writing)
• IO modes
  – Cached: issue IO commands via the OS
  – Direct: talk directly to the IO device
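The reader/writer slot cycle described above can be sketched with two threads exchanging slots through a pair of queues (a toy model: small byte buffers stand in for the 16 GB ring buffer with its 1 GB slots; all names are illustrative):

```python
import queue
import threading

SLOTS = 4
free_slots = queue.Queue()      # empty slots, ready for the reader
filled_slots = queue.Queue()    # filled slots, waiting for the writer
for _ in range(SLOTS):
    free_slots.put(bytearray(16))   # stand-in for a 1 GB slot

def reader(chunks):
    """Fill free slots from the (simulated) TCP stream."""
    for chunk in chunks:
        slot = free_slots.get()          # blocks if the writer lags behind
        slot[:len(chunk)] = chunk
        filled_slots.put((slot, len(chunk)))
    filled_slots.put(None)               # end-of-stream marker

written = []
def writer():
    """Drain filled slots to 'disk', then recycle them."""
    while (item := filled_slots.get()) is not None:
        slot, n = item
        written.append(bytes(slot[:n]))  # stand-in for the disk write
        free_slots.put(slot)             # slot goes back to the reader

t_r = threading.Thread(target=reader, args=([b"train-0001", b"train-0002"],))
t_w = threading.Thread(target=writer)
t_r.start(); t_w.start()
t_r.join(); t_w.join()
print(written)   # [b'train-0001', b'train-0002']
```

The bounded queue of free slots is what provides back-pressure: if the disk cannot keep up, the reader blocks instead of overwriting data that has not been written yet.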
• IO modes (continued)
  – Splice: data transfer within kernel space

NETWORK PERFORMANCE

Train Transfer Protocol (TTP)
• Train data format
  – Header, image data, image descriptors, detector-specific data, trailer
• Train Transfer Protocol (TTP)
  – Based on UDP; designed for fast data transfer where TCP is not implemented or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame numbers)
  – Packetization: the data block (frame) is bundled into small packets (8 KB) tagged with increasing packet numbers
  – Packet header: frame number (4 bytes) and flags (1 byte: SoF, EoF, padding), followed by the packet number
  – Packet trailer mode

Previous results (reminder)
• Two types of programs run in parallel on all machines (PCL nodes 1-8, unidirectional streams, concurrent send/receive)
  – Feeders: generate train data, packetize it, and send the packets using UDP
  – Receivers: reconstruct train data from the packets and store it in memory (overwriting)
• Run length (number of packets / total time):
  – Typical: ~2.5×10^9 packets (a few hours)
  – Maximum: 5×10^9 packets (16 h 37 m)
• Time profile
  – XFEL: 10 MHz for 16 channels: send 1 train (~131100 packets) within 1.6 sec
  – Continuous: no waiting time between trains

Previous results (reminder)
• Network transfer rate: 1 GB train in 0.87 sec ≈ 9.9 Gbps
• CPU usage (receiver core): 40%
• Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – This happens only at the beginning of some runs (not of trains)
  – Observed sometimes on all machines, sometimes only on some machines
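A minimal packetizer in the spirit of the TTP description above: the frame is cut into 8 KB payloads, each prefixed with the frame number, the flags, and an increasing packet number. The exact wire layout (field order and widths) is an assumption here for illustration, not the actual TTP format.

```python
import struct

PAYLOAD = 8 * 1024          # 8 KB of data per packet, as on the slide
SOF, EOF = 0x1, 0x2         # start-of-frame / end-of-frame flags

def packetize(frame_no, data):
    """Split one frame into tagged UDP-sized packets (illustrative header)."""
    total = (len(data) + PAYLOAD - 1) // PAYLOAD
    packets = []
    for i in range(total):
        flags = (SOF if i == 0 else 0) | (EOF if i == total - 1 else 0)
        # assumed layout: frame number (4 B), flags (1 B), packet number (4 B)
        header = struct.pack("!IBI", frame_no, flags, i)
        packets.append(header + data[i * PAYLOAD:(i + 1) * PAYLOAD])
    return packets

pkts = packetize(42, b"x" * 20000)
print(len(pkts))                 # 3 packets: payloads of 8192, 8192, 3616 bytes
print(pkts[0][4], pkts[-1][4])   # 1 2  (SoF on the first, EoF on the last)
```

The receiver can use the frame number to separate interleaved trains and the packet number plus EoF flag to detect loss and reassemble the frame.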
Previous results (reminder, continued)
• Some runs show no packet loss on any machine
• Ignoring the first lost packets, which affect only the first train:
  – Typical run (3.5×10^8 packets): fewer than 3.7 out of 10000 trains affected
  – Long run (5×10^9 packets): fewer than 26 out of one million trains affected

Train switching
• In the previous experiments:
  – Each feeder was configured to feed one PC layer node (one-to-one): feeders (st101, st102) → 10GbE switch (sub-net 1, TTP) → PC layer nodes (st104, st105)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through different channels every time: 10 trains per second over 16 channels
• Question:
  – What if each feeder sends train data to a different PC layer node every time? Feeders 1-3 (st101-st103) → 10GbE switch (sub-net 1, TTP) → PC layer nodes 1-2 (st104, st105)

Train switching: test configuration
• Timer / train builder: st401
• 3 feeder nodes (st101, st102, st103)
  – Pre-load images from disk
  – Build the train data (header, trailer, …)
  – Calculate a checksum (Adler32)
  – Packetize the train data (TTP)
• 2 PC layer nodes (st104, st105), connected over a 10GbE switch (sub-net 1, TTP)
  – Depacketize (TTP); no processing is performed
  – Format to an HDF5 file
  – Stream the files through TCP (splice)
• 2 storage nodes (st106, st107), connected over a 10GbE switch (sub-net 2, TCP)
  – Write the files to shared memory (splice)

Train switching: experiment
• 3 feeders feed 2 PC layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
  – A PC layer node receives 1 train every 1.6 sec, each time from a different feeder
  – A feeder sends out 1 train every 2.4 sec, each time to a different IP address
  – The packetizer checks the send buffer (e.g. every 100 packets) to avoid overwriting previous, not-yet-sent packets
  – All feeder-to-PC-layer data transfers are done on the same sub-network
  – The train transfer time is 0.88 sec, i.e.
there is an overlap of 0.08 sec between two consecutive trains (9% of the time)
  – Schedule: feeders 1, 2, 3, 1, 2, 3, … send to PCL nodes 1, 2, 1, 2, … at times 0.0, 0.8, 1.6, 2.4, 3.2, 4.0, 4.8 sec

Experiment
• Total run time
  – Short runs: less than half an hour
  – Long run: 18 hours (81657 trains, ~80 TB)
• Observations:
  – 6 trains were affected at the beginning of the run for each PC layer node; then both continued smoothly with no train loss until 1 am
  – At 8 am, 2 more trains were affected at one PC layer node (probably due to the nightly update); no train loss on the other node
  – The run then continued very stably until the end
  – Network send and receive (TTP and TCP) were very stable
  – The formatting time was not stable all the time

Summary
• Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01: received 40820, affected 8
  – PCLayer02: received 40823, affected 6
  – Train size: 1073754173 bytes = 1 GB (image data) + 12.06 MB (header, descriptors, detector data, and trailer)
  – Packets per train: 131202
  – Transfer time: 0.877579 sec
  – Transfer rate: 1.1395 GB/sec = 9.116 Gbps
• Sustainable and stable network bandwidth

STORAGE PERFORMANCE

Direct IO vs. Cached IO
• Cached IO: issue read/write commands via the OS kernel, which executes the IO; data is copied to/from the page cache (kernel buffer) and later flushed to disk
• Direct IO: read/write directly to/from the device, bypassing the page cache
  – DMA: memory alignment (512)
  – RAID: stripe size (64K), number of disks
  – Per file vs. per partition (Linux)
• Splice: zero-copy operation on socket and file descriptors
  – Performs the data transfer within kernel space (transparent), via page remapping
  – Problem using the Linux splice function with the IBM/Mellanox setup; we couldn't figure out the reason

Cached IO
• Run length: 2+ hours; buffer size: 4 GB
• Method: read all file data into one buffer, write it once, then sync
• Two measurements (time per file vs. file number, empty disk):
  – File size 1 GB, time period 1.6 s
  – File size 2 GB, time period 3.2 s

Run configuration
• Two types of programs run in parallel on four machines
  – Sender (Dell PCL nodes 101-104): open an in-memory data file, stream its content using TCP, close the file
  – Receiver (IBM storage nodes 201-204, internal and external disks): read the file data from the socket into a RAM buffer (16 GB) and write it to disk
• Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full, 9 TB (4 hours) and 28 TB (12 hours)
  – Disks are cleaned before every run
• Time profile: 1 GB per 1.6 sec per box

  Size (GB):  1    10   20   40
  Time (sec): 1.6  16   32   64

Direct IO – external storage
• Storage extension box: 3 TB 7.2 krpm 6 Gbps NL-SAS disks, RAID6
• Read, write, and overall times grow linearly with file size (1-40 GB):

  File  Network      Storage      Net read (sec)  Disk write (sec)  Overall (sec)
  (GB)  (Gbps) avg.  (GB/s) avg.  max     avg.    max     avg.      max     avg.
  1     9.86         0.95         2.27    0.87    2.95    1.06      3.95    1.93
  10    9.85         0.97         9.01    8.72    15.58   10.30     16.45   11.17
  20    9.83         0.98         17.79   17.48   30.07   20.45     30.95   21.33
  40    9.61         0.97         38.81   35.74   58.06   41.39     59.06   42.36

Direct IO – internal storage
• Internal disks: 14 × 900 GB 10 krpm 6 Gbps SAS, RAID6
• Read, write, and overall times vs. file size (1-40 GB):

  File  Network      Storage      Net read (sec)  Disk write (sec)  Overall (sec)
  (GB)  (Gbps) avg.  (GB/s) avg.  max     avg.    max     avg.      max     avg.
  1     9.86         1.17         3.42    0.87    2.47    0.86      4.23    1.73
  10    9.60         1.09         9.38    8.95    11.28   9.19      12.15   10.07
  20    9.36         1.06         19.18   18.36   22.63   18.84     23.55   19.75
  40    9.38         1.07         38.11   36.64   45.25   37.54     46.22   38.50

Long run experiments
• Buffer size: 16 GB; slot size: 1 GB; file size: 10 GB; time profile: 16 sec
• Internal storage: 9 TB, 918 files, 4 hours (network read, disk write, and overall time tracked per file, with averages)
• External storage: 28 TB, 2792 files, 12 hours

Statistics from Ganglia
• Long run experiment (5:24 am - 5:29 pm), host exflst201 (with external storage)
  – Disk write: 671.14 MB/s
  – Network bandwidth: 676.8 MB/s
  – CPU usage: system 5.81%, user 0.39%
    • Reader (core 0): system 49.54%, user 0.75%
    • Writer (core 1): system 7.03%, user 0.010%

Result summary
• We need to write a 1 GB data file within 1.6 s per storage box
  – Both the internal and external storage configurations achieve this rate (1.1 GB/s and 0.97 GB/s, respectively)
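The storage sizing in this summary follows directly from the rates (a quick check; the 1 GB per 1.6 s requirement and the measured write speeds are the figures from the slides):

```python
GIB = 2**30
stream_rate = 10 * GIB        # ~10 GB/s aggregate train data
per_box = 1 * GIB / 1.6       # required: one 1 GB file every 1.6 s = 0.625 GB/s

print(round(stream_rate / per_box))   # 16 storage boxes
# Both measured write rates (1.1 and 0.97 GB/s) exceed the required
# 0.625 GB/s per box, leaving headroom for concurrent IO.
```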
• 16 storage boxes are needed to handle the 10 GB/s train data stream
• High network bandwidth and low CPU load (stable)
• Direct IO:
  – Network read and disk write operations overlap 97%, giving a low overall time per file
  – Application buffer: for big files, the bigger the slot size, the better the disk IO performance (as long as DMA allows)
• To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management

SUMMARY & OUTLOOK

Summary
• The first half of the slice-test hardware is configured and running
• Testing and tuning network and IO performance using
  – System/community tools: netperf, iozone, …
  – The PCL software
  – The train builder board
• TB emulator → PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet loss rate: only a few packets lost at the start of runs of 3.5×10^8 to 5×10^9 packets; at most 3.7 to 0.26 trains per 10^5 can be affected
• PC layer (Dell) → storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Terabytes of data written to disk at 0.97 to 1.1 GB/s

Outlook
• Fully featured DAQ system
  – Data readout, pre-processing, monitoring, storage
  – Feed the system with real data and apply real algorithms (processing, monitoring, scientific computing)
  – Deployment, configuration, and control: upload libraries, initiate devices, start/stop and monitor runs; device composition
• Soak and stress testing
  – Test the performance (CPU, IO, network), behavior (bugs, memory leaks), reliability (error handling, failures), and stability of the system under significant workload applied over a long period of time
  – Parallel tasks: forwarding data to online analysis or scientific computing, multiple streams into the same storage server, …
• Cluster file system vs. DDN vs. local storage system
• Data management: structure, access control, metadata, …

Thanks!
Djelloul Boukhelef - XFEL