Hyper-Scaling Xrootd Clustering

Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
29-September-2005
http://xrootd.slac.stanford.edu
Root 2005 Users Workshop, CERN, September 28-30, 2005

Outline

* Xrootd single-server scaling
* Hyper-scaling via clustering
  - Architecture
  - Performance
* Configuring clusters
  - Detailed relationships
  - Example configuration
  - Adding fault tolerance
* Conclusion

Latency Per Request (xrootd)

[Figure: per-request latency for a single xrootd server.]

Capacity vs Load (xrootd)

[Figure: capacity versus load for a single xrootd server.]

xrootd Server Scaling

* Linear scaling relative to load
* Allows deterministic sizing of the server:
  - Disk
  - NIC
  - Network fabric
  - CPU
  - Memory
* Performance is tied directly to hardware cost

Hyper-Scaling

* xrootd servers can be clustered
  - Increases access points and available data: complete scaling
  - Allows for automatic failover: comprehensive fault tolerance
* The trick is to do so in a way that
  - cluster overhead (human and non-human) scales linearly,
  - the cluster can be sized deterministically,
  - cluster size is not artificially limited, and
  - I/O performance is not affected.

Basic Cluster Architecture

* Software crossbar switch
  - Allows point-to-point connections between client and data server
  - I/O performance is not compromised, assuming the switch overhead can be amortized
* Scale interconnections by stacking switches
  - Virtually unlimited connection points
  - Switch overhead must be very low

Single Level Switch

[Diagram: a client sends "open file X" to the redirector (head node); the redirector asks data servers A, B, and C "Who has file X?"; C answers, and the client is told "go to C". Redirectors cache file locations, so a second open of X is redirected to C immediately.]

* Client sees all servers as xrootd data servers

Two Level Switch

[Diagram: the client's "open file X" reaches the redirector (head node), which answers "go to C", where C is a supervisor (sub-redirector) over data servers D, E, and F; reissuing the open at C yields "go to F".]

* Client sees all servers as xrootd data servers

Making Clusters Efficient

* Cell size, structure, and search protocol are critical
* Cell size is 64
  - Limits direct inter-chatter to 64 entities
  - Compresses incoming information by up to a factor of 64
  - Permits very efficient 64-bit logical operations
* Hierarchical structures are usually the most efficient
  - Cells are arranged in a B-tree (i.e., a B64-tree)
  - Scales as 64^h, where h is the tree height
  - A client needs h-1 hops to find one of 64^h servers (2 hops for 262,144 servers)
  - The number of responses is bounded at each level of the tree
* Search is a directed-broadcast, query/rarely-respond protocol
  - Provably the best scheme if fewer than 50% of the servers have the wanted file
  - Generally true when the number of files greatly exceeds cluster capacity
* The cluster protocol becomes more efficient as cluster size increases

Cluster Scale Management

* Massive clusters must be self-managing
  - The cluster scales as 64^n, where n is the height of the tree
  - That grows very quickly (64^2 = 4,096; 64^3 = 262,144)
  - Well beyond direct human management capabilities
* Therefore clusters self-organize
  - A single configuration file serves all nodes
  - Uses a minimal spanning tree algorithm
  - 280 nodes self-cluster in about 7 seconds
  - 890 nodes self-cluster in about 56 seconds
  - Most of the overhead is wait time to prevent thrashing

Clustering Impact

* Redirection overhead must be amortized
  - This is a deterministic process for xrootd
  - All I/O is via point-to-point connections
  - Single-server performance data can be used trivially
* Clustering overhead is non-trivial
  - An open call costs an additional 100-200 us
  - Not good for very small files or short "open" durations
  - However, this is compatible with HEP access patterns
* (A back-of-the-envelope sketch of the tree scaling and open-call amortization follows below.)
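To make these numbers concrete, here is a minimal sketch in Python that restates the arithmetic from the last three slides. The cell size (64), the h-1 hop count, and the 100-200 us open overhead come from the slides; the 50 MB/s per-stream throughput and the sample file sizes are assumptions chosen purely for illustration.

    # Back-of-the-envelope arithmetic for a B-64 tree and for amortizing
    # the extra open-call latency quoted on the preceding slides.

    CELL_SIZE = 64  # entities per cell (from the slides)

    def servers_reachable(height: int) -> int:
        """Servers addressable by a B-64 tree of height h (64^h)."""
        return CELL_SIZE ** height

    def client_hops(height: int) -> int:
        """Redirection hops needed to find one of 64^h servers (h - 1)."""
        return height - 1

    for h in (1, 2, 3):
        print(f"height {h}: {servers_reachable(h):>9,} servers, "
              f"{client_hops(h)} hop(s)")
    # height 1:        64 servers, 0 hop(s)
    # height 2:     4,096 servers, 1 hop(s)
    # height 3:   262,144 servers, 2 hop(s)

    # Amortization: an extra 150 us per open (midpoint of the quoted
    # 100-200 us range) is invisible for HEP-sized files but dominates
    # for tiny files, matching the slide's caveat.
    OPEN_OVERHEAD_S = 150e-6  # assumed midpoint of the quoted range
    THROUGHPUT_BPS = 50e6     # assumed 50 MB/s point-to-point stream

    for file_bytes in (4 * 1024, 100 * 1024**2, 2 * 1024**3):
        transfer_s = file_bytes / THROUGHPUT_BPS
        overhead = OPEN_OVERHEAD_S / (transfer_s + OPEN_OVERHEAD_S)
        print(f"{file_bytes:>13,} bytes: open overhead = {overhead:.3%}")

Nothing above is xrootd code; it only restates the slides' arithmetic to show why the redirection cost effectively vanishes at typical HEP file sizes while dominating for very small ones.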
Detailed Cluster Architecture

* A cell is 1 to 64 entities (servers or cells) clustered around a cell manager
* The cellular process is self-regulating and creates a B-64 tree

[Diagram: cells of up to 64 entities stack beneath the manager (M) at the head node, forming the tree.]

The Internal Details

* Every node pairs an xrootd daemon (with its ofs and odc components) with an olbd daemon (olb component)
* Control network (olbd): managers, supervisors, and servers exchange resource information and file locations
* Data network (xrootd): redirectors steer clients to the data; data servers provide the data

[Diagram: redirectors run xrootd + olbd (M); data servers run xrootd + olbd (S); clients talk to xrootd over the data network, while the olbd daemons form the control network.]

xrootd Schema Configuration

Each role uses a different subset of the ofs, odc, and olb directives:

  Redirectors (head node):
    ofs.redirect remote
    odc.manager host port
    olb.role manager
    olb.port port
    olb.allow hostpat

  Supervisors (sub-redirector):
    ofs.redirect remote
    ofs.redirect target
    olb.role supervisor
    olb.subscribe host port
    olb.allow hostpat

  Data servers (end node):
    ofs.redirect target
    olb.role server
    olb.subscribe host port

Example: SLAC Configuration

[Diagram: client machines contact the redirector pair kanrdr01 and kanrdr02 (addressed collectively as kanrdr-a); behind them sit the data servers kan01, kan02, kan03, kan04, ... kanxx. Some details are hidden.]

xrootd Configuration File

One file, shared by every node (ofs, odc, and olb directives):

  if kanrdr-a+
     olb.role manager
     olb.port 3121
     olb.allow host kan*.slac.stanford.edu
     ofs.redirect remote
     odc.manager kanrdr-a+ 3121
  else
     olb.role server
     olb.subscribe kanrdr-a+ 3121
     ofs.redirect target
  fi

Potential Simplification?

The same configuration could arguably be written more compactly:

  olb.port 3121
  all.role manager if kanrdr-a+
  all.role server if !kanrdr-a+
  all.subscribe kanrdr-a+
  olb.allow host kan*.slac.stanford.edu

Is the simplification really better? We're not sure; what do you think?
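Before moving on to fault tolerance, a quick illustration of why one shared file suffices: each node evaluates the if/else directives against its own hostname. The Python below is a toy model of that selection, not xrootd's actual configuration parser; the glob-style matching and the kanrdr* pattern standing in for the kanrdr-a+ alias are assumptions made purely for illustration.

    from fnmatch import fnmatch
    from typing import List

    # Toy model: which directives from the shared configuration file a
    # given host would apply, mimicking the if/else branches above.

    REDIRECTOR_PATTERN = "kanrdr*"  # stand-in for the kanrdr-a+ alias

    def directives_for(hostname: str) -> List[str]:
        """Return the directive set this host would apply (illustrative)."""
        if fnmatch(hostname, REDIRECTOR_PATTERN):  # the "if kanrdr-a+" branch
            return ["olb.role manager",
                    "olb.port 3121",
                    "olb.allow host kan*.slac.stanford.edu",
                    "ofs.redirect remote",
                    "odc.manager kanrdr-a+ 3121"]
        # the "else" branch: every other node is a data server
        return ["olb.role server",
                "olb.subscribe kanrdr-a+ 3121",
                "ofs.redirect target"]

    for host in ("kanrdr01", "kan03"):
        print(f"{host}: {directives_for(host)}")

The payoff is purely operational: the identical file can be distributed to every node, and each node's role falls out of its own hostname.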
Adding Fault Tolerance

[Diagram: the manager (head node), supervisors (intermediate nodes), and data servers (leaf nodes) each run an xrootd/olbd pair; any layer can be made redundant.]

* Fully replicate
* Hot spares
* Data replication
* Restaging
* Proxy search*

*xrootd has built-in proxy support today; discriminating proxies will be available in a near-future release.

Conclusion

* High-performance data access systems are achievable
  - The devil is in the details
* High performance and clustering are synergistic
  - Together they allow unique performance, usability, scalability, and recoverability characteristics
* Such systems produce novel software architectures
* Challenges: creating applications that capitalize on such systems
* Opportunities: fast, low-cost access to huge amounts of data to speed discovery

Acknowledgements

* Fabrizio Furano, INFN Padova: client-side design & development
* Principal collaborators: Alvise Dorigo (INFN), Peter Elmer (BaBar), Derek Feichtinger (CERN), Geri Ganis (CERN), Guenter Kickinger (CERN), Andreas Peters (CERN), Fons Rademakers (CERN), Gregory Sharp (Cornell), Bill Weeks (SLAC)
* Deployment teams: FZK, DE; IN2P3, FR; INFN Padova, IT; CNAF Bologna, IT; RAL, UK; STAR/BNL, US; CLEO/Cornell, US; SLAC, US
* US Department of Energy Contract DE-AC02-76SF00515 with Stanford University