Grid in action: from EasyGrid to LCG testbed and gridification techniques.
James Cunha Werner
University of Manchester
Christmas Meeting - 2005

Going to grid
Conventional way:
• Usual code (your cuts)
• Run BetaMiniApp on several data files, one after the other.
• When all data is done, you have results!
Grid way:
• Same usual code (your cuts)
• Run several copies of BetaMiniApp, each running independently on one data file.
• At the end, join all results!
EasyGrid does it for you!

General overview
[Diagram: Users’ software; EasyGrid for datasets; Gridification algorithms for generic software; Grid testbed; EasyTau for selected events]

EasyGrid: an overview
• Prototype for future development. RPA = guarantee of useful software
• Provides all support for the job submission system:
  – Recovers results in the user’s directory
  – Generates reports for further analysis (aborts and abends) in one history file.
• It is a framework users can adapt to their own needs and applications.
• Fully operational and integrated with LCG.
./easygrid dataset_name

Christmas 2004: My goals were…
• Develop a fail-proof submission system.
• Write web pages covering all elementary tasks in HEP/Babar, to help students and newcomers.
• Understand q-qbar interaction through Pi0.
What I have achieved in 2005…

Achievements with EasyGrid
• User-friendly framework, flexible and reliable. It provides users with results, or with the information necessary for further analysis.
• Tutorial web pages for PhD students and new researchers. http://www.hep.man.ac.uk/u/jamwer
• Pi0 Project: analysis of 500 million events and generation of 5 million Monte Carlo events in 5 weeks. http://www.hep.man.ac.uk/u/jamwer/pi0alg5.html
• Anti-deuteron project: 1,500 million events in 1 week, running at several sites in the UK. More than 200 jobs in parallel. http://www.hep.man.ac.uk/u/jamwer/deutdesc.html

LCG Installation and debug
• There are several problems in the LCG grid:
  – a high number of jobs fail when running more than 200 jobs.
  – installation issues.
  – performance issues.
• Installation of a complete testbed from scratch using 10 obsolete computers: http://www.hep.man.ac.uk/u/jamwer/#sec0

Testbed stress test
Processing time is zero: BetaMiniApp is replaced by a program that prints the dataset name and waits for some time (e.g. 300 s). 1,000 jobs were submitted in each test to a testbed with 6 WNs.

                     T0     T1       T2
Sub Fail              0      0        0
Aborts (1)           84    122        0
Number of jobs/WN:
  Bf33              296    144        6
  Bf34              306    148      161
  Bf35              314    156      195
  Bf36                0    165      211
  Bf37                0    172      213
  Bf38                0     91 (2)  214

• T0 and T1: time between submissions is zero (continuous flow).
• T0: WNs bf36, bf37 and bf38 were running without pbs_mom started.
• T1: 1 WN crashed during the test (2).
• T2: time between submissions: 30 s. CE (bf32) CPU use was >90%.
(1) Cannot plan: BrokerHelper: no compatible resources

Recommendations
CEs are heavily loaded in the Grid (>90% CPU load!), and this affects grid performance:
• The number of WNs for each CE can be defined from the minimum submission delay and the minimum queue time.
• Running one CE for a large farm is a limiting factor. More matched CEs per RB would reduce failures and increase performance.
• The file system study will provide more information soon.

Research in Gridification technologies for conventional software
• Users spend years developing their source code, and they will not throw it away just to use web services.
• I developed an algorithm that allows users to run their own software on top of a web service layer with LCG middleware.
• Preliminary tests using “fake” web services (simulated with PVM) show it is a viable and flexible approach.

Gridification algorithm
• Creates parallel processes using PVM with ssh remote shell.
• There is a central job, which distributes tasks over the parallel processes as slave processes return results. No need for load balancing!
• Controls slave failures and resubmission to available slaves. There is no checkpoint system (not worth it).
• Transfer time can be a bottleneck. Task streams implemented.
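
The central-job scheme in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the talk's implementation: Python threads stand in for PVM processes started over ssh, and the names run_master, worker, n_workers and max_attempts are hypothetical. It shows the two ideas from the slide: a worker gets a new task only when it returns a result (so no separate load balancing is needed), and a failed task is simply resubmitted from scratch, with no checkpoint.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_master(tasks, worker, n_workers=4, max_attempts=3):
    """Central job: keep every worker busy with one task at a time and
    resubmit failed tasks from scratch (no checkpointing)."""
    pending = list(tasks)   # tasks not yet handed out
    attempts = {}           # task -> number of times it was submitted
    results = {}            # task -> value returned by the worker
    running = {}            # future -> task currently being processed

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending or running:
            # Hand out work only while a worker is free, so faster
            # workers naturally end up processing more tasks.
            while pending and len(running) < n_workers:
                task = pending.pop(0)
                attempts[task] = attempts.get(task, 0) + 1
                running[pool.submit(worker, task)] = task

            # Block until at least one worker returns (result or failure).
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                task = running.pop(fut)
                try:
                    results[task] = fut.result()
                except Exception:
                    if attempts[task] >= max_attempts:
                        raise  # give up on a task that keeps failing
                    pending.append(task)  # resubmit to the next free worker
    return results, attempts


if __name__ == "__main__":
    failed_once = set()

    def flaky_square(n):
        # Hypothetical task: fails the first time it sees n == 3.
        if n == 3 and n not in failed_once:
            failed_once.add(n)
            raise RuntimeError("simulated worker failure")
        return n * n

    results, attempts = run_master(range(6), flaky_square, n_workers=3)
    print(sorted(results.items()))
    print("submissions of task 3:", attempts[3])
```

In this sketch a failed task goes to the back of the pending list and is rerun in full, which mirrors the "no checkpoint" choice on the slide; the per-task transfer cost is not modelled.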
Results with 300 empty processes on one laptop show a transfer time of 185 ms/process.

Conclusion
• EasyGrid is operational. Benchmarks were a proof of concept under real conditions.
• LCG testbed is operational, providing results and supporting performance analysis and tuning.
• Gridification algorithm is running on one laptop with Genetic Programming/AI.

New year resolution
• Analysis of Linux kernel related file server issues.
• LCG performance study and Linux kernel tuning.
• Implementation of EasyTau: a submission module for the TauUser package using EasyGrid (running on ntuples).
• Gridification algorithm running with LCG and commercial applications (WebSphere, Tivoli, Symphony, etc.)
• EasyGrid Product development and startup.
• Run the pi0 project again with EasyGrid Product and maybe … publish a paper about gridification!

Happy new year!