Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sphinx: A Scheduling Middleware for Data Intensive Applications on a Grid Richard Cavanaugh University of Florida Collaborators: Janguk In, Sanjay Ranka, Paul Avery, Laukik Chitnis, Gregory Graham (FNAL), Pradeep Padala, Rajendra Vippagunta, Xing Yan 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 1 The Problem of Grid Scheduling o Decentralised ownership o No one controls the grid o Heterogeneous composition o Difficult to guarantee execution environments o Dynamic availability of resources o Ubiquitous monitoring infrastructure needed o Complex policies o Issues of trust o Lack of accounting infrastructure o May change with time o Information gathering and processing is critical! 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 2 A Real Life Example o Merge two grids into a single multi-VO “inter-grid” UW UC UI ANL o How to ensure that BU UM MIT BNL FNAL IU LBL o neither VO is harmed? o both VOs actually benefit? o there are answers to questions like: Caltech UCSD OU UTA SMU Rice UF o “With what probability will my job be scheduled and complete before my conference deadline?” o Clear need for a scheduling middleware! 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 3 Some Requirements for Effective Grid Scheduling o Information requirements o Past & future dependencies of the application o Persistent storage of workflows o Resource usage estimation o Policies o Expected to vary slowly over time o Global views of job descriptions o Request Tracking and Usage Statistics o State information important 18.09.2003 o Resource Properties and Status o Expected to vary slowly with time o Grid weather o Latency measurement important o Replica management o System requirements o Distributed, fault-tolerant scheduling o Customisability o Interoperability with other scheduling systems o Quality of Service Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 4 Incorporate Requirements into a Framework VDT Client ? ? o Assume the GriPhyN Virtual Data Toolkit: ? VDT Server o Client (request/job submission) o Globus clients o Condor-G/DAGMan o Chimera Virtual Data System VDT Server VDT Server o Server (resource gatekeeper) o o o o 18.09.2003 Globus services RLS (Replica Location Service) MonALISA Monitoring Service etc Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 5 Incorporate Requirements into a Framework o Framework design principles: o Information driven o Flexible client-server model o General, but pragmatic and simple o Implement now; learn; extend over time o Avoid adding middleware requirements on grid resources ? VDT Client o Take what is offered! o Assume the GriPhyN Virtual Data Toolkit: Scheduler VDT Server o Client (request/job submission) o o o o Clarens Web Service Globus clients Condor-G/DAGMan Chimera Virtual Data System VDT Server VDT Server o Server (resource gatekeeper) o MonALISA Monitoring Service o Globus services o RLS (Replica Location Service) 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 6 The Sphinx Framework Clarens Sphinx Client WS Backbone Request Processing Chimera Virtual Data System Condor-G/DAGMan VDT Client Data Warehouse Data Management Information Gathering Sphinx Server 18.09.2003 Globus Resource Replica Location Service MonALISA Monitoring Service VDT Server Site Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 7 Sphinx Scheduling Server Control Process o Functions as the Nerve Centre o Data Warehouse o Policies, Account Information, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc o Control Process Message Interface Graph Reducer Job Predictor Graph Predictor Job Admission Control Graph Admission Control Graph Data Planner Data Warehouse Job Execution Planner Graph Tracker o Finite State Machine o Different modules modify jobs, graphs, workflows, etc and change their state o Flexible o Extensible 18.09.2003 Data Management Information Gatherer Sphinx Server Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 8 Policy Constraints o Defined by Resource Providers o Actual grid sites (resource centres) o VO management o Applied to Request Submitters o VO, group, user, or even a proxy request (e.g. workflow) o Valid over a Period of Time o Can be dynamic (e.g. periodic) or constant o Global accounting and book-keeping is necessary 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 9 Quality of Service o For grid computing to become economically viable, a Quality of Service is needed o “Can the grid possibly handle my request within my required time window?” o If not, why not? When might it be able to accommodate such a request? o If yes, with what probability? o But, grid computing today typically: o Relies on a “greedy” job placement strategies o Works well in a resource rich (user poor) environment o Assumes no correlation between job placement choices o Provides no QoS 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 10 Quality of Service o As a grid becomes resource limited, o QoS becomes even more important! o “greedy” strategies may not be a good choice o Strong correlation between job placement choices o Sphinx is designed to provide QoS through time dependent, global views of o Requests (workflows, jobs, allocation, etc) o Policies o Resources 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 11 Resource Usage Estimation o User Requirements o Upper limits on CPU, memory, storage, bandwidth usage o Domain Specific Knowledge o Applications are often known to depend logarithmically, linearly, etc on certain input parameters, data size or type o Historical Estimates o Record the performance of all applications o Statistically estimate resource usage within some confidence level 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 12 Data Management o Smart Replication: o Graph based o Examine and insert replication nodes to minimise overall completion time o Distribute and collect required data o Particularly useful in data parallelism o “Hot Spot” based o Monitor current and historical data access patterns and replicate to optimise future access 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 13 Data Management o Smart Replication: o Graph based o Examine and insert replication nodes to minimise overall completion time o Distribute and collect required data o Particularly useful in data parallelism o “Hot Spot” based o Monitor current and historical data access patterns and replicate to optimise future access 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 14 Early Sphinx Prototype Test Results o Simple sanity checks o 120 canonical virtual data workflows submitted to US-CMS Grid o Round-robin strategy o Equally distribute work to all sites o Upper-limit strategy o Makes use of global information (site capacity) o Throttle jobs using just-in-time planning o 40% better throughput (given grid topology) o Conclusion: Prototype is working! 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 15 Some Current and Future Activities o o o o o o Policy Based Scheduling Quality of Service Graph Partitioning Data Parallelism Prediction Module Useful Views and Fusion of Monitoring Data 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 16 Conclusions o Scheduling on a grid has unique requirements o Information o System o Decisions based on global views providing a Quality of Service are important o Particularly in a resource limited environment o Sphinx is an extensible, flexible grid middleware which o Already implements many required features for effective global scheduling o Provides an excellent “workbench” for future activities! 18.09.2003 Data Mining and Exploration Middleware for Distributed and Grid Computing – University of Minnesota 17