RM3G: Next Generation Recovery Manager
Steve Zhang and Armando Fox
Stanford University
© 2004 Steve Zhang

Design Goals
- Overall Goal: Manage the detection of and recovery from system failures
- New in 3G: Focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
  - Previous generation used End-2-End and Exception monitors
- Not tie ourselves to any particular algorithms and make new algorithms easy to plug in
  - Standardize the APIs for observation, analysis, and control of system components
  - Provide common services and abstractions to SLT algorithms
- RM itself must also be resilient to failures
[Diagram labels: SLTs, RM3G, Comp]

RADS Architecture
[Diagram labels: User, Operator, Client, Server, Distributed Middleware, SLT Services (RM3G), PNE, Edge Network, Application-Specific Overlay Network, Router, Commodity Internet & IP networks]

Design Diagram
[Diagram labels: SLT Processes (spawned by SLT Proc Srv), Comp A, Comp B, Comp C, SLT Plug-ins, Data Store Srv, SLT Select Srv, Ctrl Srv, Ctrl/Obsrv point descriptors, Control policies, RM Proc Srv, Observation Points, RMDB, Name & Reg Srv, Control Points]

Collaboration with ACME
- Infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  - Sensors = Observation Points
  - Actuators = Control Points
- RM potentially benefits from two ACME features
  - An in-network aggregator combines data from sensors as they are routed through an overlay network
  - Configuration language that specifies under what conditions to trigger actuators
- ACME could benefit from more powerful sensor data analysis using SLTs

Observation Points
- We want to avoid requiring every component to be individually instrumented
  - Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
- Several types of observation data can be collected in an application-generic way
  - OS can provide application-level data (e.g. memory usage, number of files open, etc.) and system-level data (e.g. size of swap space, network ports used, etc.)
  - Middleware can provide intra-application data (e.g. interaction between different components of an application)

SLT Data Services
- Abstracts information from observation points
  - SLT algorithms are spawned for each component in the system, as they are instantiated
  - Observation data stored by SLT Data Server, possibly in a streaming database
- Listens for feedback from SLT algorithms to adjust the data stream as necessary
  - Increase data sampling rate if anomaly is suspected
  - Stop reporting certain data if it is deemed to be irrelevant
- Provide persistent data storage for SLT algorithms
  - Remember properties learned from previous analysis of observation data

Control Points
- Assumes crash-only components
  - Components can be reliably restarted through external means (can't rely on components restarting themselves cleanly)
- Initially, only restart control points are supported
  - Instrument application server (JBoss) to restart applications and application components
  - OS can restart application servers
  - IP-addressable power strips can restart entire nodes
- Components can specify custom control policy
  - Leverage ACME's configuration language

Future Work
- "Master" SLT
  - Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
- Support additional types of control points
  - Multiple level settings that tune component parameters (e.g. filter level)
- Support additional types of observation points
  - Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
- Online SLT algorithms for anomaly detection are not mature
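
Sketch: SLT plug-in feedback loop
The SLT Data Services slide describes a data server that listens for feedback from SLT algorithms and either raises the sampling rate or stops reporting a metric. The slides do not define an API, so the following is only a minimal sketch of how such a loop might look. Every name in it (SLTPlugin, Feedback, DataService, ThresholdPlugin, the interval values) is hypothetical, and the threshold check is a stand-in for a real SLT algorithm, not one.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum


class Feedback(Enum):
    """Hypothetical feedback signals an SLT plug-in can send back."""
    NONE = "none"
    INCREASE_RATE = "increase_rate"    # anomaly suspected
    STOP_REPORTING = "stop_reporting"  # metric deemed irrelevant


class SLTPlugin(ABC):
    """Hypothetical plug-in interface; the slides only call for a standard API."""

    @abstractmethod
    def observe(self, metric: str, value: float) -> Feedback:
        ...


class ThresholdPlugin(SLTPlugin):
    """Stand-in detector using fixed limits; NOT a real SLT algorithm."""

    def __init__(self, limits: dict):
        self.limits = limits

    def observe(self, metric: str, value: float) -> Feedback:
        if metric not in self.limits:
            return Feedback.STOP_REPORTING
        if value > self.limits[metric]:
            return Feedback.INCREASE_RATE
        return Feedback.NONE


@dataclass
class DataService:
    """Forwards observations to a plug-in and adjusts the data stream."""
    plugin: SLTPlugin
    interval_s: dict = field(default_factory=dict)  # metric -> sampling interval
    muted: set = field(default_factory=set)         # metrics no longer reported
    default_interval_s: float = 10.0                # assumed starting interval
    min_interval_s: float = 1.0                     # assumed lower bound

    def report(self, metric: str, value: float) -> None:
        if metric in self.muted:
            return
        self.interval_s.setdefault(metric, self.default_interval_s)
        feedback = self.plugin.observe(metric, value)
        if feedback is Feedback.INCREASE_RATE:
            # Halve the interval (sample twice as often), down to the floor.
            halved = self.interval_s[metric] / 2
            self.interval_s[metric] = max(self.min_interval_s, halved)
        elif feedback is Feedback.STOP_REPORTING:
            self.muted.add(metric)
            self.interval_s.pop(metric, None)


service = DataService(ThresholdPlugin({"memory_mb": 512.0}))
service.report("memory_mb", 600.0)  # over the limit: interval is halved
service.report("open_files", 12.0)  # unknown to the plug-in: metric is muted
print(service.interval_s, service.muted)
```

Keeping the detector behind an abstract interface reflects the design goal of not tying the recovery manager to any particular algorithm: a different plug-in could be swapped in without changing the data service.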
