RM3G: Next Generation Recovery Manager
Steve Zhang and Armando Fox
Stanford University

Design Goals
* Overall Goal: Manage the detection of and recovery from system failures
* New in 3G: Focus on online Statistical Learning Theory (SLT) algorithms for application generic failure detection
  * Previous generation used End-2-End and Exception monitors
* Not tie ourselves to any particular algorithms and make new algorithms easy to plug-in
* Standardize the APIs for observation, analysis, and control of system components
* Provide common services and abstractions to SLT algorithms
* RM itself must also be resilient to failures
[Diagram labels: SLTs, RM3G, Comp]
© 2004 Steve Zhang

RADS Architecture
[Figure: RADS architecture. Labels: User, Operator, Client, Server, Distributed Middleware, SLT Services (RM3G), PNE, Edge Network, Application-Specific Overlay Network, Router, Commodity Internet & IP networks]

Design Diagram
[Figure: RM3G design diagram. Labels: SLT Processes (spawned by SLT Proc Srv), Comp A, Comp B, Comp C, SLT Plug-ins, Data Store Srv, SLT Select Srv, Ctrl Srv, Ctrl/Obsrv point descriptors, Control policies, RM Proc Srv, Observation Points, RMDB, Name & Reg Srv, Control Points]

Collaboration with ACME
* Infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  * Sensors = Observation Points
  * Actuators = Control Points
* RM potentially benefits from two ACME features
  * An in-network aggregator combines data from sensors as they are routed through an overlay network
  * Configuration language that specifies under what conditions to trigger actuators
* ACME could benefit from more powerful sensor data analysis using SLTs

Observation Points
* We want to avoid requiring every component to be individually instrumented
* Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
* Several types of observation data can be collected in an application generic way
  * OS can provide application level data (e.g. memory usage, number of files open, etc.) and system level data (e.g. size of swap space, network ports used, etc.)
  * Middleware can provide intra-application data (e.g. interaction between different components of an application)

SLT Data Services
* Abstracts information from observation points
* SLT algorithms are spawned for each component in the system, as they are instantiated
* Observation data stored by SLT Data Server possibly in a streaming database
* Listens for feedback from SLT algorithms to adjust the data stream as necessary
  * Increase data sampling rate if anomaly is suspected
  * Stop reporting certain data if it is deemed to be irrelevant
* Provide persistent data storage for SLT algorithms
  * Remember properties learned from previous analysis of observation data

Control Points
* Assumes crash-only components
  * Components can be reliably restarted through external means (can’t rely on components restarting themselves cleanly)
* Initially, only restart control points are supported
  * Instrument application server (JBoss) to restart applications and application components
  * OS can restart application servers
  * IP addressable power strips can restart entire nodes
* Components can specify custom control policy
  * Leverage ACME’s configuration language

Future Work
* “Master” SLT
  * Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
* Support additional types of control points
  * Multiple level settings that tune component parameters (e.g. filter level)
* Support additional types of observation points
  * Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
* Online SLT algorithms for anomaly detection are not mature
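
Illustration: SLT Data Services feedback loop
The SLT Data Services slide describes a data server that listens for feedback from SLT algorithms, raising the sampling rate when an anomaly is suspected and dropping data deemed irrelevant. A minimal Python sketch of that idea follows. It is not RM3G's API: the class `DataStream`, its method names, the default and maximum rates, the rate-doubling policy, and the metric names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataStream:
    """Hypothetical per-component observation stream held by a data server."""
    component: str
    sample_hz: float = 1.0                    # assumed default sampling rate
    max_hz: float = 16.0                      # assumed upper bound on the rate
    muted: set = field(default_factory=set)   # metrics no longer reported

    def on_anomaly_suspected(self) -> None:
        # Feedback from an SLT algorithm: sample more often.
        # Doubling up to a cap is an assumed policy, not taken from the slides.
        self.sample_hz = min(self.sample_hz * 2, self.max_hz)

    def on_irrelevant(self, metric: str) -> None:
        # Feedback from an SLT algorithm: stop reporting this metric.
        self.muted.add(metric)

    def forward(self, observation: dict) -> dict:
        # Pass on only the metrics that have not been muted.
        return {k: v for k, v in observation.items() if k not in self.muted}


# Usage: one stream for a hypothetical component "CompA".
stream = DataStream("CompA")
stream.on_anomaly_suspected()
stream.on_irrelevant("open_files")
report = stream.forward({"memory_mb": 512, "open_files": 40})
```

Keeping the feedback handlers on the stream object, rather than inside each algorithm, reflects the slide's framing of the data server as a common service that any plugged-in SLT algorithm can drive.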