Machine Learning for Automated Diagnosis of Distributed Systems Performance
Ira Cohen, HP Labs, June 2006
http://www.hpl.hp.com/personal/Ira_Cohen
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Intersection of systems and ML/data mining: a growing (research) area
• Berkeley's RAD lab (Reliable Adaptive Distributed systems lab) received $7.5M from Google, Microsoft and Sun for: "…adoption of automated analysis techniques from Statistical Machine Learning (SML), control theory, and machine learning, to radically improve detection speed and quality in distributed systems"
• Workshops devoted to the area (e.g., SysML); papers in leading systems and data mining conferences
• Part of IBM's "Autonomic Computing" and HP's "Adaptive Enterprise" visions
• Startups (e.g., Splunk, LogLogic)
• And more…

SLIC project at HP Labs*: Statistical Learning, Inference and Control
• Research objective: provide technology enabling automated decision making, management and control of complex IT systems.
− Explore statistical learning, decision theory and machine learning as the basis for automation.
• Today's focus: performance diagnosis.
* Participants/collaborators: Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, Steve Zhang, Jeff Chase, Rob Powers, Chengdu Huang, Blaine Nelson

Intuition: Why is performance diagnosis hard?
• What do you do when your PC is slow?

Why care about performance?
• Answer: it costs companies BIG money. Analysts estimate that poor application performance costs U.S.-based companies approximately $27 billion each year.
• Performance management software product revenue is growing at double-digit percentages every year!

Challenges today in diagnosing/forecasting IT performance problems
• Distributed systems/services are complex
− Thousands of systems/services/applications is typical
− Multiple levels of abstraction and interactions between components
− Systems/applications change rapidly
• Multiple levels of responsibility (infrastructure operators, application operators, DBAs, …) --> a lot of finger pointing
− Problems can take days/weeks to resolve
• Loads of data, no actionable information
− Operators manually search for a needle in a haystack
− Multiple types of data sources, and a lack of unifying tools to even view the data
• Operators hold past diagnosis efforts in their heads: the history of diagnosis efforts is mostly lost.

Translation to machine learning challenges
• Transforming data to information: classification and feature selection methods, with a need for explanation
• Adaptation: learning with concept drift
• Leveraging history: transforming diagnosis into an information retrieval problem, clustering methods, etc.
• Using multiple data sources: combining structured and semi-structured data
• Scalable machine learning solutions: distributed analysis, transfer learning
• Using human feedback (human in the loop): semi-supervised learning (active learning, semi-supervised clustering)

Outline
• Motivation (already behind us…)
• Concrete example: the state of distributed performance management today
• ML challenges, with examples of research results
• Bringing it all together as a tool: providing diagnostic capabilities as a centrally managed service
• Discussion/summary

Example: A real distributed HP application architecture
• A geographically distributed 3-tier application
• Results shown today are from the last 19+ months of data collected from this service

Application performance "management": Service Level Objectives (SLOs)
• Unhealthy = SLO violation

Detection is not enough…
• Triage:
− What are the symptoms of the problem?
− Who do I call?
• Leverage history:
− Did we see similar problems in the past?
− What were the repair actions?
− Do/did they occur in other data centers?
• Problem prioritization:
− How many different problems are there, and what is their severity?
− Which are recurrent?
• Can we forecast these problems?

Challenge 1: Transforming data to information…
• Many measurements (metrics) are available on IT systems (OpenView, Tivoli, etc.)
− System/application metrics: CPU, memory, disk and network utilization, queues, etc.
− Measured on a regular basis (1-5 minutes with commercial tools)
• Other semi-structured data (log files)
• Where is the relevant information?

ML approach: Model using classifiers
Leverage all the data collected in the infrastructure to:
1) Use classifiers: F(M) -> SLO state
2) Use classification accuracy as the measure of success
3) Use feature selection to find the metrics most predictive of the SLO state

But we need an explanation, not just classification accuracy…
• Our approach: learn the joint probability distribution P(M, SLO) (Bayesian network classifiers).
• Inferences ("metric attribution"), based on P(M | SLO):
− Normal: the metric has a value associated with healthy behavior
− Abnormal: the metric has a value associated with unhealthy behavior
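As an illustration, here is a minimal sketch (not the SLIC implementation) of the idea: a naive Bayes classifier over per-interval metrics, with metric attribution done by comparing each metric's likelihood under the healthy vs. unhealthy class. The metric names and data are hypothetical, and the talk's Bayesian network classifiers relax the independence assumption made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: rows are 5-minute intervals, columns are metrics (hypothetical
# names); label 1 = SLO violation ("unhealthy" state).
metrics = ["cpu_util", "mem_util", "db_queue_len", "net_errs"]
X_healthy = rng.normal(loc=[30, 40, 5, 1], scale=5, size=(500, 4))
X_bad = rng.normal(loc=[30, 40, 40, 1], scale=5, size=(100, 4))  # DB queue drives violations
X = np.vstack([X_healthy, X_bad])
y = np.concatenate([np.zeros(500), np.ones(100)])

# Naive Bayes with per-class Gaussians: an estimate of P(m_i | SLO).
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
vars_ = np.array([X[y == c].var(axis=0) + 1e-6 for c in (0, 1)])
priors = np.array([(y == c).mean() for c in (0, 1)])

def log_lik(x):
    """Per-class, per-metric Gaussian log-likelihoods, shape (2, n_metrics)."""
    return -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_)

def classify_and_attribute(x):
    ll = log_lik(x)
    log_post = np.log(priors) + ll.sum(axis=1)   # F(M) -> SLO state
    state = int(np.argmax(log_post))
    # Metric attribution: flag a metric as "abnormal" if its value is more
    # likely under the unhealthy class than under the healthy one.
    abnormal = ll[1] > ll[0]
    return state, [m for m, a in zip(metrics, abnormal) if a]

state, culprits = classify_and_attribute(np.array([31, 41, 38, 1]))
print("SLO violated:", bool(state), "| attributed metrics:", culprits)
```

Feature selection falls out naturally in this picture: metrics whose class-conditional distributions barely differ contribute nothing to the posterior and can be dropped.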
Bayesian network classifiers: Results
[Figure: Bayesian network with an SLO-state node and metric nodes M3, M5, M8, M30, M32.]
• "Fast" (in the context of 1-5 minute data collection):
− Models take 2-10 seconds to train on days' worth of data
− Metric attribution takes 1-10 ms to compute
• On the order of 3-10 metrics (out of hundreds) are needed to accurately capture a performance problem
• Accuracy is high (~90%)
• Experiments showed the metrics are useful for diagnosing certain problems on real systems
• It is hard to capture multiple types of performance problems with a single model!

Additional issues
• How much data is needed to get accurate models?
• How do we detect model validity?
• How should models/results be presented to operators?

Challenge 2: Adaptation
• Systems and applications change
• The reasons for performance problems change over time (and sometimes recur)
• Is it a different problem or the same problem? This is learning with "concept drift".

Adaptation: Possible approaches
• Single omniscient model: "train once, use forever"
− Assumes the training data provides all information
• Online updating of the model
− E.g., parameter/structure updating of Bayesian networks; online learning of neural networks, support vector machines, etc.
− Potentially wasteful retraining when similar problems recur
• Maintain an ensemble of models
− Requires criteria for choosing the subset of models used in inference
− Criteria for adding new models to the ensemble
− Criteria for removing models from the ensemble

Our approach: Managing an ensemble of models for our classification approach
• Construction:
1. Periodically induce a new model
2. Check whether the model adds new information (classification accuracy)
3. Update the ensemble of models
• Inference: use the Brier score to select among the models.
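A minimal sketch of the winner-takes-all idea, under assumptions of my own (a sliding-window Brier score per model; two toy threshold "models"): the model with the best recent Brier score classifies the next sample, and every model is re-scored once the true SLO state becomes known.

```python
from collections import deque

class Member:
    """A fitted probabilistic classifier plus its sliding-window Brier score."""
    def __init__(self, predict_proba, window=50):
        self.predict_proba = predict_proba   # x -> P(SLO violation | x)
        self.sq_errs = deque(maxlen=window)  # squared errors on recent samples

    def brier(self):
        # Mean squared error of predicted probabilities; lower is better.
        # An untested member gets the worst possible score.
        return sum(self.sq_errs) / len(self.sq_errs) if self.sq_errs else 1.0

    def observe(self, x, label):
        self.sq_errs.append((self.predict_proba(x) - label) ** 2)

def classify(ensemble, x):
    """Winner takes all: the member with the lowest Brier score predicts."""
    best = min(ensemble, key=lambda m: m.brier())
    return best.predict_proba(x) > 0.5

# Two hypothetical members, each induced during a different problem period.
ensemble = [
    Member(lambda x: 0.9 if x["db_queue_len"] > 20 else 0.1),
    Member(lambda x: 0.9 if x["mem_util"] > 80 else 0.1),
]

stream = [({"db_queue_len": 30, "mem_util": 50}, 1),
          ({"db_queue_len": 5, "mem_util": 90}, 1),
          ({"db_queue_len": 5, "mem_util": 40}, 0)]
for x, label in stream:
    pred = classify(ensemble, x)
    for m in ensemble:          # feedback once the true SLO state is known
        m.observe(x, label)
    print(pred, label)
```

The two design choices in this sketch, Brier-score selection and winner-takes-all, are the ones the results below justify.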
Adaptation: Results

  Method                                                  Accuracy (%)   Total processing time (mins)
  Single model: no adaptation                                 61.4              0.2
  Single model trained with all history (no forgetting)       82.4             71.5
  Single model with sliding window                            84.2              0.9
  Ensemble of models                                          90.7              7.1

• ~7,500 samples at 5 minutes/sample (one month), ~70 metrics
• Classifying a sample with the ensemble of BNCs: used the model with the best Brier score to predict the class (winner takes all)
• The Brier score was better than other selection measures (e.g., accuracy, likelihood)
• Winner-takes-all was more accurate than other combination approaches (e.g., majority voting)

Adaptation: Results (cont.)
• A "single adaptive" model is slower to adapt to recurrent issues
− It must re-learn the behavior instead of just selecting a previous model

Additional issues
• Need criteria for "aging" models
• Periods of "good" behavior also change: we need robustness to those changes as well

Challenge 3: Leveraging history
• It would be great to have a system that annotates each problem with, e.g.:
− Diagnosis: stuck thread due to insufficient database connections
− Repair: increase connections (+6)
− Periods: …
− Severity: SLO time increases up to 10 seconds
− Location: Americas; not seen in Asia/Pacific

Leveraging history
• Main challenge: find a representation (signature) that captures the main characteristics of the system behavior and is:
− Amenable to distance metrics
− Generated automatically
− In machine-readable form

Our approach to defining signatures
1) Learn probabilistic classifiers (models of P(SLO, M))
2) Inference: metric attribution yields the list of abnormal metrics (e.g., app CPU utilization high, app alive-process count high, app active-process count high, DB CPU utilization high)
3) Define these as the signatures of the problems

Example: Defining a signature
• For a given SLO violation, the models provide the list of metrics attributed with the violation.
• In the signature, a metric has the value 1 if it is attributed with the violation, -1 if it is not attributed, and 0 if it is not relevant.
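A toy sketch of the retrieval idea, under my own assumptions (hand-made signatures and plain L1 distance; the metric names and diagnoses are invented): encode each violation epoch as a {1, -1, 0} vector over metrics and return the closest annotated past problems.

```python
import numpy as np

metrics = ["app_cpu", "app_alive_procs", "app_active_procs", "db_cpu", "mem_util"]

# Hypothetical signature database: attribution vectors (1 = attributed,
# -1 = not attributed, 0 = not relevant) with the operator's diagnosis.
history = [
    (np.array([1, 1, 1, -1, 0]), "stuck thread: insufficient DB connections"),
    (np.array([-1, -1, 1, 1, 0]), "DB overload during nightly batch job"),
    (np.array([1, -1, -1, -1, 1]), "memory leak in app server"),
]

def retrieve(query, k=2):
    """Return the k past problems whose signatures are closest in L1 distance."""
    scored = [(int(np.abs(query - sig).sum()), label) for sig, label in history]
    return sorted(scored)[:k]

# Attribution vector produced by the live models for a new violation:
new_violation = np.array([1, 1, -1, -1, 0])
for dist, diagnosis in retrieve(new_violation):
    print(dist, diagnosis)   # nearest neighbor: the stuck-thread problem
```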
Results: With signatures…
• We were able to accurately retrieve past occurrences of similar performance problems, together with their diagnosis efforts (e.g., the stuck-thread record shown earlier, with its diagnosis, repair, severity, and location: Americas, not seen in Asia/Pacific)
• ML technique: information retrieval

Results: Retrieval accuracy
[Figure: precision-recall curve for retrieval of the "stuck thread" problem, with the ideal P-R curve for comparison; top 100: 92 vs. 51.]

Results: With signatures we can also…
• Automatically identify groups of different problems and their severity
• Identify which problems are recurrent
• ML technique: clustering

Additional issues
• Can we generalize and abstract signatures across different systems/applications?
• How do we incorporate human feedback for retrieval and clustering?
− Semi-supervised learning: results not shown today

Challenge 4: Combining multiple data sources
• We have a lot of semi-structured text logs, e.g.:
− Problem tickets
− Event/error logs (application/system/security/network…)
− Other logs (e.g., operator actions)
• Logs can help obtain more accurate diagnoses and models; sometimes system/application metrics are not enough
• Challenges:
− Transforming logs into "features": information extraction
− Doing it efficiently!

Properties of logs
• Log events have relatively short text messages
• Much of the diversity in messages comes from different "parameters" (dates, machine/component names); the core text is less unique than free text
• The number of events can be huge (e.g., >100 million events per day for large IT systems)
• Processing events must compress the logs significantly, and do so efficiently!

Our approach: Processing application error logs
Example entries (abridged):
2006-02-26T00:00:06.461 ES_Domain:ES_hpat615_01:2257913:Thread43.ES82|commandchain.BaseErrorHandler.logException()|FUNCTIONAL|0||FatalException occurred type=com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException, message=Connection timed out, class=com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|2706||KNIGHT system unavailable: java.io.IOException
…
• Over 4,000,000 error log entries; 200,000+ distinct error messages
• Similarity-based sequential clustering reduces these to 190 "feature messages" (a significant reduction: 200,000+ → 190)
• Use the count of appearances of each feature message over 5-minute intervals as metrics for learning
• Accurate: the clustering results were validated with a hierarchical tree clustering algorithm
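The slides do not spell the clustering out, so here is a small sketch of one plausible reading (my assumptions: parameter-looking tokens are masked first, similarity is token-set Jaccard, and clustering is a greedy single pass): each incoming message joins the first cluster whose representative is similar enough, otherwise it starts a new "feature message".

```python
import re

def normalize(msg):
    """Mask parameter-like tokens (timestamps, numbers) so that messages
    differing only in parameters share the same core token set."""
    msg = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.]+", "<TS>", msg)
    msg = re.sub(r"\b\d+\b", "<NUM>", msg)
    return set(msg.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(messages, threshold=0.7):
    """Greedy single-pass similarity clustering: O(#messages * #clusters),
    workable because #clusters stays small (190 in the talk)."""
    reps, counts = [], []              # one representative token set per cluster
    for msg in messages:
        tokens = normalize(msg)
        for i, rep in enumerate(reps):
            if jaccard(tokens, rep) >= threshold:
                counts[i] += 1
                break
        else:                          # no existing cluster is close enough
            reps.append(tokens)
            counts.append(1)
    return reps, counts

logs = [
    "2006-02-26T00:00:06.461 FatalException occurred, connection timed out to host 17",
    "2006-02-26T00:01:12.003 FatalException occurred, connection timed out to host 42",
    "2006-02-26T00:02:30.555 KNIGHT system unavailable: java.io.IOException",
]
reps, counts = cluster(logs)
print(len(reps), "feature messages; counts per cluster:", counts)  # 2 clusters
```

Building the per-interval metrics is then a matter of bucketing each message's cluster id by its 5-minute timestamp and counting.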
Learning probabilistic models
• Construct probabilistic models of the log-based metrics using a "hybrid gamma distribution": a Gamma distribution with a point mass at zero, since in many 5-minute intervals a feature message never appears.
[Figure: PDF of the hybrid gamma distribution over the number of appearances.]
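A sketch of one way to realize the "Gamma with zeros" idea (my assumptions: the zero mass equals the empirical fraction of zero counts, and the positive part is a maximum-likelihood Gamma fit via scipy):

```python
import numpy as np
from scipy import stats

def fit_hybrid_gamma(counts):
    """Fit P(x) = p0 * 1{x = 0} + (1 - p0) * Gamma(a, scale) for x > 0."""
    counts = np.asarray(counts, dtype=float)
    p0 = np.mean(counts == 0)                        # point mass at zero
    a, _, scale = stats.gamma.fit(counts[counts > 0], floc=0)
    return p0, a, scale

def hybrid_pdf(x, p0, a, scale):
    """Mass p0 at zero; the Gamma density, scaled by (1 - p0), elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(x == 0, p0, (1 - p0) * stats.gamma.pdf(x, a, scale=scale))

# Hypothetical 5-minute appearance counts of one feature message:
rng = np.random.default_rng(1)
counts = np.concatenate([np.zeros(300), rng.gamma(2.0, 5.0, size=100)])
p0, a, scale = fit_hybrid_gamma(counts)
print(f"P(zero)={p0:.2f}, shape={a:.2f}, scale={scale:.2f}")
print("density at 10 appearances:", hybrid_pdf(10, p0, a, scale))
```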
Results: Adding log-based metrics
• Signatures built from error-log metrics pointed to the right causes in 4 out of 5 "high"-severity incidents in the past 2 months
− In these cases, the system metrics were not related to the problems
• Example:
− From the operator incident report (diagnosis and solution): "Unable to start SWAT wrapper. Disk usage reached 100%. Cleaned up disk and restarted the wrapper…"
− From the application error log: "CORBA access failure: IDL:hpsewrapper/SystemNotAvailableException:… com.hp.es.wrapper.corba.hpsewrapper.SystemNotAvailableException"

Additional issues
• With multiple instances of an application, how do we do joint, efficient processing of the logs?
• Treating events as sequences in time could lead to better accuracy and compression

Challenge 5: Scaling up machine learning techniques
• Large-scale distributed applications have various levels of dependencies
− Multiple instances of components
− Shared resources (DB, network, software components)
− Thousands to millions of metrics (features)
[Figure: dependency graph over components A through E.]

Challenge 5: Possible approaches
• "Scalable" approach: ignore dependencies between components
− Putting our heads in the sand?
− See Werner Vogels' (Amazon's CTO) thoughts on it…
• Centralized approach: use all available data together to build models
− Not scalable
• A different approach: transfer models, not metrics
− Good for components that are similar and/or have similar measurements

Example: Diagnosis with multiple instances
• Method 1: diagnose multiple instances by sharing measurement data (metrics)
[Figures: two instances, A and B, exchanging metrics; then a larger deployment with instances A through H.]
• Method 2: diagnose multiple instances by sharing learning experience (models)
− A form of transfer learning
[Figures: the same instances exchanging trained models instead of raw metrics.]

Metric exchange: Does it help?
[Figure: online prediction over time epochs for two load-balanced instances, showing violation detection and false alarms with and without building models from the other instance's metrics.]
• Observation: metric exchange does not improve model performance for load-balanced instances

Model exchange: Does it help?
[Figure: online prediction over time epochs when applying models trained on other instances; violation-detection and false-alarm curves with and without model exchange.]
• Models imported from other instances improve violation detection accuracy
• Observation 1: model exchange enables quicker recognition of previously unseen problem types
• Observation 2: model exchange reduces model training cost

Additional issues
• How do/can we do transfer learning across similar but not identical instances?
• We need more efficient methods for detecting which data is required from related components during diagnosis

Providing diagnosis as a web service: SLIC's IT-Rover
• Monitored services feed metric/SLO monitoring data into a signature-construction engine, a signature DB, a retrieval engine, a clustering engine, and an admin engine.
• A centralized diagnosis web service allows:
− Retrieval across different data centers, different services, and possibly different companies
− Fast deployment of new algorithms
− Better understanding of real problems for further development of algorithms
− The value of the portal is in the information ("Google" for systems)

Discussion: Additional issues, opportunities, and challenges
• Beyond the "black box": using domain knowledge
− Expert knowledge
− Topology information
− Known dependencies and causal relationships between components
• Provide solutions in cases where SLOs are not known
− Learn the relationship between business objectives and IT performance
− Anomaly detection methods with feedback mechanisms
• Beyond diagnosis: automated control and decision making
− HP Labs work on applying adaptive controllers to control systems/applications
− IBM labs work using reinforcement learning for resource allocation

Summary
• Presented several challenges at the intersection of machine learning and automated IT diagnosis
• A relatively new area for machine learning and data mining researchers and practitioners
• Many more opportunities and challenges ahead, both research- and product/business-wise
Read more: www.hpl.hp.com/research/slic (SOSP-05, DSN-05, HotOS-05, KDD-05, OSDI-04)

Publications:
• Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, "Capturing, Indexing, Clustering, and Retrieving System History", SOSP 2005.
• Rob Powers, Ira Cohen, Moises Goldszmidt, "Short term performance forecasting in enterprise systems", KDD 2005.
• Moises Goldszmidt, Ira Cohen, Armando Fox, Steve Zhang, "Three research challenges at the intersection of machine learning, statistical induction, and systems", HotOS 2005.
• Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, Armando Fox, "Ensembles of models for automated diagnosis of system performance problems", DSN 2005.
• Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, Jeff Chase, "Correlating instrumentation data to system states: A building block for automated diagnosis and control", OSDI 2004.
• George Forman, Ira Cohen, "Beware the null hypothesis", ECML/PKDD 2005.
• Ira Cohen, Moises Goldszmidt, "Properties and Benefits of Calibrated Classifiers", ECML/PKDD 2004.
• George Forman, Ira Cohen, "Learning from Little: Comparison of Classifiers given Little Training", ECML/PKDD 2004.