A Statistical Learning Approach to Diagnosing eBay’s Site
Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer
[email protected]

Motivation
Fast failure detection and diagnosis are critical to high availability
  – But, exact root cause may not be required for many recovery techniques
Many potential causes of failures
  – Software bugs, hardware, configuration, network, database, etc.
  – Manual diagnosis is slow and inconsistent
Statistical approaches are ideal
  – Simultaneously examining many possible causes of failures
  – Robust to noise
Jan 12, 2004 Path-based Diagnosis Slide 2

Challenges
Lots of (noisy) data
Near real-time detection and diagnosis
Multiple independent failures
Root cause might not be captured in logs

Talk Outline
Introduction
eBay’s infrastructure
3 statistical approaches
Early results

eBay’s Infrastructure
2 physical tiers
  – Web server/app server + DB
  – Migrating to Java (WebSphere) from C++
SuperCAL (Centralized Application Logging)
  – API for app developer to log anything to CAL
  – Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
  – Supports nested txns
  – A path can be identified via thread ID + host ID

SuperCAL Architecture
[Figure: architecture diagram with labels: App Servers, LB Switch, Real-time msg bus, detection, diagnosis]
Stats
  – 2K app servers, 40 SuperCAL machines
  – 1B URLs/day
  – 1TB raw logs/day (150GB gzipped), 200Mbps peak

Failure Analysis
Summarize each transaction into:

  ID  Type  Name          Pool  Host  Version  DB                      Status
  1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB, …   NullPointer
  2   URL   Bid           Cgi2  231   1.0.3    PriceDB                 Success
  3   XML   …             …     …     …        …                       …

  (Type through DB are the features; Status is the class.)

What features are causing requests to fail?
  – Txn type, txn name, pool, host, version, DB, or a combination of these?
  – Different causes require different recovery techniques

3 Approaches
Machine learning
  – Decision trees
  – MinEntropy (eBay’s greedy variant of decision trees)
Data mining
  – Association rules

Decision Trees
Classifiers developed in the statistical machine learning field
Example: go skiing tomorrow?
[Figure: example decision trees splitting on snow (New snow / No new snow) and weather (Sunny / Cloudy), with Y/N leaves]
“learning” => inferring the decision tree rules from data

Decision Trees
Feature selection
  – Look for features that best separate the classes
  – Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)

  TxnName       Failed        Machine  Failed
  MyEBay        636           Attila   2985
  MyEBaySeller  512           Lenin    20
  MyEBayLogin   736           Marcus   4
  …             …             Scipio   5
                              …        …

The goal of a decision tree algorithm
  – to split nodes until leaves are “pure” enough or until no further split is possible
     • i.e. pure => all data points have the same class label
  – Use pruning heuristics to control over-fitting

Decision Trees – Sample Output
Counts in parentheses are (correct, incorrect):

  Pool = icgi1
  |   TxnName = LeaveFeedback: failed (8,1)
  |   TxnName = MyFeedback: failed (205,3)
  Pool = icgi2
  |   TxnName = Respond: failed (1)
  |   TxnName = ViewFeedback: failed (3554,52)

Naïve diagnosis:
1. Pool=icgi1 and TxnName=LeaveFeedback
2. Pool=icgi1 and TxnName=MyFeedback
3. Pool=icgi2 and TxnName=Respond
4. Pool=icgi2 and TxnName=ViewFeedback
[Figure: the same tree drawn with pools icgi1 and icgi2 as branches and leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]

Feature Selection Heuristics
1. Ignore leaf nodes with no failed transactions
2. Problem: noisy leaves
   – keep the top N leaves, or ignore nodes with < M% failures
3. Problem: features may not be independent
   – drop ancestor nodes that are “subsumed” by the leaves
4. Rank by impact
   – sort the predicted causes by failure count
[Figure: the sample tree pruned step by step by the heuristics above; node labels include icgi1, icgi2, LeaveFdbk, MyFdbk, ViewFdbk, and Respond, with counts 8, 205, 1, and 3554]

MinEntropy
Entropy measures the randomness of data
  – E.g. if failures are evenly distributed (very random), then entropy is high
Rank features by the normalized entropy
  – Greedy approach searches for the leaf node with the most failures
Always produces exactly one diagnosis
Deployed on the entire eBay site
  – Sends real-time alerts to ops
  – Pros: fast (<1s for 100K txns and scales linearly)
  – Cons: optimized for single faults

MinEntropy example

  TxnType  Errors      TxnName       Errors      Pool  Errors
  URL      4350        MyEBay        636         Cgi0  12
  SQL      47          MyEBaySeller  512         Cgi1  4002
  EMAIL    12          MyEBayLogin   736         Cgi2  30
  XSLT     0           …             …           Cgi3  8
  …        …                                     Cgi4  5
                                                 …     …

  Machine  Errors      Version  Errors
  Attila   1985        E293     3987
  Lenin    2002        E291     15
  Marcus   4
  Scipio   0
  …        …

Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1

Association Rules
Data mining technique to compute item sets
  – e.g. Shoppers who bought this item also shopped for …
Metrics
  – Confidence: (# of A & B) / (# of A)
     • Conditional probability of B given A
  – Support: (# of A & B) / (total # of txns)
Generates rules for all possible sets
  – e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
Applied to failure diagnosis
  – Find all rules that have a failed status on the right, then rank by conf
  – Pros: looks at combinations of features
  – Cons: generates many rules

Association Rules – Sample Output
Sample output (rules containing failures):

  TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

Problem: features may not be independent
  – e.g. all LeaveFeedback txns are of type URL
  – Drop rules that are subsumed by more specific rules
Diagnosis: TxnName=LeaveFeedback

Experimental Setup
Dataset
  – About 1/8 of the whole site
  – Number of distinct values per feature:

    Type  Name  Pool  Machine  Version  Database  Status
    10    300   15    260      7        40        8

  – 10 one-minute traces, 4 with 2 concurrent faults
     • total of 14 independent faults

    Fault(s) in trace  Host  DB  Host, Host  Host, DB  Host, SW  DB, SW
    # of traces        2     4   1           1         1         1

  – True faults identified through post-mortems, ops chat logs, application logs, etc.
Metrics
  – Precision: (# of correctly identified faults) / (# of predicted faults)
  – Recall: (# of correctly identified faults) / (# of true faults)

Results: DBs in Dataset
True causes for DB-related failures are captured in the dataset
  – Variable number of DBs used by each txn
Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering – ignore nodes with < M% failures (in this case, M = 10)
3. Path trimming – drop ancestor nodes subsumed by the leaf nodes
[Figure: bar chart of recall and precision (0% to 100%) for C4.5 naïve, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming)]

Results: DBs not in Dataset
True cause not captured for DB-related failures
[Figure: bar chart of precision and recall (0% to 100%) for C4.5, MinEntropy, Association Rules (N=5), and Association Rules (N=10)]
C4.5 suffers from the unbalanced dataset
  – i.e. produces a single rule that predicts every txn to be successful

What’s next?
ROC curves
  – show tradeoff between precision and recall
Transient failures
  – Up-sample to balance dataset or use cost matrix
Some measure of the “confidence” of the prediction
More data points
  – Have 20hrs of logs that have failures

Open Questions
How to deal with multiple symptoms?
  – E.g. DB outage causing multiple types of requests to fail
  – Treat it as multiple failures?
Failure importance (count vs. rate)
  – Two failures may have similar failure count
  – Low volume and higher failure rate vs. high volume and lower failure rate
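
Illustration: count vs. rate
The count-versus-rate question can be made concrete with the confidence and support metrics from the Association Rules slide. The following is a minimal Python sketch, not the authors' implementation. The transaction records, the pool names, and the helper rule_metrics are hypothetical and chosen only for illustration. For a rule "Pool=p => Status=Failed", confidence is the failure rate of pool p, and the failure count equals support times the total number of txns.

```python
from collections import Counter

# Hypothetical toy transactions as (pool, status) pairs; not real eBay data.
transactions = (
    [("cgi1", "Failed")] * 30 + [("cgi1", "Success")] * 970  # high volume, lower failure rate
    + [("cgi2", "Failed")] * 20 + [("cgi2", "Success")] * 5  # low volume, higher failure rate
)


def rule_metrics(txns):
    """Return {pool: (failure_count, confidence, support)} for rules Pool=p => Status=Failed."""
    total = len(txns)
    seen = Counter(pool for pool, _ in txns)  # of A
    failed = Counter(pool for pool, status in txns if status == "Failed")  # of A & B
    return {
        pool: (failed[pool], failed[pool] / seen[pool], failed[pool] / total)
        for pool in seen
    }


metrics = rule_metrics(transactions)

# Rank the same candidate causes two ways: by failure count and by confidence (rate).
by_count = sorted(metrics, key=lambda p: metrics[p][0], reverse=True)
by_conf = sorted(metrics, key=lambda p: metrics[p][1], reverse=True)

print("ranked by failure count:", by_count)
print("ranked by confidence:   ", by_conf)
```

On this toy data the two rankings disagree: cgi1 leads by failure count, while cgi2 leads by confidence. That is the tension the last slide raises, and the sketch does not settle which ranking is preferable.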