Evaluating Classifiers in Adversarial Environments
Alvaro Cárdenas, Fujitsu Laboratories
Dagstuhl Perspectives Workshop on Machine Learning Methods for Computer Security, September 2012

Traditional ML Evaluation Is Not Applicable in Adversarial Settings
- In traditional ML practice, algorithms are trained and evaluated under assumptions that do not hold in a security environment
- You should not depend on attack examples: if the classifier is widely adopted, the "attack class" will change its behavior, so the true positive rate must be evaluated with care
- Training data might be poisoned by an attacker
- There is a large class imbalance between normal and attack events

Previous Work on Classifier Evaluation
- Nelson, Joseph, Tygar, Rubinstein et al. (AsiaCCS 2006, AISec 2011, LEET 2008, IMC 2009): taxonomy introduction, refinement, and applications
- Kloft, Laskov (AISec 2009, AISTATS 2010): analytical attacks for robust evaluation
- Biggio, Fumera, Giacinto, Roli et al. (MCS 2009, 2011; IEEE SMC 2011): empirical attacks for robust evaluation
- Focus so far: generating attacks against the classifier; missing: new metrics and worst undetected attacks

Talk Outline
- Case example of classifier evaluation for electricity theft detection
- Evaluating classifiers analytically
- A metric to account for imbalanced datasets
- Future work on big data

Key Points of the Electricity Theft Use Case
- You cannot evaluate algorithms on detection rate; evaluate them on their effectiveness against the worst undetected attacks
- Ignore the true positive rate metric: no ROC or PR curves, AUC, accuracy, F-score, etc.
- Instead, assume we always get attacked and measure the cost of the worst undetected attacks
- Study the asymptotic behavior of data-poisoning attacks
- More info: Mashima, Cárdenas, RAID 2012; Cárdenas et al.,
AsiaCCS 2011 – worst undetected attacks for industrial control systems

Smart Grid Goals
- Efficiency: optimal use of assets (load shaping instead of load following); green: integrate renewable generation
- Reliability: real-time, fine-grained state of the grid used to anticipate faults and provide better control
- Customer choice: transparency (fine-grained energy usage, prices, proportion of green generation, etc.) and smart appliances automated based on consumer preferences

Advanced Metering Infrastructure (AMI)
- Replacing old mechanical electricity meters with new digital meters
- Enables frequent, periodic two-way communication between utilities and homes
- Architecture: smart meters → repeaters → gateway (GW) → data collection → metering server

Motivation for ML in AMI for Security
- Pushback on prices: billions of low-cost embedded devices cannot have fancy tamper protection
- Security is hard to see, but situational awareness is fun to see: understand the health of the system and identify anomalies
- AMI gives us more data on electricity consumption: construct models of "normal" consumption, and data analytics can identify suspicious behavior

Focus on Electricity Theft
- Annex parties: developed nations (Europe, North America, Japan); non-Annex parties: the rest
- Source: Investment and Financial Flows to Address Climate Change,
United Nations.
- Attacks will happen: devices are deployed for 20–30 years

Anomaly Detection Architecture
- Smart meters send consumption data frequently (e.g., every 15 minutes) to the utility
- Data path: houses with meters → collector → routers (at substations) → fiber-optic network → utility private cloud (meter data repository, data analytics and anomaly detection, storage)

Case Study: Detection of Electricity Theft
- Hardware detection: balance meters and tamper-evident seals
- Software detection: usage profiles

Anomaly Detection Algorithms Trained Without Attack Data
- Hypothesis testing: we have prior knowledge of an attack invariant (attackers want to lower the reported energy consumption) and include this information in the "bad" class: H0: P0 vs. H1: any P such that E_P[Y] < E_{P0}[Y]; detectors: ARMA-GLR, CUSUM, EWMA
- Unsupervised learning on unlabeled data: outlier detection (e.g., LOF); problems: easier to attack and more false positives

ARMA-GLR Detector
- We need to detect attack signals that are not only different from the historical ARMA model, but that also lower the reported electricity consumption
- Given a sequence of observations Y_1, ..., Y_n, calculate the likelihoods of H0 (normal) and H1 (attack)
- Model H0 as an autoregressive moving average (ARMA) model P0, so the test is H0: P0 vs. H1: P such that E_P[Y] < E_{P0}[Y]
- In ARMA models, a change in the mean can be modeled as
  Y_{k+1} = sum_{i=1}^{p} A_i Y_{k-i} + sum_{j=0}^{q} B_j (V_{k-j} + mu),
  with mu = 0 under H0, and mu = -nu, nu > 0, under H1

Selecting the Attack Probability Model
- We do not know the magnitude of the attack, so we cannot compute the likelihood of H1 until we deal with this uncertainty
- Idea: use the generalized likelihood ratio (GLR) test: among the class of "attack" distributions, find the one that best matches the observations

Evaluation
- Most machine learning evaluations assume a pool of negative examples and a pool of positive examples to evaluate the tradeoff between false alarms vs.
detection rate.

Problem: We Do Not Have Positive Examples
- Because the meters were just deployed, we do not have examples of "attacks"
- Our proposal: find the worst possible undetected attack for each classifier, then compute the cost (kWh lost) of these attacks

Adversary Model
- The meter observes the real consumption Y_1, ..., Y_n but reports fake readings Ŷ_1, ..., Ŷ_n to the utility
- Goal of the attacker: minimize the energy bill, i.e., minimize sum_{i=1}^{n} Ŷ_i over Ŷ_1, ..., Ŷ_n
- Constraint: not being detected by classifier C, i.e., C(Ŷ_1, ..., Ŷ_n) = normal

[Figure: real vs. attack signals; electricity usage per time slot of day]

New Tradeoff Curve: No Detection Rates (can be extended to other fields)
- Y-axis: cost of undetected attacks (average loss per attack, in Wh); X-axis: false positive rate
- [Figure: average loss per attack vs. false positive rate for the ARMA-GLR, average, CUSUM, EWMA, and LOF detectors]

False Alarms Because of Concept Drift
- Electricity consumption is a non-stationary distribution, so we have to retrain models
- [Figure: electricity consumption over time (hours), with the classifier retrained to account for concept drift]

Asymptotic Effects of Poisoning Attacks
- With online learning over drifting data, an attacker can use undetected attacks to poison the training data: "valid" electricity consumption is gradually mixed with undetected attacks

Detecting Poisoning Attacks
- Identify concept-drift trends that help an attacker: lower electricity consumption over time
- Countermeasure: linear regression of the trend
- The slope of the regression was not a good discriminant, but determination coefficients worked
- [Figure: determination coefficient vs. slope of the regression line]
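The countermeasure above can be sketched numerically. This is a minimal illustration with made-up consumption values (not the paper's data, and the function name is mine): fit a least-squares line to each user's consumption history and return the slope and the determination coefficient (R^2), showing why R^2 separates a poisoning attacker's smooth decline from an honest user's noisy but flat usage.

```python
import numpy as np

def trend_features(consumption):
    """Fit y = a*t + b by least squares and return (slope, R^2).
    A steady decline (negative slope with R^2 near 1) is the signature
    of an attacker slowly poisoning the training data."""
    y = np.asarray(consumption, dtype=float)
    t = np.arange(len(y))
    a, b = np.polyfit(t, y, 1)            # slope and intercept
    residuals = y - (a * t + b)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return a, 1.0 - ss_res / ss_tot

honest = [50, 47, 52, 49, 51, 48, 50, 53]    # noisy but flat usage
poisoner = [50, 48, 46, 44, 42, 40, 38, 36]  # smooth, steady decline

print(trend_features(honest))    # slope ~ 0.31, R^2 ~ 0.14: no clear trend
print(trend_features(poisoner))  # slope = -2.0, R^2 ~ 1.0: strong downward trend
```

A threshold on R^2 (flagging users whose consumption trend has high R^2 and negative slope) is one way to operationalize the slide's finding; the threshold itself would have to be calibrated on real data.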
[Figure legend: honest users, original attackers, attack]

Talk Outline
- Case example of classifier evaluation for electricity theft detection
- Evaluating classifiers analytically: game theory and adversarial classification (Oakland 2006, AAAI 2006, NIPS 2008, Infocom 2007, ToN 2009)
- A metric to account for imbalanced datasets
- Future work on big data

Adversarial Classification Is a Game
- Traditional ML: nature makes the first move (the statistics of the classes); the classifier makes the second move (the optimal classifier for the given statistical properties)
- Adversarial ML: the classifier makes the first move (the optimal classifier for the normal class and guessed attacks); the attacker makes the second move (modifying the attack after seeing the classifier)
- There is a lot of previous work on game theory here; I will focus on evaluation

How Can We Ensure That Classifier Performance in Deployment Is the Same as (or Better than) the Metric During Design?
- The minimax strategy is the safety level of the game for player 1: the smallest worst-case error Φ an attacker can inflict on the system
- Step 1: maximize Φ(D, A) over the attack parameter A, for each classifier D in a set of possible classifiers
- Step 2: minimize Φ(D, A*) over D
- This gives a provable bound on the worst performance: D is "secure" if Φ < m for every A

Example: Game Theory in ROC Curves
- Goal: select the classifier that minimizes the probability of error when the attacker has control of the prior (Oakland 2006)
- The attacker has the second move: if you select classifier 1 (h1), the attacker selects p1, with Pr[Error | h1] = 0.3; if you select classifier 2 (h2), the attacker selects p2, with Pr[Error | h2] = 0.4
- Can we do better than selecting classifier 1?
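The game above can be made concrete with a small sketch. The error rates 0.3 and 0.4 come from the slide; the other two entries of the error matrix are hypothetical values added only to complete the example, and the function name is mine. With no saddle point, the defender's minimax strategy randomizes between h1 and h2 so that the attacker is indifferent between p1 and p2:

```python
import numpy as np

# E[i][j] = Pr[Error | defender plays h_(i+1), attacker responds p_(j+1)].
# E[0][0] = 0.3 and E[1][1] = 0.4 are from the talk; the off-diagonal
# entries are hypothetical, for illustration only.
E = np.array([[0.3, 0.1],
              [0.2, 0.4]])

def minimax_mixed_2x2(E):
    """Defender's optimal mixed strategy for a 2x2 game without a saddle
    point: play h1 with probability q chosen to equalize the attacker's
    payoff across both responses."""
    (a, b), (c, d) = E
    q = (d - c) / ((a - b) + (d - c))   # indifference condition
    q = min(1.0, max(0.0, q))
    value = max(q * a + (1 - q) * c,    # attacker's best response payoff
                q * b + (1 - q) * d)
    return q, value

best_pure = E.max(axis=1).min()         # 0.3: worst case of classifier 1
q, value = minimax_mixed_2x2(E)
print(best_pure, round(q, 6), round(value, 6))  # 0.3 0.5 0.25
```

Against the coin-flip strategy the attacker can force at most a 0.25 error probability, beating the 0.3 achievable with the best pure classifier: randomizing over classifiers (a mixed strategy) strictly helps.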
Obtaining the Full ROC Curve
- Known to Neyman and Pearson (1933); the ROC convex hull (ROCCH) was popularized by Provost and Fawcett (2001)

Intermission: the "Real" ROC Curve
- Barreno, Cárdenas, Tygar, NIPS 2007/08
- Theorem: in general, the optimal ROC curve requires randomizing among the n classifier rules
- ROC randomization = mixed strategy
- In general it is easier to reverse the optimization order: minimax = maximin iff there is a saddle point

Example: MAC-Layer Misbehavior
- Measured backoff distributions of off-the-shelf devices (ASUS access point, Centrino, Digicom, D-Link, Linksys) vs. the expected backoff distribution (Bianchi et al., Infocom 2007)

Adversary Model
- Lemma: the probability that the adversary accesses the channel is greater than a gain G; the attack pmfs p1 must then belong to a constrained set
- Analytical form for the optimal attack: the optimal p1 is given in closed form, where r is the solution to a fixed-point equation
- Attackers have not figured out the optimal attack: the measured distributions (ASUS, D-Link, Centrino, Digicom, Linksys) match neither the expected distribution nor the optimal-attack distribution
- The SPRT outperforms previous solutions in both the expected time to detect misbehavior and the expected time to a false positive

Talk Outline
- Case example of classifier evaluation for electricity theft detection
- Evaluating classifiers analytically
- A metric to account for imbalanced datasets
- Future work on big data

Polygraph Test
- A national-security organization with 10,000 employees has one traitor (intruder)
- Assume Moe is tested with a "99% accurate" polygraph: Pr[A=1 | I=1] = 0.99 (P_D) and Pr[A=0 | I=0] = 0.99 (so P_F = 1 - 0.99 = 0.01)
- Moe tests positive. What is the probability that Moe is a traitor? A) 0.99 B) 0.01 C) 0.50 D) ?
- With 10,000 employees and one traitor: P_D = Pr[A=1 | I=1] = 0.99, P_F = Pr[A=1 | I=0] = 0.01

The Base-Rate Fallacy Problem
- If Pr[I=1 | A=1] ≈ 0.01 sounds counterintuitive, you are exhibiting the base-rate fallacy: ignoring base rates in your probability assessments
- Small base rates lead to the crying-wolf phenomenon
- It is easy to lie with statistics, but it is easier to lie without them.
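The polygraph quiz above is a direct application of Bayes' rule; a minimal sketch of the computation (the function name is mine):

```python
def bayesian_detection_rate(p_d, p_f, base_rate):
    """Pr[I=1 | A=1]: the probability of a real intrusion given an alarm."""
    p = base_rate
    return (p_d * p) / (p_d * p + p_f * (1 - p))

# 10,000 employees, one traitor, a "99% accurate" polygraph
posterior = bayesian_detection_rate(p_d=0.99, p_f=0.01, base_rate=1 / 10000)
print(round(posterior, 4))  # 0.0098: about 1%, so the answer is B
```

At this base rate a positive test is wrong about 99% of the time, despite P_D = 0.99 and P_F = 0.01; sweeping the base rate through this formula is exactly what the IDOC (B-ROC) curve visualizes.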
– Dan Geer

Base-rate fallacy in IDS: Axelsson, TISSEC 2000

Alternative Metrics to Evaluate IDS?
- P_F can be misinterpreted
- Count the number of alarms? Good, but heuristic
- Replace the ROC with a graph of the posterior probability; problems: how do you maximize the posterior probability, and the probability of attack p is uncertain
- Precision-recall curves are always computed empirically, with the base rate taken from the data set

Evaluating the Performance of Intrusion Detection Systems

  Metric                                    Field
  ROC                                       Signal processing
  Cost-sensitive evaluation (Bayes risk)    Decision theory / operations research
  Intrusion detection capability (C_ID)     Information theory
  Bayesian detection rate, Pr[I|A] = PPV    -
  Sensitivity                               Statistics
  Distinguishability                        Cryptography

New Metric: IDOC (B-ROC)
- Tradeoff between P_D = Pr[A=1 | I=1] and Pr[I=1 | A=1] for different base rates

Talk Outline
- Case example of classifier evaluation for electricity theft detection
- Evaluating classifiers analytically
- A metric to account for imbalanced datasets
- Future work on big data

CSA: Big Data Working Group
- CSA Big Data Working Group site: https://cloudsecurityalliance.org/research/bigdata/
- CSA Big Data LinkedIn group: http://www.linkedin.com/groups?home=&gid=4458215&trk=anet_ug_hm
- Basecamp project collaboration site request form: https://cloudsecurityalliance.org/research/basecamp/

5 Research Directions
- Big data analytics for security intelligence
- Privacy-preserving/enhancing technologies
- Big-data-scale crypto
- Big data cloud infrastructure and attack-surface reduction
- Policy and governance

The Road to Better Situational Awareness
- Intrusion detection systems: network flows, HIDS logs
- SIEM: alarm correlation
- Big data security/analytics: variety of data, security intelligence

What Is New in Big Data?
  Traditional Systems                             Big Data Promise
  More rigid, predefined schemas                  Structured and unstructured data treated seamlessly
  Data gets deleted                               Data kept for historical correlation (e.g., 10 years)
  Complex analyst queries take long to complete   Faster query response times
  Others?                                         Others?

- Hadoop is the de facto open standard for big data at rest; what about stream processing?
- Participation welcome!

How Do We Achieve These Objectives?
- Academic audience: how do I convince my peers that the evaluation of a classifier is (1) technically sound and (2) follows a scientific method that considers the actions of an attacker?
- Industry audience: how do I convince customers and industry partners that our technology is valuable? We need a business case for investing in new technology and a value proposition.