Evaluating Classifiers in
Adversarial Environments
Alvaro Cárdenas
Fujitsu Laboratories
Dagstuhl Perspectives Workshop on
Machine Learning Methods for Computer Security
September 2012
Traditional ML Evaluation is not Applicable in
Adversarial Settings
 In traditional ML practice, algorithms are trained and
evaluated under assumptions that do not hold in a
security environment
 You should not depend on attack examples
 If Classifier is widely adopted, “attack class” will change its
behavior
 True positive rate must be evaluated with care!
 Training data might be poisoned by an attacker
 Large class imbalance between normal and attack
events
Previous Work on Classifier Evaluation
 Nelson, Joseph, Tygar, Rubinstein et al.
 AsiaCCS 2006, AISec 2011, LEET 2008, IMC 2009
 Taxonomy introduction, refinement, and applications
 Kloft, Laskov
 AISec 2009, AISTATS 2010
 Analytical attacks for robust evaluation
 Biggio, Fumera, Giacinto, Roli, et al.
 MCS 2009, 2011, IEEE SMC 2011
 Empirical attacks for robust evaluation
 Focus: generating attacks against the classifier
 Missing: New metrics and worst undetected
attacks
Talk Outline
 Case example of Classifier Evaluation for
Electricity Theft Detection
 Evaluating Classifiers Analytically
 Metric to account for imbalanced datasets
 Future work on big data
Key Points of Electricity Theft Use-Case
 You cannot evaluate algorithms on detection rate, but
rather on their effectiveness against the worst
undetected attacks
 Ignore true positive rate metric
• No ROC, PR curves, AUC, Accuracy, F-score, etc.
 Instead, assume we always get attacked and
measure the cost of the worst undetected attacks
 Asymptotic behavior of data poisoning attacks
 More Info:
 Mashima, Cárdenas, RAID 2012
 Cárdenas et al., AsiaCCS 2011 – worst undetected
attacks for industrial control systems
Smart Grid Goals
 Efficiency
 Optimal use of assets: load shaping
instead of load following
 Green: integrate renewable generation
 Reliability
 Real-time, fine-grained state of the grid
used to anticipate faults and provide
better control
 Customer Choice
 Transparency: Fine-grained energy
usage, prices, proportion of green
generation, etc.
 Smart appliances automated based on
consumer preferences
Advanced Metering Infrastructure (AMI)
 Replacing old mechanical electricity meters
with new digital meters
 Enables frequent, periodic 2-way
communication between utilities and homes
[diagram: smart meters → repeaters → gateway (GW) → data collection / metering server]
Motivation for ML in AMI for Security
 Pushback on prices
 Billions of low-cost embedded devices
 Can’t have fancy tamper protection
 Security is hard to see
 But, Situational Awareness is Fun to see
 Understand the health of the system
 Identify anomalies
 AMI gives more data on electricity
consumption
 Construct models of “normal” consumption
 Data Analytics can identify suspicious behavior
Focus on Electricity Theft
[chart of electricity losses by region; legend: Annex Parties are developed nations (Europe, North America, Japan), Non-Annex Parties are the rest. Source: Investment and Financial Flows to Address Climate Change, United Nations]
 Attacks will happen: devices are deployed for 20~30 years
Anomaly Detection Architecture
 Smart meters send consumption data frequently (e.g., every 15 minutes) to the utility
[diagram: houses with smart meters → collectors/repeaters → router → fiber-optic network → substation router → utility private cloud (meter data repository, storage) → data analytics / anomaly detection]
Case Study: Detection of Electricity Theft
 Hardware: balance meters, tamper-evident seals
 Software: usage profiles, anomaly detection
Algorithms Trained without Attack Data
 Hypothesis Testing
• We have prior knowledge of an attack invariant: attackers want to lower the reported energy consumption
• Include this information in the “bad” class:
  H0: P0
  H1: P s.t. E_P[Y] < E_{P0}[Y]
• Detectors: ARMA-GLR, CUSUM, EWMA (a CUSUM sketch follows below)
 Unsupervised Learning
• Unlabeled data Y1, ..., Yn; flag outliers with an outlier-detection algorithm (e.g., LOF)
• Problems: easier to attack, more false positives
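For concreteness, a minimal sketch (not from the original slides) of a one-sided CUSUM tuned to downward shifts in reported consumption; baseline_mean, drift, and threshold are illustrative parameters:

```python
import numpy as np

def cusum_downward(y, baseline_mean, drift=0.5, threshold=5.0):
    """Flag a persistent *decrease* in reported consumption.

    One-sided (lower) CUSUM: accumulate how far readings fall below the
    expected mean, minus a drift allowance; alarm when the statistic
    crosses the threshold.  `drift` and `threshold` trade false alarms
    against detection delay.
    """
    s = 0.0
    for k, y_k in enumerate(y):
        s = max(0.0, s + (baseline_mean - y_k) - drift)
        if s > threshold:
            return k  # index of the first alarm
    return None       # no alarm raised

# Example: readings drop after sample 48 (electricity theft starts).
rng = np.random.default_rng(0)
normal = rng.normal(10.0, 1.0, 48)
attack = rng.normal(6.0, 1.0, 48)   # under-reported consumption
alarm_at = cusum_downward(np.concatenate([normal, attack]), baseline_mean=10.0)
print("first alarm at sample:", alarm_at)
```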
ARMA GLR Detector
 We need to detect attack signals that are not only different
from the historical ARMA model, but signals that lower the
reported electricity consumption
 Given a sequence of observations,
Y1 , . . . , Yn
 Calculate the likelihood of H0 (normal) and H1 (attack)
 Model H0 as an Autoregressive Moving Average (ARMA) model P0, and let H1 be any distribution P with a lower mean:
  H0: P0
  H1: P s.t. E_P[Y] < E_{P0}[Y]
 In ARMA models, a change in the mean can be modeled as
  Y_{k+1} = Σ_{i=1}^{p} A_i Y_{k-i} + Σ_{j=0}^{q} B_j (V_{k-j} + μ),
 with μ = 0 under H0 and μ = -δ, δ > 0, under H1.
Selecting the Attack Probability Model
 We do not know the magnitude of the attack
 We cannot compute the likelihood of H1 until we find a way
to deal with this uncertainty
 Idea: Use the Generalized Likelihood Ratio (GLR)
test:
 Among the class of “attack” distributions, find the one that
best matches the observation
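A minimal sketch of the GLR idea for i.i.d. Gaussian residuals with an unknown downward mean shift; the full ARMA-GLR applies the same statistic to ARMA one-step prediction errors, and here sigma and the threshold are assumed known:

```python
import numpy as np

def glr_downward_shift(residuals, sigma):
    """GLR statistic for an unknown *downward* mean shift.

    residuals: observations minus the forecast of the normal (H0)
               model, e.g. one-step-ahead prediction errors.
    Among all attack distributions (mean shifted by theta < 0), pick
    the theta that best explains the data, then report the
    log-likelihood ratio against H0 (theta = 0).
    """
    r = np.asarray(residuals, dtype=float)
    n = len(r)
    theta_hat = min(r.mean(), 0.0)      # best-fitting downward shift
    return n * theta_hat**2 / (2.0 * sigma**2)

def glr_detector(residuals, sigma, threshold):
    """Raise an alarm when the GLR statistic exceeds the threshold."""
    return glr_downward_shift(residuals, sigma) > threshold
```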
Evaluation
 Most machine-learning evaluations assume a
pool of negative examples and a pool of
positive examples to evaluate the tradeoff
between false alarms and detection rate
Problem: We Do Not Have Positive Examples
 Because meters were just deployed, we do not
have examples of “attacks”
Our Proposal:
 Find the worst possible undetected attack for
each classifier, and then find the cost (kWh
Lost) of these attacks
Adversary Model
[diagram: the real consumption f(t) / Y1, ..., Yn is replaced by an attack a(t), so the utility receives fake meter readings Ŷ1, ..., Ŷn]
Goal of attacker: minimize the energy bill:
  min over Ŷ1, ..., Ŷn of Σ_{i=1}^{n} Ŷi
Goal of attacker: avoid detection by classifier C:
  C(Ŷ1, ..., Ŷn) = normal
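To make the proposal concrete, a toy sketch (not one of the detectors evaluated in the paper): against a detector that flags a day whose reported mean drops below a fraction tau of the historical mean, the worst undetected attack reports exactly at that boundary, and its cost is the kWh difference; all parameter values are hypothetical:

```python
import numpy as np

def worst_undetected_attack_avg(y_real, hist_mean, tau=0.8):
    """Worst undetected attack against a toy 'average' detector.

    Detector (illustrative): flag the day if the reported daily mean
    drops below tau * hist_mean.  The attacker minimizes the reported
    total, so the optimal undetected strategy is a flat signal sitting
    exactly at the detection boundary.
    """
    n = len(y_real)
    y_fake = np.full(n, tau * hist_mean)   # boundary of "normal"
    kwh_lost = max(0.0, (np.sum(y_real) - np.sum(y_fake)) / 1000.0)
    return y_fake, kwh_lost

# Example with a hypothetical daily profile (Wh per 15-minute slot).
rng = np.random.default_rng(1)
real = rng.normal(250.0, 40.0, 96)
fake, lost = worst_undetected_attack_avg(real, hist_mean=real.mean())
print(f"worst undetected attack steals about {lost:.1f} kWh on this day")
```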
Real vs. Attack Signals
[figure: real vs. attack electricity usage (Wh) over the time slots of a day; the attack signal under-reports consumption]
New Tradeoff Curve: No Detection Rates
(can be extended to other fields)
 Y-axis: average loss per attack [Wh], i.e., the cost of undetected attacks
 X-axis: false positive rate
[figure: tradeoff curves for the ARMA-GLR, Average, CUSUM, EWMA, and LOF detectors]
False Alarms Because of Concept Drift
 Electricity consumption is a non-stationary distribution
 We have to “retrain” models
• Online learning
• Concept drift
[figure: consumption (0.1 Wh) over time (hours); the classifier is re-trained to account for concept drift]
Asymptotic Effects of Poisoning Attacks
 An attacker can use undetected attacks to poison the training data
[figure: “valid” electricity consumption vs. undetected attacks over time]
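A minimal sketch, not from the slides, of sliding-window retraining of the "normal" baseline; the window length is illustrative, and the docstring caveat is the poisoning risk described above:

```python
import numpy as np
from collections import deque

class SlidingWindowBaseline:
    """Re-estimate the 'normal' consumption model on a sliding window
    so it can follow legitimate concept drift (seasons, new appliances).

    Caveat from the slide: an attacker whose undetected attacks enter
    this window can slowly drag the baseline down (poisoning).
    """
    def __init__(self, window_days=30):
        self.window = deque(maxlen=window_days)

    def update(self, daily_readings):
        # Append today's mean consumption; the oldest day falls out.
        self.window.append(np.mean(daily_readings))

    def baseline_mean(self):
        return float(np.mean(self.window)) if self.window else None
```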
Detecting Poisoning Attacks
 Identify concept drift trends helping an attacker
 Lower electricity consumption over time.
 Countermeasure: linear regression of trend
[figure: slope of the regression line and determination coefficient for honest users vs. attackers, on original and attack data]
 The slope of the regression was not a good discriminant
 The determination coefficient worked!
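A sketch of the countermeasure, assuming the defender tracks the daily mean of the training baseline; scipy's linregress supplies the slope and R², and the R² threshold is illustrative:

```python
import numpy as np
from scipy import stats

def poisoning_trend_features(daily_means):
    """Countermeasure sketch: regress the baseline's daily mean on time.

    A poisoning attacker must push consumption down consistently, so
    the downward trend is well explained by a line and the
    determination coefficient (R^2) is high.  Honest concept drift is
    noisier, so R^2 stays low even if the slope happens to be negative.
    """
    t = np.arange(len(daily_means))
    slope, _, r_value, _, _ = stats.linregress(t, daily_means)
    return slope, r_value**2

def looks_poisoned(daily_means, r2_threshold=0.8):
    slope, r2 = poisoning_trend_features(daily_means)
    return slope < 0 and r2 > r2_threshold
```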
Talk Outline
 Case example of Classifier Evaluation for
Electricity Theft Detection
 Evaluating Classifiers Analytically
 Game Theory and Adversarial Classification
 Oakland 2006, AAAI 2006, NIPS 2008, Infocom
2007, ToN 2009
 Metric to account for imbalanced datasets
 Future work on big data
Adversarial Classification is a Game
 Traditional ML
 Nature makes first move:
• Statistics of classes
 Classifier makes second move:
• Optimal classifier for the given statistical properties
 Adversarial ML
 Classifier makes first move:
• Optimal classifier for normal class and guessed attacks
 Attacker makes second move:
• Modify attack after seeing classifier
 A lot of previous work on game theory
 I will focus on evaluation
How Can We Ensure That Classifier
Performance in Deployment is the Same
(or Better) as Metric During Design?
 Minimax strategy is the safety level of the game for
player 1:
 The smallest worst-case Φ (error) an attacker can inflict on
the system
 Step 1: Maximize Φ(D,A) over A (attack parameter)
over a set of possible classifiers D
 Step 2: Minimize Φ(D,A*) over D
 Provable bound on worst-case performance: D is
“secure” if Φ(D, A) < m for every A
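A minimal sketch, not from the slides, of this two-step minimax evaluation over finite sets of detectors D and attack parameters A; the Φ function and the numbers in the usage lines are hypothetical:

```python
def minimax_evaluation(detectors, attacks, phi):
    """Two-step safety-level computation.

    detectors: finite set of candidate classifiers D
    attacks:   finite set of attack parameters A
    phi(d, a): damage/error metric the defender minimizes and the
               attacker maximizes.

    Step 1: for each detector, find its worst-case attack.
    Step 2: pick the detector whose worst case is smallest.
    """
    best_d, best_value = None, float("inf")
    for d in detectors:
        worst = max(phi(d, a) for a in attacks)   # step 1: max over A
        if worst < best_value:                    # step 2: min over D
            best_d, best_value = d, worst
    return best_d, best_value

# Toy usage with hypothetical numbers: detection thresholds vs. attack
# magnitudes; damage is the size of an attack that stays undetected
# (false-alarm cost is ignored to keep the example short).
phi = lambda thr, a: a if a <= thr else 0.0
detector, bound = minimax_evaluation([1.0, 2.0, 3.0], [0.5, 1.5, 2.5], phi)
print(detector, bound)   # -> 1.0 0.5
```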
Example: Game Theory in ROC curves
 Goal: Select Classifier that Minimizes Probability
of Error
 Attacker has control of prior
 Oakland 2006
Attacker Has Second Move
 If you select Classifier 1 (h1),
 Attacker selects p1 with Pr[Error |h1] = 0.3
 If you select Classifier 2 (h2),
 Attacker selects p2 with Pr[Error |h2] = 0.4
 Can we do better than selecting Classifier 1?
Obtaining Full ROC Curve
 Known to Neyman-Pearson, 1933
 ROCCH popularized by Provost, Fawcett, 2001
Intermission: “Real” ROC Curve
 Barreno, Cárdenas, Tygar, NIPS 2007/08
 Theorem: in general, the optimal ROC has [bound not reproduced in the transcript], where n is the number of classification rules
ROC Randomization = Mixed Strategy
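To make the randomization argument concrete, a sketch for a 2x2 game: the diagonal error values 0.3 and 0.4 are the ones quoted on the earlier slide, while the off-diagonal entries are made up for illustration:

```python
# Hypothetical 2x2 error matrix: rows = defender classifiers (h1, h2),
# columns = attacker priors (p1, p2).  Diagonal entries (0.3, 0.4) are
# from the earlier slide; off-diagonal values are illustrative.
errors = [[0.3, 0.1],
          [0.2, 0.4]]

def best_mix_2x2(E):
    """Probability q of playing h1 that equalizes the attacker's two
    replies, i.e. the defender's minimax mixed strategy for a 2x2 game
    (assuming an interior equilibrium exists)."""
    (a, b), (c, d) = E
    q = (d - c) / (a - c - b + d)
    value = q * a + (1 - q) * c      # expected error against column 1
    return q, value

q, v = best_mix_2x2(errors)
print(f"play h1 with prob {q:.2f}; worst-case error drops to {v:.2f} (< 0.3)")
```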
In General, Easier to Reverse Optimization Order
 Minimax = Maximin iff the game has a saddle point
Example: MAC-Layer Misbehavior
[figure: measured backoff distributions of commodity Wi-Fi cards (ASUS, Centrino, Digicom, Dlink, Linksys) at an access point vs. the expected backoff distribution; Bianchi et al., Infocom 2007]
Adversary Model
 Lemma: the probability that the adversary accesses the channel is greater than G [equation omitted in the transcript]
 The attack pmfs p1 must then belong to a constrained set [equation omitted]
Analytical form for optimal attack
 The optimal p1 is given in closed form, where r is the solution to an auxiliary equation [equations omitted]
Attackers have not figured out the optimal attack
[figure: measured backoff distributions (ASUS, Centrino, Digicom, Dlink, Linksys) compared with the expected distribution and the optimal-attack distribution]
SPRT outperforms previous solutions
[figure: expected time to detect misbehavior vs. expected time for a false positive]
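For reference, a generic sketch of Wald's SPRT over observed backoff values; the pmfs p0 and p1 and the error targets are placeholders, and this is not the exact detector or optimal-attack pmf from the slides:

```python
import math

def sprt(samples, p0, p1, alpha=0.01, beta=0.01):
    """Wald's Sequential Probability Ratio Test (illustrative sketch).

    samples: observed backoff values (discrete); p0/p1: pmfs over those
    values for legitimate vs. misbehaving stations.  Thresholds come
    from the target false-positive (alpha) and miss (beta) rates.
    Returns ('misbehaving' | 'legitimate' | 'undecided', samples used).
    """
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    llr = 0.0
    for n, x in enumerate(samples, start=1):
        llr += math.log(p1[x] / p0[x])    # accumulate log-likelihood ratio
        if llr >= upper:
            return "misbehaving", n
        if llr <= lower:
            return "legitimate", n
    return "undecided", len(samples)
```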
Talk Outline
 Case example of Classifier Evaluation for
Electricity Theft Detection
 Evaluating Classifiers Analytically
 Metric to account for imbalanced datasets
 Future work on big data
Polygraph Test
 A national security
organization with
10000 employees
has one traitor
(intruder)
 Assume Moe is
tested with a 99%
accurate Polygraph
test:
-  Pr[ A=1 | I=1 ] = 0.99 (PD)
-  Pr[ A=0 | I=0 ] = 0.99 (1 - PF)
What is the probability that Moe is a traitor?
 Moe tests positive.
 What is the probability that Moe is
a traitor?
 10,000 employees; one traitor
 PD = Pr[ A=1 | I=1 ] = 0.99
 PF = Pr[ A=1 | I=0 ] = 0.01
A) 0.99
B) 0.01
C) 0.50
D) ?
The base-rate fallacy problem
 If Pr[ I=1 | A=1 ] = 0.01 sounds
counterintuitive, you are exhibiting the base-rate fallacy syndrome
 Ignoring base rates in your probability assessments
 Small base-rates: crying wolf phenomenon
 It is easy to lie with statistics
-  But it is easier to lie without them.
Dan Geer
Base-rate Fallacy in IDS: Axelsson, TISSEC 2000
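As a sanity check on the quoted number, a two-line Bayes' rule computation (not on the original slides):

```python
def posterior_traitor(p_d, p_f, base_rate):
    """Pr[I=1 | A=1] via Bayes' rule."""
    return (p_d * base_rate) / (p_d * base_rate + p_f * (1 - base_rate))

# 1 traitor among 10,000 employees, 99% "accurate" polygraph:
print(posterior_traitor(p_d=0.99, p_f=0.01, base_rate=1 / 10000))
# -> about 0.0098: a positive test means less than a 1% chance of guilt.
```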
Alternative Metric to Evaluate IDS?
 PF can be misinterpreted
 Count number of alarms? Good but heuristic
 Replace ROC with a graph of the posterior
probability
 Problems:
• How do you maximize the posterior probability?
• Uncertain p (probability of attack)
 Precision-Recall curves
 Are always computed empirically with the base rate
from the dataset
Evaluating the Performance of Intrusion Detection Systems
Metric → Field
 ROC → Signal Processing
 Cost-sensitive evaluation (Bayes risk) → Decision Theory / Operations Research
 Intrusion Detection Capability (C_ID) → Information Theory
 Bayesian Detection Rate, Pr[I|A] = PPV → Statistics
 Distinguishability / Sensitivity → Cryptography
New Metric: IDOC (B-ROC)
 Tradeoff between PD = Pr[ A=1 | I=1 ] and Pr[ I=1 | A=1 ] for different base rates
[figure: Pr[ I=1 | A=1 ] vs. PD = Pr[ A=1 | I=1 ]]
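A sketch (with a hypothetical toy ROC) of how B-ROC/IDOC-style points can be computed from a classifier's ROC curve and a set of candidate base rates:

```python
import numpy as np

def b_roc_points(p_d_grid, roc, base_rates):
    """Compute (PPV, P_D) points for an IDOC/B-ROC-style plot.

    p_d_grid:   detection rates to evaluate
    roc(p_d):   the classifier's false-positive rate at that detection
                rate (its ROC curve, supplied by the user)
    base_rates: candidate values of p = Pr[I=1]
    """
    points = {}
    for p in base_rates:
        curve = []
        for p_d in p_d_grid:
            p_f = roc(p_d)
            ppv = p_d * p / (p_d * p + p_f * (1 - p))   # Pr[I=1 | A=1]
            curve.append((ppv, p_d))
        points[p] = curve
    return points

# Toy ROC: the false-positive rate grows quadratically with P_D
# (hypothetical shape, for illustration only).
toy_roc = lambda p_d: 0.1 * p_d**2
pts = b_roc_points(np.linspace(0.5, 0.99, 5), toy_roc, [1e-2, 1e-4])
```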
Talk Outline
 Case example of Classifier Evaluation for
Electricity Theft Detection
 Evaluating Classifiers Analytically
 Metric to account for imbalanced datasets
 Future work on big data
CSA: Big Data Working Group
 CSA Big Data Working Group Site
https://cloudsecurityalliance.org/research/bigdata/
 CSA, Big Data LinkedIn
http://www.linkedin.com/groups?home=&gid=4458215&trk=anet_ug_hm
 Basecamp Project Collaboration Site Request Form
https://cloudsecurityalliance.org/research/basecamp/
5 Research Directions
 Big data analytics for security intelligence
 Privacy preserving/enhancing technologies
 Big data-scale crypto
 Big data cloud infrastructure and attack
surface reduction
 Policy and governance
The Road to Better Situational Awareness
 Intrusion Detection Systems: network flows, HIDS logs
 SIEM: alarm correlation
 Big Data Security/Analytics: variety of data, security intelligence
What is new in Big Data?
Traditional Systems → Big Data Promise
 More rigid, predefined schemas → structured and unstructured data treated seamlessly
 Data gets deleted → keep data for historical correlation (e.g., 10 years)
 Complex analyst queries take long to complete → faster query response times
 Others? → Others?
Hadoop is the de facto open standard for big data at rest
Stream processing? Participation welcome!!
How Do We Achieve These Objectives?
 Academic audience:
 How do I convince my
peers that the evaluation of
a classifier is:
1.  Technically sound
2.  Follows scientific method
that considers actions of
an attacker
 Industry audience
 How do I convince
customers/industry
partners that our
technology is valuable
 Business case for investing
in new technology
 Value proposition