Data mining for crime detection
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita

Digital age

Data about people, organizations, operations increasingly available online:
– Phone calls, credit card and ATM usage
– Birth records, employment records
– Residence, land owned
– Countries visited
– Types of licenses
– Places visited: monitoring cameras
– Money transfers

Cheap storage and fast machines make it easy to store and analyze these.

Digital crimes

Crimes are increasingly technical:
– Credit card frauds
– Stock market scams
– Hacker attacks on government computers and networks
– Insurance frauds

Data mining

Process of semi-automatically analyzing large databases to find patterns that are:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: it should be possible to act on the pattern
– understandable: humans should be able to interpret the pattern

Existing applications

Banking: loan/credit card approval
– predict good customers based on old customers

Customer relationship management
– identify those who are likely to leave for a competitor

Targeted marketing
– identify likely responders to promotions

Medicine: disease outcome, effectiveness of treatments
– analyze patient disease history: find relationships between diseases

Applications in crime investigation

Fraud detection: telecommunications, financial transactions
– from an online stream of events, identify the fraudulent ones

Interpret insurance claims (in text format)
– classify a claim as valid or not

Detect attacks and intrusions on computers and networks by profiling normal behaviour

Health insurance frauds
– cohorts of doctors that ping-pong patients to each other

Identify links amongst people

The KDD process

Data warehouse → extract data via ODBC → preprocessing utilities (sampling, attribute transformation) → mining operations using scalable algorithms (association, classification, clustering, sequence mining) → visualization tools

Mining operations

Classification
– Regression
– Classification trees
– Neural networks
– Bayesian learning
– Nearest neighbour
– Radial basis functions
– Support vector machines
– Meta learning methods: bagging, boosting

Clustering
– hierarchical
– EM
– density based

Sequence mining
– Time series similarity
– Temporal patterns

Itemset mining
– Association rules
– Causality

Classification

Given old data about customers and payments, predict a new applicant’s loan eligibility.

Previous customers (age, salary, profession, location, customer type) → classifier → decision rules (e.g. Salary > 5 L, Prof. = Exec) → applied to a new applicant’s data → good/bad

Classification methods

Goal: predict class Ci = f(x1, x2, ..., xn)

Regression (linear or any other polynomial)
– a*x1 + b*x2 + c = Ci
Nearest neighbour
Decision tree classifier: divide the decision space into piecewise constant regions
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
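
As a concrete illustration of this classification setting, here is a minimal sketch that fits a decision tree to toy applicant records; the feature names, data values and the use of scikit-learn are assumptions for illustration, not part of the original slides.

# Minimal sketch: decision-tree classification of loan applicants.
# Feature names and data values are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [age, salary in lakh, is_exec (1/0)]; label: 1 = good, 0 = bad
X = [[35, 8.0, 1], [42, 3.0, 0], [29, 6.5, 1], [51, 2.0, 0], [38, 7.2, 0]]
y = [1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=["age", "salary", "is_exec"]))

# Predict the label (good/bad) for a new applicant.
print(clf.predict([[33, 5.5, 1]]))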

Nearest neighbor

Define proximity between instances, find the neighbours of a new instance and assign the majority class.
Case based reasoning: when attributes are more complicated than real-valued.

Pros
+ fast training
Cons
– slow during application
– no feature selection
– notion of proximity is vague
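
A minimal sketch of the nearest-neighbour idea, assuming numeric attributes, Euclidean distance as the proximity measure and k = 3; all data values are made up.

# Minimal k-nearest-neighbour sketch (k = 3, Euclidean distance).
from collections import Counter
import math

train = [([35, 8.0], "good"), ([42, 3.0], "bad"), ([29, 6.5], "good"),
         ([51, 2.0], "bad"), ([38, 7.2], "good")]

def knn_predict(x, train, k=3):
    # Sort training instances by distance to x and take the k closest.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    # Assign the majority class among the neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict([33, 5.5], train))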

Decision trees

Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

Example tree from the slide: internal tests Degree = BSc, Score > 90 and Rank < 6, with leaves labelled Good or Bad.
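
The exact branching of the slide’s tree is not recoverable from the transcript, so the following sketch encodes one plausible reading of it as nested rules; treat the structure as illustrative only.

# One plausible reading of the slide's tree, written as nested rules.
# The exact branch structure is an assumption for illustration.
def classify(degree, score, rank):
    if degree == "BSc":
        return "Good" if score > 90 else "Bad"
    else:
        return "Good" if rank < 6 else "Bad"

print(classify("BSc", 95, 10))  # score > 90 on the BSc branch -> Good
print(classify("MSc", 70, 8))   # rank not < 6 on the other branch -> Bad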

Neural network

Set of nodes connected by directed weighted edges.

Basic NN unit: inputs x1, x2, x3 with weights w1, w2, w3 produce output
  o = σ( Σ i=1..n wi·xi ),   where σ(y) = 1 / (1 + e^(-y))

A more typical NN: an input layer (x1, x2, x3), a layer of hidden nodes, and output nodes.

Used for face recognition and other image recognition tasks.
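
A minimal sketch of the basic unit above: a weighted sum of the inputs passed through the logistic function σ; the weights and inputs are made-up numbers.

# Basic neural-network unit: o = sigmoid(sum_i w_i * x_i)
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(xs, ws):
    return sigmoid(sum(w * x for w, x in zip(ws, xs)))

print(unit_output([0.5, 1.0, -0.2], [0.8, -0.3, 0.4]))  # illustrative values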

Association rules

Given a set T of groups of items
– Example: set of baskets of items purchased, such as {milk, cereal}, {tea, milk}, {tea, rice, bread}, …

Goal: find all rules on itemsets of the form a --> b such that
– support of a and b > user threshold s
– conditional probability (confidence) of b given a > user threshold c

Example: milk --> bread

Lots of work done on scalable algorithms
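
A minimal brute-force sketch of the support and confidence definitions above, evaluated for the rule milk --> bread over toy baskets; the basket contents are illustrative.

# Brute-force support/confidence for a candidate rule a --> b over toy baskets.
baskets = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"},
           {"milk", "bread"}, {"milk", "bread", "cereal"}]

def support(itemset):
    # Fraction of baskets containing every item of the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(a, b):
    # Conditional probability of b given a.
    return support(a | b) / support(a)

a, b = {"milk"}, {"bread"}
print(support(a | b), confidence(a, b))  # rule milk --> bread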

Applications of fast itemset counting

Find correlated events
Applications in medicine: find redundant tests
Cross selling in retail, banking
Intrusion detection

Case study: data mining for network intrusion detection

Fighting intrusion

Prevention: isolate from the network, strict authentication measures, encryption
Preemption: “do unto others before they do unto you”
Deterrence: dire warnings, “we have a bomb too.”
Deflection: diversionary techniques to lure attackers away
Detection
Counter attacks

Intrusion detection methods

Anomaly-based
– study typical patterns of normal use and detect abnormal usage
– cannot distinguish illegal from abnormal

Signature-based
– model signatures of previous attacks and flag matching patterns
– cannot detect new intrusions

Use hybrid

Intrusion detection methods

Automatic rules
– use historical audit trails and an intelligent learning technique to model normal and intrusion traffic
– may not provide full coverage

Policy-driven rules
– a security expert codifies the rules
– manually intensive, might miss patterns, may not evolve as the normal usage pattern slowly drifts

Use hybrid

Current Intrusion Detection Approaches

Main problems: manual and ad hoc
– Misuse detection:
  • known intrusion patterns have to be hand-coded
  • unable to detect any new intrusions (that have no matched patterns recorded in the system)
– Anomaly detection:
  • selecting the right set of system features to be measured is ad hoc and based on experience
  • unable to capture sequential interrelations between events

Data Mining

Why is it applicable to intrusion detection?
– Normal and intrusive activities leave evidence in audit data
– From the data-centric point of view, intrusion detection is a data analysis process
– Successful applications in related domains, e.g., fraud detection, fault/alarm management

Relevant data mining algorithms

Classification: maps a data item into one of several pre-defined categories
Link analysis: determines relations between fields in the database
Sequence analysis: models sequential patterns

Intrusion detection

Intrusions can be detected at
– Network level (denial-of-service attacks, port scans, etc.)
  • sequence of TCP dumps
– Host level (attacks on privileged programs like lpr, sendmail)
  • sequence of system calls, e.g. open, lseek, lstat, mmap, execve, ioctl, ioctl, close, execve, close, unlink
  • S = set of all possible system calls, |S| ≈ 100

Classification Models on sendmail

Philosophy:
– For most privileged programs, the short sequences of system calls made during normal executions are very consistent, yet different from the sequences of its abnormal (exploited) executions as well as the executions of other programs

The sendmail data:
– Each trace has two columns: the process ids and the system call numbers
– Normal traces: sendmail and the sendmail daemon
– Abnormal traces generated by known sendmail attacks: sunsendmailcap, syslog-remote, syslog-local, decode, sm5x and sm565a

Classification Models on sendmail

Data preprocessing (most challenging)
– Converting sequential data to record data
– Use a sliding window to create sequences of consecutive system calls
– Label the sequences to create training data:

  sequence (length 7)        class label
  4 2 66 66 4 138 66         “normal”
  5 5 5 4 59 105 104         “abnormal”
  …                          …
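
A minimal sketch of the sliding-window step: it turns a trace of system-call numbers into fixed-length records; the window length of 7 matches the slide, the trace values are made up and labelling is left out.

# Slide a window of length 7 over a trace of system-call numbers,
# producing one fixed-length record per position.
def sliding_windows(trace, n=7):
    return [trace[i:i + n] for i in range(len(trace) - n + 1)]

trace = [4, 2, 66, 66, 4, 138, 66, 5, 45, 104]   # made-up call numbers
for record in sliding_windows(trace):
    print(record)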

Classification Models on sendmail

Learning patterns of normal sequences:
– Each record: n consecutive system calls plus a class label, “normal” or “abnormal”
– Training data: sequences from 80% of the normal traces plus some of the attack traces
– Testing data: traces not used in training, including some unknown attacks
– Use RIPPER to learn specific rules

sendmail Experiment 1

Examples of output RIPPER rules:
– if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is “normal”
– if the 6th system call is lseek and the 7th is sigvec, then the sequence is “normal”
– …
– if none of the above, then the sequence is “abnormal”

sendmail Experiment 1

Using the learned rules to analyze a new trace:
– label all sequences according to the rules
– define a region as l consecutive sequences
– define an “abnormal” region as having more “abnormal” sequences than normal ones
– calculate the percentage of “abnormal” regions
– the trace is “abnormal” if the percentage is above a threshold
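
A minimal sketch of the region-based scoring just described, assuming non-overlapping regions of l = 5 consecutive sequences, made-up per-sequence labels and an illustrative threshold.

# Score a trace from per-sequence labels: split into regions of l consecutive
# sequences, call a region abnormal if abnormal labels dominate, and report
# the fraction of abnormal regions.
def abnormal_region_rate(labels, l=5):
    regions = [labels[i:i + l] for i in range(0, len(labels), l)]
    abnormal = [r for r in regions if r.count("abnormal") > r.count("normal")]
    return len(abnormal) / len(regions)

labels = ["normal"] * 12 + ["abnormal"] * 4 + ["normal"] * 4   # made-up labels
rate = abnormal_region_rate(labels)
print(rate, "abnormal trace" if rate > 0.1 else "normal trace")  # threshold assumed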

sendmail Experiment 1

– Training data includes sequences from intrusion traces and sequences from 80% of the normal sendmail traces
– Sequence length = 7
– The percentage of abnormal “regions” of each trace (shown in the table) is used as the intrusion indicator
– The output rule sets contain ~250 rules, each with 2 or 3 attribute tests; this compares with ~1,500 distinct sequences in total

sendmail Experiment 1

Trace              % of abnormal regions
sscp-1             32.2
sscp-2             30.4
sscp-3             30.4
syslog-remote-1    21.2
syslog-remote-2    15.6
syslog-local-1     11.1
syslog-local-2     15.9
decode-1            2.1
decode-2            2.0
sm565a              8.0
sm5x                6.5
sendmail            0.1

Illegal uses of sendmail show a much higher fraction of abnormal regions, as found by the rule learner.

Mining Models on tcpdump data

Packets of incoming, outgoing, and internal broadcast traffic
Much richer than the previous sendmail data
Needs extensive preprocessing to convert to a usable form

Preprocessing Steps

Raw audit data converted to ASCII-level packets
Packets aggregated to connection-level records
– record connection attempts
– monitor data packets and count the # of bytes in each direction
– watch how the connection is terminated

Use feature selection methods from data mining to augment connection records with temporal features

Structure of connection record

Each record has:
– start time and duration
– participating hosts and ports (applications)
– statistics (e.g., # of bytes)
– flag: “normal” or a connection/termination error
– protocol: TCP or UDP
– a collection of temporal features extracted using data mining; for example, in a port scan, multiple rejected connections to the same host

Manually chosen features

“generic” features:
– protocol (service)
– protocol type (tcp, udp, icmp, etc.)
– duration of the connection
– flag (connection established and terminated properly, SYN error, rejected, etc.)
– whether the connection is from/to the same ip/port pair

“content” features (only useful for TCP connections):
– # of failed logins
– successfully logged in or not
– # of root shell prompts
– “su root” attempted or not
– # of accesses to security control files
– # of compromised states (e.g., “Jumping to address”, “path not found”, …)
– # of write accesses to files
– # of hot indicators (the sum of all the above “hot” indicators)

Features from mined patterns:

temporal and statistical “traffic” features:
• # of connections to the same destination host as the current connection in the past 2 seconds, and among these connections:
• # of rejected connections
• # of different services
• rate (%) of connections that have the same service
• rate (%) of different (unique) services
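
A minimal sketch of how the 2-second traffic features above could be computed, assuming each connection record is a (timestamp in seconds, destination host, service, rejected flag) tuple; the record layout and values are assumptions.

# Sketch: derive the 2-second "traffic" features for one connection from a
# list of earlier records, each (timestamp_sec, dst_host, service, rejected).
def traffic_features(current, history, window=2.0):
    t, host, _, _ = current
    recent = [r for r in history if r[1] == host and t - window <= r[0] <= t]
    count = len(recent)
    rejected = sum(1 for r in recent if r[3])
    services = [r[2] for r in recent]
    same_srv_rate = services.count(current[2]) / count if count else 0.0
    diff_srv_rate = len(set(services)) / count if count else 0.0
    return {"count": count, "rejected": rejected,
            "same_srv_rate": same_srv_rate, "diff_srv_rate": diff_srv_rate}

history = [(10.1, "h1", "http", False), (10.9, "h1", "ftp", True),
           (11.2, "h2", "http", False)]                    # made-up records
print(traffic_features((11.5, "h1", "http", False), history))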

Experimental setup

DARPA provided normal traffic mixed with simulated intrusions in a military network
4 GB of tcpdump data covering 7 weeks of network traffic
5 million connection records of 100 bytes each
Test records: 2 million over two weeks; 38 attack types, of which 14 appear only in the test data

Attack types:
• DOS: denial-of-service attacks
• PROBING: port scans
• Unauthorized access from a remote machine, e.g. guessing passwords
• Unauthorized access to local superuser, e.g. buffer-overflow attacks

Example rules

buffer_overflow :- hot >= 3, compromised >= 1, su_attempted <= 0, root_shell >= 1.

ipsweep :- protocol = eco_i, srv_diff_host_rate >= 0.5, count <= 2, srv_count >= 6.

smurf :- protocol = ecr_i, count >= 5, srv_count >= 5.

Results

Compared with four other groups using knowledge engineering approaches
This method was
– best for PROBING attacks
– one of the best on DOS and user-to-root attacks
– no system was good at detecting remote-to-local attacks

Difficulties of the other methods
– large amount of data
– do not generalize

Mining market

Generic data mining tools (around 20 to 30 mining tool vendors)
– SAS’s Enterprise Miner
– Clementine
– IBM’s Intelligent Miner

Many DBMS and data warehousing systems come packaged with standard data mining tools
Several fraud detection products:
http://www.kdnuggets.com/solutions/fraud-detection.html

Data warehousing and analysis

Data warehousing

[Architecture diagram: operational data from branches (e.g. Bombay branch on Oracle, Delhi branch, Calcutta branch on IMS), detailed transactional data, GIS data and census data are merged, cleaned and summarized into a data warehouse built on a relational DBMS (e.g. Redbrick); the warehouse serves direct queries, reporting tools (Crystal Reports), OLAP tools (Essbase) and mining tools (Intelligent Miner, SAS).]

Multidimensional Data analysis

Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time

Hierarchical summarization paths:
– Product: Industry → Category → Product
– Location: Region → Country → City → Office
– Time: Year, Quarter, Month, Week, Day

A Sample Data Cube

[Figure: a 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum); one highlighted cell holds the total annual sales of TVs in the U.S.A.]

Typical OLAP Operations

Roll up (drill-up): summarize data
– by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up
– from higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Slice and dice:
– project and select

Pivot (rotate):
– reorient the cube; visualization; 3D to a series of 2D planes
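
A minimal sketch of roll-up and slice on a toy sales table, using pandas as a stand-in for an OLAP engine; the column names and values are made up.

# Sketch: OLAP-style roll-up and slice on a toy sales table with pandas.
import pandas as pd

sales = pd.DataFrame({
    "country": ["U.S.A", "U.S.A", "Canada", "Mexico"],
    "product": ["TV", "PC", "TV", "VCR"],
    "quarter": ["1Qtr", "2Qtr", "1Qtr", "3Qtr"],
    "amount":  [100, 80, 60, 40],
})

# Roll up: total sales by country and product (summing over quarters).
print(sales.pivot_table(values="amount", index="country",
                        columns="product", aggfunc="sum", fill_value=0))

# Slice: fix one dimension (country = U.S.A) and look at the rest.
print(sales[sales["country"] == "U.S.A"])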

OLAP

Navigational operators: pivot, drill-down, roll-up, select
Hypothesis-driven search, e.g. factors affecting defaulters:
– view the defaulting rate on age, aggregated over other dimensions
– for a particular age segment, detail along profession
Demo in the afternoon

Link data from several data sources

Data sources: taxpayers, land records, passport, transport, telephone

Duplicates:
SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014 ,
DERYCK ,D ,SOZA ,03 ,GERA VILLA ,,VIMAN NAGAR PUNE, 411014

Non-duplicates:
CHAFEKAR ,RAMCHANDRA ,DAMODAR ,SHOP 8 ,H NO 509 NARAYAN PETH PUNE 411030
CHITRAV ,RAMCHANDRA ,D ,FLAT 5 ,H NO 2105 SADASHIV PETH PUNE 411 030

Machine Learning approach

Given examples of duplicate and non-duplicate pairs, learn to predict whether a pair is a duplicate or not.

Input features: various kinds of similarity functions between attributes
– edit distance, Soundex, N-grams on text attributes
– absolute difference on numeric attributes
– capture domain-specific knowledge on comparing data
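
A minimal sketch of mapping a record pair to a similarity feature vector; difflib’s edit-distance ratio and a 3-gram overlap stand in for the similarity functions named above, and the field layout is an assumption.

# Sketch: map a pair of records to a vector of similarity features.
from difflib import SequenceMatcher

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def pair_features(rec1, rec2):
    name_sim = SequenceMatcher(None, rec1["name"], rec2["name"]).ratio()
    common = ngrams(rec1["address"]) & ngrams(rec2["address"])
    union = ngrams(rec1["address"]) | ngrams(rec2["address"])
    addr_sim = len(common) / max(1, len(union))
    pin_diff = abs(rec1["pincode"] - rec2["pincode"])   # numeric attribute
    return [name_sim, addr_sim, pin_diff]

a = {"name": "DERYCK D SOZA", "address": "GERA VILLA VIMAN NAGAR PUNE", "pincode": 411014}
b = {"name": "SOUZA D D",     "address": "GORA VILLA VIMAN NAGAR",      "pincode": 411014}
print(pair_features(a, b))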

The learning approach

[Diagram: labelled record pairs (e.g. Record 1 / Record 2 marked D, Record 1 / Record 3 marked N) are mapped through the similarity functions f1, f2, …, fn into feature vectors with 0/1 labels; a classifier is trained on these vectors (the example tree tests YearDifference > 1, All-Ngrams <= 0.48, AuthorTitleNgrams <= 0.4, TitleIsNull < 1, PageMatch >= 0.5 and AuthorEditDist >= 0.8 to decide Duplicate vs Non-Duplicate); pairs from the unlabelled list (Records 6–11) are mapped the same way and the classifier predicts whether each pair is a duplicate.]

Experiences with the learning approach

Too much manual search in preparing training data
– hard to spot challenging and covering sets of duplicates in large lists
– even harder to find close non-duplicates that will capture the nuances
– one heuristic: examine instances that are similar on one attribute but dissimilar on another

Active learning is a generalization of this!
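
A minimal sketch of the active-learning idea: train on the labelled pairs, then ask the user to label the unlabelled pair the classifier is least certain about; the logistic-regression model and the toy feature vectors are assumptions.

# Sketch: one round of uncertainty-based active learning over pair features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# A few labelled pairs (feature vectors from the similarity functions) ...
X_labeled = np.array([[1.0, 0.4, 0.2], [0.0, 0.1, 0.3], [0.3, 0.4, 0.4]])
y_labeled = np.array([1, 0, 1])
# ... and a pool of unlabelled pairs.
X_pool = np.array([[0.7, 0.1, 0.6], [0.3, 0.4, 0.4], [0.9, 0.8, 0.7]])

clf = LogisticRegression().fit(X_labeled, y_labeled)
probs = clf.predict_proba(X_pool)[:, 1]
# The most uncertain pair has predicted probability closest to 0.5;
# that is the pair we would show to the human labeller next.
most_uncertain = int(np.argmin(np.abs(probs - 0.5)))
print("ask the user to label pair", most_uncertain, "p(dup) =", probs[most_uncertain])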

The active learning approach

[Diagram: as in the learning approach, labelled pairs are mapped through the similarity functions f1, f2, …, fn and used to train a classifier; pairs from the unlabelled list (Records 6–11) are mapped into unlabelled feature vectors, and an active learner selects the most informative of these for the user to label, after which they are added to the training set.]

Architecture of ALIAS

[Diagram: components include initial training records Lp and unlabelled input records D, which a mapper converts (using the similarity functions F) into labelled training data T and a pool Dp of mapped unlabelled instances; an active learner trains a classifier, selects instances S from Dp for labelling (a predicate defines the uncertain region, supported by similarity indices), and the resulting deduplication function is applied by an evaluation engine to a large record list A, inferring further pairs via transitivity, to output groups of duplicates in A.]

Benefits of active learning

Learning the de-duplication function on Bibtex entries

With 100 labelled pairs:
– Active learning: 97% (peak)
– Random selection: only 30%