ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009
Learning, Indexing and Diagnosing Network Faults
Ting Wang†, Mudhakar Srivatsa‡,
Dakshi Agrawal‡ and Ling Liu†
Georgia Institute of Technology†
IBM T.J. Watson Research Center‡
© 2008 IBM Corporation
Complex Networks
• Network as a graph
– Vertices represent network entities
– Edges represent pair-wise (local) interactions between network entities
• Even simple interactions give rise to complex global network phenomena
– Fault cascading in communication networks
– Information spread (e.g., via emails) in social networks
– Infection propagation in protein interaction networks
• Key challenge is to detect and understand emerging global phenomena
Network Monitoring Data
• Networks generate massive monitoring data (aka events)
– Monitored data consists of local (in both space & time) observations on the network
– Monitored data is incomplete and sometimes even erroneous (e.g., imprecise, out-of-order with respect to both time and causality, etc.)
• Examples
– Ping failure, interface down, high CPU utilization, etc. in communication networks
– Email threads (time stamp, tokenized subject, MIME type, etc.) between members in an organizational hierarchy
– Pathological symptoms in biological networks – protein interaction networks (PINs)
• Key observation: monitoring data gathered from network entities are correlated through the network topology
Network Patterns
• Network patterns attempt to efficiently capture spatial (topological) and temporal correlations in monitored data
• Key challenges
– Understand the semantics of network patterns
– Identify domain-specific network patterns (e.g., fault diagnosis & prediction in IT systems, information spread and access control on social networks, disease propagation in protein networks, etc.)
– How to learn and represent network patterns?
– How to scalably match network patterns against an online stream of network events?
Simplified Examples (each is a sequence of events e1, e2, e3)
• Communication network: iBGP server; OSPF networks N1 and N2
– e1: Update configuration → withdraw prefix announcement
– e2: N1 says N2 is not reachable
– e3: N2 says N1 is not reachable
• Organization: Director D; Employees N1 and N2
– e1: Meeting with D and N1
– e2: Email from N1 to N2
– e3: N2 updates project design document
• Social network: Person P; Friends N1 and N2
– e1: P updates a blog on her Facebook page
– e2: N1 sends friend request to N2
– e3: N2 views P’s updates and accepts N1’s friend request
Network Patterns
• Notation and Formalism
– Event data: <nodeId, type, timestamp, monitorId>
– Network Pattern: <event types, spatial pattern, temporal pattern>
– INTERFACE DOWN → <LINK DOWN, NEIGHBOR, TIME WINDOW>
• Temporal Pattern
– E.g.: Markov chains, frequent item sets
Figure: Temporal Pattern: Markov Chain (states e1, e2, e3; transitions t11, t12, t13, t22, t23, t33)
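A minimal sketch of the event tuple and of one plausible reading of the example pattern INTERFACE DOWN → <LINK DOWN, NEIGHBOR, TIME WINDOW>: an INTERFACE DOWN event matches when a LINK DOWN event was reported at a neighboring node within a time window. The class name `Event`, the function `matches_neighbor_window`, the sample topology and the window value are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Mirrors the slide's tuple <nodeId, type, timestamp, monitorId>.
    node_id: str
    type: str
    timestamp: float
    monitor_id: str


def matches_neighbor_window(trigger, events, neighbors, window):
    """True if a LINK DOWN event occurred at a neighbor of the trigger's
    node within `window` time units of the trigger (assumed semantics)."""
    if trigger.type != "INTERFACE DOWN":
        return False
    adjacent = neighbors.get(trigger.node_id, set())
    return any(
        e.type == "LINK DOWN"
        and e.node_id in adjacent
        and abs(e.timestamp - trigger.timestamp) <= window
        for e in events
    )


# Hypothetical three-router chain r1 - r2 - r3.
neighbors = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2"}}
events = [
    Event("r2", "LINK DOWN", 100.0, "m1"),
    Event("r3", "LINK DOWN", 500.0, "m1"),
]
trigger = Event("r1", "INTERFACE DOWN", 103.0, "m2")
matched = matches_neighbor_window(trigger, events, neighbors, window=5.0)
```

Here the spatial part of the pattern is the `neighbors` lookup and the temporal part is the window test; richer temporal patterns such as Markov chains would replace the window test.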
Figure: Temporal Pattern: Frequent Item Sets
• Spatial Pattern: Composition/Closures of one or more topological relationships
– Communication networks: upstream, downstream, neighbor, tunnel
– Social networks: manages, friends, team members, IM buddies
– Biological network: catalyst, inhibitor, suppressor
Figure: Spatial Pattern: Downstream (transitive closure)
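The transitive closure of a topological relationship such as downstream can be sketched with a breadth-first traversal. The function name `transitive_closure` and the four-node sample graph are illustrative assumptions, not part of the talk.

```python
from collections import deque


def transitive_closure(adjacency, start):
    """Return every node reachable from `start` in one or more hops."""
    seen = set()
    queue = deque(adjacency.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(adjacency.get(node, ()))
    return seen


# Hypothetical direct "downstream" edges: a feeds b and c, which both feed d.
downstream = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
closure_a = transitive_closure(downstream, "a")
```

Running the same traversal over the reversed edges would give the upstream closure, which is why one index can serve both relationships.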
Fault Diagnosis and Prediction in Communication Networks
• Challenges: improve scalability & expressiveness of fault-diagnosis
– Limitation of current solutions: a complexity that grows as the square of the network size
– Correlation rules are pair-wise: expensive to support complex fault diagnosis (e.g., predicting soft failures, router failure from VRF tunnel events, etc.)
– Lacks predictive capability
• Approach:
– Fault signatures encode temporal patterns (frequent item sets, Markov chains) and topological patterns that span the network (upstream, downstream, neighbors, VPN tunnels, etc.)
– Topologically index streaming monitoring data to facilitate scalable single-pass event correlation and fault-diagnosis
– Results in linear complexity – increased scalability
Figure: Traditional RCA Engine vs. Proposed Approach (components shown: Monitoring Data (Omnibus), Correlation Engine (ITNM RCA), Pair-wise correlation rules, Topology, Topological Index, Fault Signatures (Network Patterns), Fault diagnosis)
Complexity:
– Traditional RCA engine: Monitoring data x Monitoring data x Rules
– Proposed approach: Monitoring data x Network Diameter x Signatures
– Monitoring data ~ linear in network size
– Network diameter ~ logarithmic in network size for power-law networks
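The comparison above can be written out as a worked estimate. The symbols are assumptions introduced here for illustration: n is the network size, |M| the volume of monitoring data, |R| the number of pair-wise rules, |S| the number of fault signatures, and d the network diameter.

```latex
\begin{align*}
|M| &= O(n), \qquad d = O(\log n) \quad \text{(power-law networks)} \\
C_{\text{traditional}} &= O\!\left(|M| \cdot |M| \cdot |R|\right) = O\!\left(n^{2}\,|R|\right) \\
C_{\text{proposed}} &= O\!\left(|M| \cdot d \cdot |S|\right) = O\!\left(n \log n \cdot |S|\right)
\end{align*}
```

Under these assumptions the proposed approach is linear in network size up to a logarithmic factor, compared with quadratic growth for pair-wise correlation.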
Step 1: Learning Network Faults
• Learn fault signatures from historical network event data
– Fault Synopsis: Fault Type → Network Pattern
– Fault Signature: Network Pattern → <Fault Type, Spatial Pattern to Localize Faulty Node>
– Fault Diagnosis: <Spatial Pattern to Localize Faulty Node, Network Topology> → Faulty Node
– Fault Prediction: Use incrementally matchable network patterns
• Use indexable network patterns
– Topological relationships are invertible: neighbor⁻¹ = neighbor, downstream⁻¹ = upstream

Fault Synopsis
Fault Type        up-stream   down-stream   neighbor   …
f1                c1          c2            c3         …
f2                c2          c4            c1         …
…

Fault Signature
Network Pattern   up-stream   down-stream   neighbor   …
c1                -           f1, p1        f2, p2     …
c2                f1, p1      f2, p2        -          …
c3                -           -             f1, p1     …
c4                f2, p2
…
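Because the topological relationships are invertible, a fault synopsis can be turned into a fault signature by swapping each relationship for its inverse. The sketch below reproduces the f1/f2 rows from the tables; the names `INVERSE` and `build_signatures` are hypothetical, and the p1/p2 entries are omitted because the slides do not define them.

```python
# Inverse of each topological relationship, as stated on the slide.
INVERSE = {
    "up-stream": "down-stream",
    "down-stream": "up-stream",
    "neighbor": "neighbor",
}

# Fault synopsis: fault type -> {relationship: network pattern}.
synopsis = {
    "f1": {"up-stream": "c1", "down-stream": "c2", "neighbor": "c3"},
    "f2": {"up-stream": "c2", "down-stream": "c4", "neighbor": "c1"},
}


def build_signatures(synopsis):
    """Invert the synopsis into pattern -> {relationship: [fault types]}."""
    signatures = {}
    for fault, row in synopsis.items():
        for relation, pattern in row.items():
            cell = signatures.setdefault(pattern, {})
            cell.setdefault(INVERSE[relation], []).append(fault)
    return signatures


signatures = build_signatures(synopsis)
```

For example, pattern c1 is seen up-stream of fault f1, so observing c1 points to f1 down-stream of the observing node.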
Step 2: Online Matching
• Fault localization using topological indices and hierarchical evidence aggregation
– Topology indexing algorithms + space-time trade-off in computing R(x) and R⁻¹(x), where R ∈ {upstream, downstream, neighbor, tunnel, …}
– Scalable hierarchical evidence aggregation for efficient fault diagnosis

Network Pattern   up-stream     down-stream   neighbor      VPN Tunnel
c1                Device Down   -             f1            -
c2                -             f2            -             Device Down
c3                -             -             Device Down   f3

Figure: Scalable Hierarchical Evidence Aggregation (evidence aggregation tree; labels shown: n1, n2, c1, c2, c3, f1, f2, …, fn-1, fn, bf)
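Fault localization from matched patterns can be sketched as follows: each matched pattern at a node votes for the candidate nodes that the signature's relationship points to, and votes are summed per (fault, node) pair. This is a flat counting stand-in under assumed data shapes, not the hierarchical aggregation structure from the talk; the function `localize`, the sample signatures and the three-router chain are hypothetical.

```python
from collections import Counter


def localize(observations, signatures, topology):
    """Sum evidence per (fault, candidate node).

    observations: list of (node, pattern) matches
    signatures:   pattern -> {relationship: fault}
    topology:     relationship -> {node: set of related nodes}, i.e. R(x)
    """
    evidence = Counter()
    for node, pattern in observations:
        for relation, fault in signatures.get(pattern, {}).items():
            for candidate in topology.get(relation, {}).get(node, ()):
                evidence[(fault, candidate)] += 1
    return evidence


# Hypothetical chain r1 -> r2 -> r3, with closures precomputed as an index.
topology = {
    "down-stream": {"r1": {"r2", "r3"}, "r2": {"r3"}},
    "up-stream": {"r3": {"r1", "r2"}, "r2": {"r1"}},
}
signatures = {"c1": {"down-stream": "f1"}, "c2": {"up-stream": "f1"}}
observations = [("r1", "c1"), ("r3", "c2")]

evidence = localize(observations, signatures, topology)
best = evidence.most_common(1)[0]
```

The two observations agree only on r2, so r2 collects the most evidence for f1. Precomputing `topology` trades memory for lookup time, which is one form of the space-time trade-off mentioned above.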
Details
OFFLINE LEARNING (inputs: Event Datasets, Network Topology; output: Fault Signatures)
• Preparation of training data
– Interval Filter: segment event dataset into event bursts
– Support Filter: eliminate high frequency (regular network ops) and low frequency burst sets (noise)
– Periodicity Filter: eliminate burst sets with high periodicity (maintenance ops)
• Extract temporal patterns
– Markov chains and maximum likelihood estimation
• Extract topological patterns
– Set of topological relationships: SE, NE, DS, US, TN
– Principle of minimum explanation
ONLINE MATCHING (inputs: Event Stream, Fault Signatures, Network Topology; output: Fault Diagnosis and Prediction)
• Match temporal patterns
– Min-Heap + incremental pattern matching
– Fault Signatures: Inverted Index for constant time lookup
– Evidences: <f, v, Rv>
• Indexed network topology
– Space-Time tradeoffs
• Scalable Evidence Aggregation
– BIRCH data structure (hierarchical aggregation)
– Optimizations: filter-and-refine (Bloom filter) + slotted aggregation (BIGTABLE)
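Two of the offline learning steps can be sketched together: an interval filter that splits the event data into bursts, and a maximum likelihood estimate of first-order Markov transition probabilities over event types. The gap threshold, the function names `segment_bursts` and `estimate_transitions`, and the sample events are assumptions for illustration; the support and periodicity filters are not shown.

```python
from collections import Counter, defaultdict


def segment_bursts(events, max_gap):
    """Split (timestamp, event_type) pairs into bursts: a new burst starts
    when the gap to the previous event exceeds max_gap."""
    bursts = []
    for ts, etype in sorted(events):
        if bursts and ts - bursts[-1][-1][0] <= max_gap:
            bursts[-1].append((ts, etype))
        else:
            bursts.append([(ts, etype)])
    return bursts


def estimate_transitions(bursts):
    """Maximum likelihood estimate: P(b | a) = count(a -> b) / count(a -> *)."""
    counts = defaultdict(Counter)
    for burst in bursts:
        types = [etype for _, etype in burst]
        for a, b in zip(types, types[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in counts.items()
    }


# Hypothetical event data: two bursts separated by a long quiet interval.
events = [(0, "e1"), (1, "e2"), (2, "e3"), (100, "e1"), (101, "e2"), (103, "e2")]
bursts = segment_bursts(events, max_gap=10)
transitions = estimate_transitions(bursts)
```

Transitions are counted only inside a burst, so the quiet interval between the two bursts does not contribute an e3 to e1 transition.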
Fault Diagnosis & Prediction: Scalability
• Result Summary:
– SNMP Trap messages from a large enterprise (7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces and 192,000 entities) over 14 days in 2007
– Topology dataset – European backbone network (2,383 main nodes, spans 7 countries, 11 ASes and over 100,000 entities)
– Network fault simulator and monitoring data generation
– Linear scalability; further optimizations: prune-and-search; slotted hierarchical aggregation
• Ongoing activities
– Integration with IBM Tivoli Network Management suite (ITNM) for live testing and fine-tuning
– Network patterns for access control on information flows over: (i) ENRON email data & organization role topology; (ii) Smallblue data & social + information network topology
Figure: Avg Event Processing Time (ms) vs. Fault Rate (0 to 0.1); series: Basic, Opt 1, Opt 1, 2
Summary
• Network patterns encode spatial-temporal properties of various networks
– Ability to scalably mine and match network patterns is key for understanding global network phenomena
• Case study on fault diagnosis and prediction in communication networks
– Complexity of solution has to be linear in network size
– Topologically indexed databases were a key tool for addressing scalability
• Explore more complex network patterns for information, social and biological networks which exhibit stronger coupling relationships
– A failed router does not cause its neighboring router to fail
– A corrupt information node can corrupt its neighbor (e.g., summary node)
– A diseased enzyme can catalyze/inhibit its neighbors
Questions?
Mudhakar Srivatsa
[email protected]