Graph-based Pattern Learning
Dr. Larry Holder
School of EECS, WSU
Graphs
• Protein-protein interaction
• Social network
• Internet
• Power grid
• Web
Some Graph Statistics
• Web
  - 10B pages, 1T hyperlinks
  - Topology storage: 10TB
  - Google PageRank: eigenvector of a 10B x 10B adjacency matrix (sparse)
• MySpace
  - 100M users, 10B friendship links
  - Clique/community detection
  - 300K new users per day
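The PageRank entry above refers to the principal eigenvector of a link matrix. As a rough illustration, a power iteration over a sparse adjacency list might look like the sketch below. The damping factor 0.85, the iteration count, and the four-page toy web are assumptions for illustration, not values from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration on a sparse adjacency list {page: [linked pages]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical four-page web; only non-zero entries of the matrix are stored.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
top_page = max(ranks, key=ranks.get)
```

Storing only the out-links of each page is what makes the sparse 10B x 10B case conceivable at all; a dense matrix of that size could not be stored.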
Graph Problems
• Degree
• Diameter
• Centrality
• Shortest path
• Cycles/tours
• Minimum spanning tree
• Traversals/search
• Connectivity
• Clustering
• Partitioning
• Cliques
• Motifs
• Subgraph isomorphism
• Frequent subgraphs
• Pattern learning
• Dynamics
Graph-based Pattern Learning
• Unsupervised pattern discovery
• Hierarchical conceptual clustering
• Supervised pattern learning
• Anomaly detection
• Dynamic graph pattern learning
Unsupervised Pattern Discovery
• Frequency-based (AGM, gSpan, FSG, Gaston)
  - “Graph-based Data Mining”
  - Find all subgraphs g within a set of graph transactions G such that
    $$\frac{|\{g' \in G \mid g \subseteq g'\}|}{|G|} \ge t$$
    where $\subseteq$ is subgraph isomorphism and t is the minimum support
  - Focus on pruning and fast, code-based graph matching
  - Still requires subgraph isomorphism
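To make the minimum-support condition concrete, here is a minimal sketch that counts the fraction of graph transactions containing a pattern. It uses a brute-force subgraph isomorphism test over all injective vertex mappings, which is only workable for tiny graphs and is not how AGM, gSpan, FSG, or Gaston operate. The graph encoding (a label dictionary plus a set of undirected edges, no self-loops) and the toy transactions are assumptions.

```python
from itertools import permutations


def make_graph(labels, edges):
    """Graph = (vertex -> label, set of undirected edges); self-loops not supported."""
    return labels, {frozenset(e) for e in edges}


def contains(pattern, graph):
    """Brute-force test: is `pattern` subgraph-isomorphic to `graph`?"""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[mapping[v]] for v in p_nodes)
        edges_ok = all(
            frozenset(mapping[v] for v in edge) in g_edges for edge in p_edges
        )
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """Fraction of transactions that contain the pattern."""
    return sum(contains(pattern, g) for g in transactions) / len(transactions)


# Hypothetical transactions with chemical-style vertex labels.
transactions = [
    make_graph({1: "C", 2: "O", 3: "H"}, [(1, 2), (1, 3)]),
    make_graph({1: "C", 2: "O"}, [(1, 2)]),
    make_graph({1: "C", 2: "H"}, [(1, 2)]),
]
pattern = make_graph({"x": "C", "y": "O"}, [("x", "y")])
t = 0.5  # minimum support threshold (assumed value)
is_frequent = support(pattern, transactions) >= t
```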
Unsupervised Pattern Discovery
• Graph compression and the minimum description length (MDL) principle
  - The best theory minimizes the description length of the theory and the description length of the data given the theory
• The best graphical pattern S minimizes the description length of S and the description length of the graph G compressed with pattern S:
  $$\min_S \left( DL(S) + DL(G \mid S) \right)$$
• where description length DL(G) is the minimum number of bits needed to represent G (SUBDUE)
• Compression can be based on inexact matches to pattern
[Figure: graph with instances of patterns S1 and S2 marked]
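A small worked example may help show how the compression criterion ranks patterns. The sketch below replaces the bit-level description length with a crude size measure (number of vertices plus number of edges) and assumes each non-overlapping instance of a pattern collapses to a single vertex. Both simplifications, and all the counts, are assumptions for illustration rather than SUBDUE's actual encoding.

```python
def size(vertices, edges):
    """Crude stand-in for description length: count vertices plus edges."""
    return vertices + edges


def total_cost(graph, pattern, instances):
    """DL(S) + DL(G|S) when each instance collapses to a single vertex."""
    gv, ge = graph
    sv, se = pattern
    dl_s = size(sv, se)
    dl_g_given_s = size(gv - instances * (sv - 1), ge - instances * se)
    return dl_s + dl_g_given_s


graph = (20, 25)           # hypothetical graph: 20 vertices, 25 edges
candidates = {
    "S1": ((4, 4), 4),     # 4-vertex, 4-edge pattern with 4 instances
    "S2": ((3, 3), 3),     # 3-vertex, 3-edge pattern with 3 instances
}
costs = {name: total_cost(graph, p, k) for name, (p, k) in candidates.items()}
best = min(costs, key=costs.get)
```

With these made-up counts the uncompressed graph has size 45, S1 gives 8 + 17 = 25, and S2 gives 6 + 30 = 36, so S1 is preferred.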
Hierarchical Conceptual Clustering
• Use iterative process on input graph G
  - Repeat
    • Find best pattern S in graph G
    • Add S to hierarchy
    • G = G compressed with S
  - Until no more compression
• Clustering is a lattice
• Clusters described by pattern
  - Not just instances as in traditional clustering techniques
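The repeat/until loop above can be sketched on a toy scale. The version below is heavily simplified and is not SUBDUE: patterns are restricted to single labeled edges, instances are chosen greedily, and "compression" merges the two endpoints of each instance into one vertex named after the pattern. The function names and the example graph are hypothetical.

```python
from collections import defaultdict


def best_edge_pattern(labels, edges):
    """Return the label pair with the most non-overlapping edges, plus those edges."""
    by_pattern = defaultdict(list)
    for u, v in sorted(edges):
        by_pattern[tuple(sorted((labels[u], labels[v])))].append((u, v))
    best, best_instances = None, []
    for pattern, candidate_edges in sorted(by_pattern.items()):
        used, instances = set(), []
        for u, v in candidate_edges:      # greedy choice of disjoint instances
            if u not in used and v not in used:
                instances.append((u, v))
                used.update((u, v))
        if len(instances) > len(best_instances):
            best, best_instances = pattern, instances
    return best, best_instances


def compress(labels, edges, instances, name):
    """Collapse each instance (u, v) into vertex u, relabelled with the pattern name."""
    merged = {v: u for u, v in instances}
    new_labels = {n: l for n, l in labels.items() if n not in merged}
    for u, _ in instances:
        new_labels[u] = name
    new_edges = set()
    for a, b in edges:
        a, b = merged.get(a, a), merged.get(b, b)
        if a != b:
            new_edges.add((min(a, b), max(a, b)))
    return new_labels, new_edges


def cluster(labels, edges):
    hierarchy = []
    while True:
        pattern, instances = best_edge_pattern(labels, edges)
        # A one-edge pattern costs 3 units (2 vertices + 1 edge) and each
        # instance saves 2, so fewer than 2 instances gives no compression.
        if len(instances) < 2:
            return hierarchy
        name = f"S{len(hierarchy) + 1}"
        hierarchy.append((name, pattern))
        labels, edges = compress(labels, edges, instances, name)


# Hypothetical input: two A-B-C chains joined by one edge.
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C"}
edges = {(1, 2), (2, 3), (4, 5), (5, 6), (3, 4)}
hierarchy = cluster(labels, edges)
```

In this run the second pattern is defined in terms of the first (it contains an S1 vertex), which is the sense in which later clusters build on earlier ones.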
Hierarchical pattern discovered at 7th iteration of SUBDUE
DHS Insight Project: Terrorist Group Data
[Figure: processing pipeline. Mock terrorist scenario -> event generator -> observables: message traffic reports (142) covering fund raising, recruitment, training, reconnaissance, ... -> SRA TEES text extraction system -> entities and relationships -> convert to graph -> SUBDUE pattern learner -> patterns. The example graphs and patterns use the labels male, place, location, affiliation, and organization.]
Supervised Learning
• Given positive graph G+ and negative graph G-
• Find pattern S minimizing DL(G+ | S) / DL(G- | S)
• When |G+|, |G-| >> 1, find pattern S maximizing classification accuracy:
  $$\frac{|\{g \in G^+ \mid S \subseteq g\}| + |\{g \in G^- \mid S \not\subseteq g\}|}{|G^+| + |G^-|} = \frac{TP + TN}{P + N}$$
[Figure: positive graphs and negative graphs -> SUBDUE -> pattern(s)]
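The accuracy criterion can be illustrated with a short sketch. To keep it self-contained, each graph is assumed to be a set of edges between uniquely named vertices, so that "S is a subgraph of g" reduces to set inclusion; real labeled graphs would need a subgraph isomorphism test instead. The example graphs and candidate patterns are made up.

```python
def accuracy(pattern, positives, negatives):
    """(TP + TN) / (P + N) for a pattern used as a classifier."""
    tp = sum(pattern <= g for g in positives)        # pattern found in a positive graph
    tn = sum(not pattern <= g for g in negatives)    # pattern absent from a negative graph
    return (tp + tn) / (len(positives) + len(negatives))


# Hypothetical graphs: sets of edges over uniquely named vertices.
positives = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b")},
]
negatives = [
    {("a", "b")},
    {("a", "b"), ("c", "d")},
]
candidates = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b")}),
]
scores = [accuracy(s, positives, negatives) for s in candidates]
best = max(candidates, key=lambda s: accuracy(s, positives, negatives))
```

Here the single-edge pattern covers every positive graph but also every negative one, so the more specific two-edge pattern scores higher.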
DARPA/AFRL Evidence Assessment, Grouping, Linking and Evaluation (EAGLE) Program
• Evidence DB (EDB) contains simulated data on threat and non-threat activity
  - Persons, targets, capabilities, resources, transfers, and communications
• Convert EDB to SUBDUE graph format
• Positive & negative examples (threat and non-threat)
[Figure: EDB -> threat and non-threat examples -> SUBDUE -> patterns -> evaluate]
Results
          Examples   Entities   Relations   Accuracy   Time
Events    308        533,196    630,733     80%        86 min
Groups    84         457,209    597,163     85%        813 min
Graph Regression (with Nikhil Ketkar, WSU)
• Learn a model Yi = f(Gi), where Yi is a real number and Gi is a graph
  - E.g., solubility or binding activity of chemical compounds
• One approach
  - Apply frequent-graph miner to set of training graphs Gi
  - Frequent subgraphs form a feature vector V
  - Input {(Yi, Vi)} to linear support-vector machine
• gRegress approach
  - Prune feature set based on correlation with other features and lack of correlation with Y
• Learn model using non-linear SVM or piece-wise regression
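The pruning idea (drop features that are redundant with other features or uncorrelated with Y) can be sketched as a simple correlation filter. This is an illustrative filter under assumed thresholds, not the gRegress algorithm itself; the feature names, the 0/1 occurrence vectors, and the target values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def prune(features, y, min_target_corr=0.5, max_pair_corr=0.9):
    """Keep features correlated with y and not redundant with an already kept one."""
    ranked = sorted(features, key=lambda f: -abs(pearson(features[f], y)))
    kept = []
    for f in ranked:
        if abs(pearson(features[f], y)) < min_target_corr:
            continue                      # too weakly related to the target
        if any(abs(pearson(features[f], features[k])) > max_pair_corr for k in kept):
            continue                      # nearly duplicates a kept feature
        kept.append(f)
    return kept


# Hypothetical data: one column per frequent subgraph, one entry per training graph
# (1 if the subgraph occurs in that graph, 0 otherwise).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "ring": [0, 0, 0, 1, 1, 1],
    "ring_copy": [0, 0, 0, 1, 1, 1],
    "chain": [1, 0, 1, 0, 1, 0],
    "oh": [0, 0, 1, 0, 1, 1],
}
kept = prune(features, y)
```

The surviving columns would then be the inputs Vi to whatever regression model is chosen.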
Anomaly Detection (with Bill Eberle, TTU)
• Learn normative patterns of activity
• Detect small, unlikely deviations from normative patterns
• Present anomalies and their context to analyst
[Figure: activity data -> convert to graph -> SUBDUE -> normative pattern -> Graph-Based Anomaly Detection (GBAD) -> anomaly]
GBAD Approach
• Determine normative pattern S using SUBDUE minimum description length (MDL) heuristic that minimizes: M(S,G) = DL(G|S) + DL(S)
• Three algorithms for handling each of the different anomaly categories
  - GBAD-MDL finds anomalous modifications
  - GBAD-P (Probability) finds anomalous insertions
  - GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions
DHS Insight Project: Cargo Data
• Shipment data from PIERS (Port Import Export Reporting Service)
• Only North American imports (U.S., Puerto Rico, Canada)
• 65,535 records (shipments)
• Information categories:
  - General
  - Commodity codes
  - Countries and ports
  - U.S. company names and locations
  - Foreign shipper names and locations
  - Notification party names and locations
  - Shipping line, vessel and packaging
  - Container
  - Weight and shipment
  - Financial
[Figure: example shipment graph. Labels include SHIPMENT, HAS_A, ARRIVAL_INFO, VDATE, FPORT, USPORT, COMMODITY, HARM_DESC, TARIFF, HSCODE, COUNTRIES_AND_PORTS, COUNTRY, CONTAINER, US_IMPORTER, NAME, FINANCIAL, VALUE, CARGO, BOL_NBR, FOREIGN_SHIPPER, FNAME, HAZMAT_FLA, VESSEL, SLINE, MTONS, CONSIZE, TEUS, and VOYAGE. Sample values include “020601”, “YOKOHAMA”, “SEATTLE”, “JAPAN”, 860900, “EMPTY RACK”, “CONTAINER FOR ONE OR MORE MODES OF TRANSPORT”, “TOLU4972933”, “AMERICAN TRI NET EXPRESS”, 27579, 00434100, “TRI NET”, “CSCO”, “LING YUN HE”, 36, 5.60, and 0.00.]
Anomaly Detection in Cargo Data
• Marijuana seized at port in Florida [U.S. Customs Service 2000].
• Smuggler did not disclose some financial information, and ship traversed extra port.
• GBAD-P discovers the extra traversed port; GBAD-MPS discovers the missing financial information.
DHS CyberSecurity R&D Program:
Insider Threat Detection using Graphs
[Figure: Gov’t ID request processing]
Insider Threat Scenarios (CERT Insider Threat Documents)
1. Frontline staff reviews case (invasion of privacy).
2. Frontline staff submits case directly to a case officer (bypassing the approval officer).
3. Frontline staff recommends or decides case.
4. Approval officer reverses accept/reject recommendation from assigned case officer.
5. Unassigned case officer updates or recommends case.
6. Applicant communicates with approval officer or case officer.
7. Unassigned case officer communicates with applicant.
8. Database access from an external source or after hours.
[Figure: GBAD on Scenario 1; GBAD on Scenario 4]
• 1000 cases
• Multiple normative patterns
• 1-3 anomalies
• No false positives
Dynamic Graph Pattern Learning (with Chang hun You, WSU)
• Dynamic graph DG = {G1, G2, …, Gn}
• Find graph rewrite rules between pairs of graphs Gi / Gi+1
  - Find common subgraph between Gi and Gi+1
  - Remainder of Gi to be removed (GR)
  - Remainder of Gi+1 to be added (GA)
• Find transformation rules of temporal patterns in rewrite rules
  - Remove (GR) at time t, then add (GA) at time t+k
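As a minimal sketch of the rewrite-rule step, suppose every vertex has a unique name that persists across time steps, so each graph can be stored as a set of edges and the common subgraph of Gi and Gi+1 is just their intersection. That is an assumption made for brevity; without it, finding the common subgraph is a much harder matching problem. The three-step dynamic graph is hypothetical.

```python
def rewrite_rule(g_prev, g_next):
    """Split two consecutive graphs into common part, removals (GR), additions (GA)."""
    common = g_prev & g_next
    removed = g_prev - common    # GR: remainder of Gi
    added = g_next - common      # GA: remainder of Gi+1
    return common, removed, added


def rewrite_rules(dynamic_graph):
    """One (GR, GA) pair for each consecutive pair Gi / Gi+1."""
    return [
        rewrite_rule(a, b)[1:]
        for a, b in zip(dynamic_graph, dynamic_graph[1:])
    ]


# Hypothetical dynamic graph: the edge b-c disappears and later reappears.
dg = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("a", "b"), ("b", "c")},
]
rules = rewrite_rules(dg)
```

A removal at one step followed by the matching addition a fixed number of steps later is the kind of recurring pair that a temporal transformation rule would summarize.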
[Figure: dynamic graph (BioNet) and graph rewriting rule]
Example: Circadian Rhythm in Drosophila (Fruit Fly)
• Transformation rule (Sub 1): structure appearing and disappearing in network.
• Full temporal transformation rule: boxes are removals (after 5 hours), and ellipses are additions (after 7 hours) of Sub 1. Cycles every 12 hours. Time 6-47 is training; time 54-66 is prediction.
Graph-based Pattern Learning
• Algorithms
  - Pattern discovery and clustering
  - Supervised learning
  - Anomaly detection
  - Dynamic graphs
• Applications
  - Social networks
  - Biological networks
  - Computer networks
  - Process flows
  - (Semantic) Web (linkeddata.org)
  - …
High Performance Computing Issues
• Memory bottleneck
  - Most real-world graphs do not fit in main memory
  - Patterns of access to graph not sequential
• Computational bottleneck
  - Graph and subgraph isomorphism
High Performance Computing Issues
• Functional parallelism
  - Parallel search over space of candidate subgraph patterns
    • High communication to avoid redundancy
    • Child patterns rely on embeddings kept with parent
      - Hinders parallelism
      - Computing embeddings from scratch is NP-complete
• Data parallelism
  - Partition graphs, find patterns in each partition, evaluate patterns in other partitions
    • Edge cuts may break patterns
    • May require NP-complete subgraph isomorphism
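The remark that edge cuts may break patterns is easy to see on a toy example. The sketch below splits an edge list according to a given vertex-to-partition assignment and reports the cut edges. The assignment is supplied by hand here; choosing a good one is a separate problem not addressed by this code, and the graph is made up.

```python
def partition_edges(edges, assignment):
    """Group edges by partition; edges spanning two partitions form the cut."""
    parts, cut = {}, []
    for u, v in edges:
        if assignment[u] == assignment[v]:
            parts.setdefault(assignment[u], []).append((u, v))
        else:
            cut.append((u, v))
    return parts, cut


# Hypothetical graph: triangles a-b-c and d-e-f joined by the edge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
# Vertex c is assigned to partition 1, away from its triangle.
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}
parts, cut = partition_edges(edges, assignment)
```

Partition 0 is left with the single edge a-b, so a per-partition search would find the triangle pattern only once (d-e-f) even though the full graph contains it twice.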
Data-Intensive Scalable Computing
• MapReduce [Google]
  - Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004.
• Hadoop [Yahoo]
  - MapReduce
  - Distributed filesystem
[Figure: Map and Reduce stages]
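For readers unfamiliar with the programming model, the following single-process sketch imitates the map, shuffle, and reduce stages to compute vertex degrees from an edge list. It does not use the Hadoop API; the function names and the edge list are hypothetical, and a real job would run the mapper and reducer on many machines over a distributed filesystem.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: emit (vertex, 1) for both endpoints of an edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degree(vertex, counts):
    """Reduce: sum the counts collected for one vertex."""
    return vertex, sum(counts)


def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)     # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = map_reduce(edges, map_edge, reduce_degree)
```

Degree counting fits the model well because each edge can be processed independently; pattern discovery is harder to express this way because candidate patterns span many edges.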
Multiscale Issues
• Hierarchical networks
  - Higher-level hyper-nodes summarize detail at lower levels
  - E.g., Netflix prize (www.netflixprize.com)
    • 17K movies, 400K users, 100M reviews
    • E.g., user’s average rating vs. specific ratings
    • E.g., movie’s average rating vs. specific rating
[Figure: user, movie, and review vertices; a review with rating 5; avg. rating values 3.5 and 4.5; movie title “Matrix”]
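One way to picture a hyper-node is as a per-user or per-movie summary that replaces the individual review vertices. The sketch below computes average ratings from a list of (user, movie, rating) triples. The three reviews are invented for illustration and are not Netflix prize data.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def summarize(reviews):
    """Collapse individual reviews into per-user and per-movie average ratings."""
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in reviews:
        by_user[user].append(rating)
        by_movie[movie].append(rating)
    user_avg = {u: mean(r) for u, r in by_user.items()}
    movie_avg = {m: mean(r) for m, r in by_movie.items()}
    return user_avg, movie_avg


# Hypothetical low-level detail: one triple per review edge.
reviews = [
    ("u1", "Matrix", 5),
    ("u1", "m2", 2),
    ("u2", "Matrix", 4),
]
user_avg, movie_avg = summarize(reviews)
```

Pattern learning can then run on the much smaller summary graph and descend to the individual reviews only where the summary looks interesting.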
Conclusions
• Graph representation of relational data
• Graph-based pattern learning improves understanding of modeled behavior
• Massive, dynamic graphs
• Numerous application domains
• Graph problems computationally and memory intensive
• HPC (data-intensive computing) and multiscale approaches
For More Information
• Larry Holder, School of EECS, WSU
  - Email: [email protected]
  - URL: www.eecs.wsu.edu/~holder
• SUBDUE
  - Source code in C
  - Datasets
  - www.subdue.org
• D. Cook and L. Holder (2006). Mining Graph Data, Wiley. (www.eecs.wsu.edu/mgd)