Current Research in Data Mining Research Group
Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs,
LinkedIn, HP Lab & Boeing
May 4, 2017
Outline

An Introduction to Data Mining Research Group

Pattern Discovery Methods

Mining Heterogeneous Information Networks

Construction of Heterogeneous Information Networks from
Unstructured Data

TextCube and OLAP heterogeneous networks

Mining Cyber-Physical Systems and Networks

Conclusions
Data Mining and Data Warehousing
Jiawei Han’s Group at CS, UIUC



Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining

Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan,
RankingCube, TruthFinder, NetClus, RankClass, …

600+ research papers, most cited author/group in data mining

ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. McDowell
Award; Students: ACM KDD Dissertation Awards (2008, 2013), ……

Textbook, “Data mining: Concepts and Techniques,” adopted worldwide

Funded as NSCTA (Network Science Collaborative Technology
Alliance) by ARL [09-14, 15-19], ARO, NIH KnowEnG, NSF,
Boeing, MSR, Google, Yahoo!, HP Labs, …
Graduated 40+ Ph.D.’s: joined Google, Microsoft Research,
Yahoo! Labs, Facebook, Twitter, as well as professors (14)
Supervising 17 Ph.D., 4 M.S. students & 5 visitors/postdocs


Data Mining Research Group in CS, Univ. Illinois
• Student Prominent Awards
  – SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
  – 10-year impact paper awards
  – Best student paper awards, best papers, best posters, …
  – KDDCUP 2013 Runner Up Award
  – IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships
• Graduation:
  – Professors at UVA, UCSB, PSU, U. Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
  – Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.
Mining Sequential Patterns from Shopping Sequences
Sequential pattern mining: Given a set of (shopping) sequences, find
the complete set of frequent subsequences

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
Idea of PrefixSpan: project the database on each frequent prefix
s = <a(abc)(ac)d(cf)>
Prefix <a>: s|<a> = ( , 2) <(abc)(ac)d(cf)>
Prefix <ab>: s|<ab> = ( , 4) <(_c)(ac)d(cf)>
Idea of CloSpan: mine closed sequential patterns to reduce redundancy
Our innovation:
(1) PrefixSpan (TKDE’04): 1598 citations
(2) CloSpan (SDM’03): 568 (reduce redundancy)
(3) FPgrowth (SIGMOD’00): 4956
Difficult to generalize to biosequence mining: approximate patterns & noise
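The containment and support definitions used on this slide can be checked with a short Python sketch. It is only an illustration of the definitions, not PrefixSpan itself: each sequence is modeled as a list of item sets, and the helper names (`parse`, `contains`, `support`) are hypothetical.

```python
def parse(*elements):
    """Hypothetical helper: parse("a", "abc") -> [{"a"}, {"a", "b", "c"}]."""
    return [frozenset(e) for e in elements]


def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Each pattern element must be a subset of a distinct sequence element,
    in order. Greedy left-to-right matching is sufficient for this test.
    """
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)


def support(database, pattern):
    """Count the sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database.values())


# The four-sequence database from the slide.
DB = {
    10: parse("a", "abc", "ac", "d", "cf"),
    20: parse("ad", "c", "bc", "ae"),
    30: parse("ef", "ab", "df", "c", "b"),
    40: parse("e", "g", "af", "c", "b", "c"),
}

print(contains(DB[10], parse("a", "bc", "d", "c")))  # True
print(support(DB, parse("ab", "c")))                 # 2
```

With min_sup = 2, the support of 2 for <(ab)c> (sequences 10 and 30) is what makes it a sequential pattern.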
Mining Frequent Subgraph Patterns from Graph DBs
Graph pattern mining: Given a set of graphs, find the complete set of frequent subgraphs
[Figure: graph dataset (e.g., chemical compound database) and its frequent patterns (min support = 2)]
Idea of gSpan: graph pattern growth + completeness of right-most extension
Our innovation:
(1) gSpan (ICDM’02): 1319 citations
(2) CloseGraph (KDD’03): 520 (not to mine subgraphs covered by their super-patterns)
CloseGraph: when a k-edge graph G is extended to (k+1)-edge graphs G1, G2, …, Gn, under what condition can we stop searching their children, i.e., terminate early?
[Figure: results on NCI/NIH AIDS antiviral screen compound data, minsup = 5%]
Extend to mine structures in large single networks (VLDB’11)
Graph Indexing and Graph Similarity Search
Graph Search: Given a query graph Q, find
all the graphs in graph DB containing Q
gIndex key idea: index on frequent and discriminative substructures (mined)
Graph index helps search: query graph, graph index, graph DB
[Figure: two plots. Axes: # indices vs. DB size (1k to 16k); # candidates vs. query size (4 to 24). Series: Path, Frequent Structure, Discriminative Frequent Structure; GraphGrep, gIndex, Actual Match]
grafil key idea: explore feature similarity
[Figure: query Q and graph G compared through their features (approximate features)]
Our Innovation:
gIndex (SIGMOD’04): 419 citations
grafil (SIGMOD’05): similarity search
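The filtering role of a feature-based graph index can be sketched in Python as an inverted index from features to graph IDs. This is a minimal sketch under assumptions, not the gIndex implementation: features are hypothetical string labels assumed to be extracted already, and the final subgraph-isomorphism verification step is omitted.

```python
from collections import defaultdict


def build_index(graph_features):
    """Map each feature to the set of graph IDs that contain it."""
    index = defaultdict(set)
    for graph_id, features in graph_features.items():
        for feature in features:
            index[feature].add(graph_id)
    return index


def candidates(index, query_features, all_ids):
    """Intersect posting lists: a graph containing Q must contain every
    indexed feature of Q. Survivors still need an exact containment check,
    which this sketch does not perform."""
    result = set(all_ids)
    for feature in query_features:
        if feature in index:  # only indexed features can prune
            result &= index[feature]
    return result


# Hypothetical feature sets for three graphs in the database.
graph_features = {
    "g1": {"C-C", "C-O"},
    "g2": {"C-C", "ring"},
    "g3": {"C-C", "C-O", "ring"},
}
index = build_index(graph_features)
print(candidates(index, {"C-O", "ring"}, graph_features))  # {'g3'}
```

The fewer candidates survive the intersection, the fewer expensive exact matches remain, which is the quantity the "# candidates" plot compares.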
Mining Heterogeneous Information Networks
Heterogeneous networks: Multiple object types and/or multiple link types
[Figure: example networks: DBLP Bibliographic Network (Venue, Paper, Author); The IMDB Movie Network (Actor, Movie, Director, Studio); The Facebook Network]
Homogeneous networks are information-lossy projections of heterogeneous networks!
Directly mining information-richer heterogeneous networks
Current work: Mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …
Structured Heterogeneous Network Modeling
Leads to the New Power of Data Mining!

DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Knowledge hidden in DBLP Network | Mining Functions
How are CS research areas structured? | Clustering
Who are the leading researchers on Web search? | Ranking
What are the most essential terms, venues, authors in AI? | Classification + Ranking
Who are the peer researchers of Jure Leskovec? | Similarity Search
Whom will Christos Faloutsos collaborate with? | Relationship Prediction
Which types of relationships are most influential for an author to decide her topics? | Relation Strength Learning
How did the field of Data Mining emerge and evolve? | Network Evolution
Which authors are rather different from their peers in IR? | Outlier/anomaly detection
Power of het. network modeling: Treat Author, Venue, Term, Paper all as first-class citizens!
RankClus: Rank-Based Clustering
RankClus (EDBT’09)/NetClus (KDD’09): Integrate ranking & clustering for mining heterogeneous info networks
Rank treatments for AIDS from MEDLINE
[Figure: DBLP Schema with Venue (V), Author (A), Paper (P), Term (T) and links Publish, Write, Contain; NetClus splits Computer Science Research into net-clusters such as Database, Hardware, Theory, ……]
RankCompete: Organize your photo album automatically!
RankClass: Integration of Ranking and Classification
Knowledge propagation via multi-typed heterogeneous networks
DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class
Top-5 ranked conf.s
Database | Data Mining | AI | IR
VLDB | KDD | IJCAI | SIGIR
SIGMOD | SDM | AAAI | ECIR
ICDE | ICDM | ICML | CIKM
PODS | PKDD | CVPR | WWW
EDBT | PAKDD | ECML | WSDM
Top-5 ranked terms
Database | Data Mining | AI | IR
data | mining | learning | retrieval
database | data | knowledge | information
query | clustering | reasoning | web
system | classification | logic | search
xml | frequent | cognition | text
Our innovation:
ECMLPKDD'10/KDD’11: integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
Potential applications:
Biological network mining
Meta-Path Guided Similarity Search in Networks


Similarity search: Find similar objects in networks
Who are most similar to AnHai Doan?
[Figure: DBLP Network Schema]
Meta-Path: Meta-level description of a path between two objects
Different meta-paths carry rather different semantics
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
Query: AnHai Doan (CS, Wisconsin; Database area; PhD: 2002)
Similar authors shown:
Jignesh Patel (CS, Wisconsin; Database area; PhD: 1998)
Amol Deshpande (CS, Maryland; Database area; PhD: 2004)
Jun Yang (CS, Duke; Database area; PhD: 2001)
Our innovation
PathSim (VLDB’11): Similarity search in heterogeneous networks; a balanced similarity measure; user guidance by selecting different meta-paths
Application in biomedical domain
IBM: search for close relationships among disease, drugs, treatments, side-effects, and explanations
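The balanced measure behind PathSim can be sketched in Python for the APVPA meta-path. For a symmetric meta-path, PathSim scores a pair (x, y) as twice the number of path instances between x and y, divided by the sum of the path instances from x to itself and from y to itself. The author names and venue counts below are hypothetical, and `counts[author][venue]` is assumed to hold the number of papers that author published in that venue.

```python
def path_count(counts, x, y):
    """Number of A-P-V-P-A path instances between authors x and y."""
    return sum(n * counts[y].get(venue, 0) for venue, n in counts[x].items())


def pathsim(counts, x, y):
    """PathSim: 2 * |paths x~>y| / (|paths x~>x| + |paths y~>y|)."""
    denom = path_count(counts, x, x) + path_count(counts, y, y)
    return 2 * path_count(counts, x, y) / denom if denom else 0.0


# Hypothetical author-venue paper counts.
counts = {
    "alice": {"SIGMOD": 2, "VLDB": 1},
    "bob": {"SIGMOD": 2, "VLDB": 1},
    "carol": {"SIGMOD": 50, "VLDB": 20},
    "dave": {"KDD": 3},
}

print(pathsim(counts, "alice", "bob"))    # 1.0
print(pathsim(counts, "alice", "carol"))  # about 0.083
print(pathsim(counts, "alice", "dave"))   # 0.0
```

The normalization is what makes the measure balanced: "carol" shares many more path instances with "alice" than "bob" does, yet scores far lower because her overall visibility is much larger.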
PathPredict: Meta-Path Based Relationship Prediction
Who will be your new coauthors?
[Figure: network schema with node types venue, paper, topic, author and relations publish/publish-1, mention/mention-1, cite/cite-1, contain/contain-1, write/write-1]
Our contribution
PathPredict (ASONAM’11)
Co-author prediction (A—P—A) using topological features encoded by meta paths, e.g., (A—P→P—A).
Which meta-path is more important?
Different meta-paths have different prediction power: p-values obtained from the DBLP data
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors!
(Trained based on data collected in [1996, 2002]; Testing period: [2003, 2009])
Applications
Meta path-guided prediction: Infer or predict new relationships among multi-typed links
Truth Analysis: Enhancing the Quality of
Heterogeneous Information Networks
Motivation: Info. provided can be untrustworthy, error-prone, missing, …
Application: handling conflicting claims on biomedical properties
Our contribution
• TruthFinder (TKDE’08): mutual enhancement of trustworthiness of info providers and claims
• Latent Truth Model (VLDB’12): modeling two sided truth
Experimental datasets: Large and real datasets
• Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
• Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
[Figure: info providers (w1 to w4), claims (f1 to f4), objects (o1, o2)]
[Figure: multiple facts, two-sided claims. Example object: Harry Potter; sources: IMDB, Netflix, BadSource; claim types: positive, negative, correct, incorrect; source quality labels: high precision/high recall, high precision/low recall, low precision/low recall]
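The mutual enhancement between info providers and claims can be illustrated with a small Python sketch. It is a deliberately simplified iteration and not the published TruthFinder or Latent Truth Model formulation: the two update rules, the initial trust of 0.8, and the toy claims are all assumptions made for illustration.

```python
def truth_scores(claims, iterations=5, initial_trust=0.8):
    """Iterate between claim confidence and source trust.

    claims: list of (source, object, value) triples.
    Assumed rules: a claim is wrong only if all its providers are wrong;
    a source's trust is the mean confidence of the claims it provides.
    """
    sources = {s for s, _, _ in claims}
    trust = {s: initial_trust for s in sources}
    providers = {}
    for s, obj, val in claims:
        providers.setdefault((obj, val), set()).add(s)

    confidence = {}
    for _ in range(iterations):
        for fact, srcs in providers.items():
            miss = 1.0
            for s in srcs:
                miss *= 1.0 - trust[s]
            confidence[fact] = 1.0 - miss
        for s in sources:
            own = [confidence[f] for f, srcs in providers.items() if s in srcs]
            trust[s] = sum(own) / len(own)
    return trust, confidence


# Hypothetical claims: w1 and w2 agree with each other, w3 disagrees.
claims = [
    ("w1", "book1", "A"), ("w2", "book1", "A"), ("w3", "book1", "B"),
    ("w1", "book2", "C"), ("w2", "book2", "C"), ("w3", "book2", "D"),
]
trust, confidence = truth_scores(claims)
print(confidence[("book1", "A")] > confidence[("book1", "B")])  # True
print(trust["w1"] > trust["w3"])                                # True
```

Claims backed by trusted sources gain confidence, and sources whose claims are confident gain trust; the sketch ignores conflicts between values of the same object, which a fuller model would need to account for.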
Hierarchical Relationship Discovery
• From partially ordered objects to hierarchy (tree)
• Based on NLP or other techniques to extract partially ordered objects
• Using constraints to discover relationships
Discovery of the Kenny Family Tree
Singleton Potential
Type | Cognitive description
Homophile | Parent and child are similar
Polarity | Parent is superior to child
Support pattern | Patterns frequently occurring with child-parent pairs
Forbidden pattern | Patterns rarely occurring with child-parent pairs
Pairwise Potential Function: Cases
Type | Cognitive description
Attribute augment | Use inherited attributes from parents or children
Label propagate | Similar nodes share similar parents (or children)
Reciprocity | Patterns altering in child-parent & parent-child pairs
Constraints | Restrict certain patterns
[Both tables also have a "Potential definition" column; its formulas are not recoverable here]
Recursive Construction of a Topical Hierarchy by
Phrase Mining
The Framework of CATHY (Constructing A Topical HierarchY)
Components shown: Term co-occurrence network; Topic discovery; Topical phrase mining and ranking; Recursive construction
[Figure: example topical phrases: information retrieval, question answering, relevance feedback, web search, search engine, world wide web, semantic web, learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees]
Growing Parallel Paths (WWW 2011)
[Figure: parallel paths through the DOM trees (HTML, DIV, UL, LI, P, TABLE, TR, TD) of Pages A to F, each ending in an anchor node, with a numbered result list (1 to 6)]
WinaCS: Web Information Network
Analysis for Computer Science
Mappings: Web Pages, Structured Data
Structured data (columns: Name, Zipcode, URL)
Names: Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, Jiawei Han
URLs shown: rsim.cs.illinois.edu/~sadve/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, www.cs.illinois.edu/homes/hanj/ (remaining cells shown as --------)
Database records can be found on link paths!
[Figure: link paths from / (root) [cs.illinois.edu]]
People, Faculty: /people, /people/faculty
  Vikram Adve: /people/faculty/vikramadve; Personal Site: llvm.cs.uiuc.edu/~vadve/Home.html
  Dan Roth: /people/faculty/dan-roth; Personal Site: l2r.cs.uiuc.edu/~danr/
  Jiawei Han: /people/faculty/jiawei-han; Personal Site: www.cs.illinois.edu/homes/hanj/
Research, Data Mining: /research, /research/areas/data (Dan Roth, Jiawei Han)
Research-Insight [SIGMOD’13 Demo]
Query on “Jim Gray”
Query on “Machine Learning”
Advisor-Advisee result for “Kevin Chang”
Potential collaborators for “Jiawei Han”
Event Cube: An Overview
Funded by NASA (2008-2010)
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
[Figure: architecture: Multidimensional Text Database; Event Cube Representation; Analysis Support (Multidimensional OLAP, Ranking, Cause Analysis, Topic Summarization/Comparison, ……); Analyst]
[Figure: example cube. Topic dimension: turbulence, birds, undershoot, overshoot (Encounter, Deviation). Location dimension: LAX, SJC, MIA, AUS, with roll-up to CA, FL, TX. Time dimension: 98.01, 98.02, 99.01, 99.02, with roll-up to 1998, 1999. Operations: drill-down, roll-up]
Text/Topic Cube: General Idea

Heterogeneous: categorical attributes + unstructured text
Event report: categorical attributes (ACN, Time, Location, Place, Environment, ……) + text data
How to combine?
Our solution:
Cube: Categorical Attributes
Text/Topic Model: Unstructured Text
Measure: Term/Topic with Weight (T1: W1, T2: W2, T3: W3, …)
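How categorical dimensions and a term-weight measure combine in one cube can be sketched in Python. This is a minimal sketch under assumptions: the report texts are invented, the measure is a plain term count rather than any weighting used by the actual system, and `build_cube` is a hypothetical helper. The airport-to-state roll-up follows the Location hierarchy from the Event Cube overview.

```python
from collections import Counter, defaultdict

# Location hierarchy: airport -> state.
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}


def build_cube(reports, roll_up_location=False):
    """Aggregate term counts per (year, location) cell.

    reports: iterable of (year, airport, text) triples.
    With roll_up_location=True, airports are merged into their state.
    """
    cube = defaultdict(Counter)
    for year, airport, text in reports:
        loc = AIRPORT_TO_STATE[airport] if roll_up_location else airport
        cube[(year, loc)].update(text.lower().split())
    return cube


# Hypothetical event reports.
reports = [
    (1998, "LAX", "moderate turbulence on approach"),
    (1998, "SJC", "birds and turbulence near runway"),
    (1999, "MIA", "undershoot on landing"),
]

base = build_cube(reports)
rolled = build_cube(reports, roll_up_location=True)
print(base[(1998, "LAX")]["turbulence"])   # 1
print(rolled[(1998, "CA")]["turbulence"])  # 2
```

Because the measure is additive, rolling up a dimension only requires summing the term counts of the merged cells.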
Effective OLAP Exploration


TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube
TEXplorer (CIKM’11): Integrating keyword-based ranking and OLAP exploration
[Figure: example: Healthcare Reform]
EventCube Snapshot: Query Result
MoveMine: Mining Moving Object Databases
A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining
Moving Object Databases", SIGMOD’10 (system demo)
Mining Spatiotemporal and Mobility Data
[Figure: raw movement data (time series view): longitude vs. time and latitude vs. time; density map with spots #1 to #4; time in hours]
Spot #1: Office
Spot #2: Commuting city
Spot #3: Home
Spot #4: Vacation place
Mining Periodicity in Sparse Data [KDD’12]
Event has a period of 20. Occurrences of the event happen between 20k+5 and 20k+10.
[Figure: observations on a time axis with marks at 5, 13, 18, 26, 29, 48, 50, 62, 67, 79]
Segment the data using length 20 and overlay the segments: observations are clustered in the [5,10] interval.
Segment the data using length 16 and overlay the segments: observations are scattered.
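The segment-and-overlay idea can be sketched in Python by folding timestamps modulo a candidate period and measuring how tightly the offsets cluster. This is a simplified illustration and not the KDD'12 method: the timestamps are hypothetical, and the spread score (range of offsets divided by the period) is an assumed stand-in that ignores wrap-around at the segment boundary.

```python
def folded_spread(timestamps, period):
    """Fold timestamps into one segment of length `period` and return the
    range of the offsets as a fraction of the period (smaller = tighter)."""
    offsets = [t % period for t in timestamps]
    return (max(offsets) - min(offsets)) / period


# Hypothetical sparse observations of an event occurring in [20k+5, 20k+10].
timestamps = [5, 26, 29, 48, 50, 67, 88, 109]

print(folded_spread(timestamps, 20))  # 0.25
print(folded_spread(timestamps, 16))  # 0.8125

best = min([16, 20], key=lambda p: folded_spread(timestamps, p))
print(best)  # 20
```

Under the correct period the overlaid observations fall into a narrow window, while a wrong period scatters them across the segment.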
GeoTopic Discovery: Mining Spatial Text
Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
[Figure: results compared across LDM, TDM, GeoFolk, LGTA]
LPTA: Latent Periodic Topic Analysis: Discovery of
Temporal Patterns of Topics



Periodic topic: repeating in regular intervals
Background topic: covered uniformly over the entire period
Bursty topic: A transient topic that is intensively covered only in a certain time period
Time distribution of topics
Integration of both text and time in analysis
Social Relationship Mining from Sensor Trace Data



T-Motif: a time interval [S,T] such that
• many positive pairs meet at that time
• few negative pairs meet at that time
Ex.: MIT Reality mining dataset:
• 94 people tracked for 10 months
• Use only spatiotemporal info
Algs. for efficient mining of T-motifs and effective classification
Mining RFID Data to Explore Trajectories
Warehousing and mining RFID data
[Figure: a trajectory as a sequence of (location, time-in, time-out) stays: (Factory, T1, T2), …, (Shelf, T7, T8), (Checkout, T9, T10)]
Conclusions

An Introduction to Data Mining Research Group

Pattern Discovery Methods

Mining Heterogeneous Information Networks

Construction of Heterogeneous Information Networks from
Unstructured Data

TextCube and OLAP heterogeneous networks

Mining Cyber-Physical Systems and Networks

Lots to be done in this promising research frontier!