ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010
Mining Heterogeneous Information Networks
Jiawei Han†, Yizhou Sun†, Xifeng Yan§, Philip S. Yu‡
† University of Illinois at Urbana-Champaign
§ University of California at Santa Barbara
‡ University of Illinois at Chicago
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing
Outline
- Motivation: Why Mining Heterogeneous Information Networks?
- Part I: Clustering, Ranking and Classification
  - Clustering and Ranking in Information Networks
  - Classification of Information Networks
- Part II: Data Quality and Search in Information Networks
  - Data Cleaning and Data Validation by InfoNet Analysis
  - Similarity Search in Information Networks
- Part III: Advanced Topics on Information Network Analysis
  - Role Discovery and OLAP in Information Networks
  - Mining Evolution and Dynamics of Information Networks
- Conclusions
What Are Information Networks?
- Information network: A network where each node represents an entity (e.g., an actor in a social network) and each link (e.g., a tie) represents a relationship between entities
  - Each node/link may have attributes, labels, and weights
  - Links may carry rich semantic information
- Homogeneous vs. heterogeneous networks
  - Homogeneous networks
    - Single object type and single link type
    - Single-mode social networks (e.g., friends)
    - WWW: a collection of linked Web pages
  - Heterogeneous, multi-typed networks
    - Multiple object and link types
    - Medical network: patients, doctors, diseases, contacts, treatments
    - Bibliographic network: publications, authors, venues
Ubiquitous Information Networks
- Graphs and substructures: chemical compounds, computer vision objects, circuits, XML
- Biological networks
- Bibliographic networks: DBLP, ArXiv, PubMed, ...
- Social networks: Facebook has >100 million active users
- World Wide Web (WWW): >3 billion nodes, >50 billion arcs
- Cyber-physical networks
[Figures: an Internet web, a yeast protein interaction network, a co-author network, social network sites]
Homogeneous vs. Heterogeneous Networks
[Figures: a co-author network (homogeneous) vs. a conference-author network (heterogeneous)]
DBLP: An Interesting and Familiar Network
- DBLP: A computer science publication bibliographic database
  - 1.4 M records (papers), 0.7 M authors, 5 K conferences, ...
- Will this database disclose interesting knowledge about computer science research?
  - What are the popular research fields/subfields in CS?
  - Who are the leading researchers on DB or XQueries?
  - How do the authors in a subfield collaborate and evolve?
  - How many "Wei Wang"s are there in DBLP, and which paper was done by which?
  - Who is Sergey Brin's supervisor, and when?
  - Who are very similar to Christos Faloutsos? ...
- All these kinds of questions, and potentially many more, can be nicely answered by the DBLP-InfoNet
  - How? By exploring the power of links in information networks!
Homo. vs. Hetero.: Differences in DB-InfoNet Mining
- Homogeneous networks can often be derived from their original heterogeneous networks
  - Co-author networks can be derived from author-paper-conference networks by projection on authors only
  - Paper citation networks can be derived from a complete bibliographic network with papers and citations projected
- A heterogeneous DB-InfoNet carries richer information than its corresponding projected homogeneous networks
- Typed heterogeneous InfoNet vs. non-typed hetero. InfoNet (i.e., not distinguishing different types of nodes)
  - Typed nodes and links imply a more structured InfoNet, and thus often lead to more informative discovery
- Our emphasis: Mining "structured" information networks!
Why Mining Heterogeneous Information Networks?
- Most datasets can be "organized" or "transformed" into a "structured" heterogeneous information network!
  - Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, ...
  - Structures can be progressively extracted from less organized data sets by information network analysis
  - Information-rich, inter-related, organized data sets form one or a set of gigantic, interconnected, multi-typed heterogeneous information networks
  - Surprisingly rich knowledge can be derived from such structured heterogeneous information networks
- Our goal: Uncover knowledge hidden in "organized" data
  - Exploring the power of multi-typed, heterogeneous links
  - Mining "structured" heterogeneous information networks!
Outline
- Motivation: Why Mining Heterogeneous Information Networks?
- Part I: Clustering, Ranking and Classification
  - Clustering and Ranking in Information Networks
  - Classification of Information Networks
- Part II: Data Quality and Search in Information Networks
  - Data Cleaning and Data Validation by InfoNet Analysis
  - Similarity Search in Information Networks
- Part III: Advanced Topics on Information Network Analysis
  - Role Discovery and OLAP in Information Networks
  - Mining Evolution and Dynamics of Information Networks
- Conclusions
Clustering and Ranking in Information Networks
- Integrated Clustering and Ranking of Heterogeneous Information Networks
- Clustering of Homogeneous Information Networks
  - LinkClus: Clustering with a link-based similarity measure
  - SCAN: Density-based clustering of networks
  - Others: spectral clustering, modularity-based clustering, probabilistic model-based clustering
- User-Guided Clustering of Information Networks
Clustering and Ranking in Heterogeneous Information Networks
- Ranking and clustering each provides a new view over a network
- Ranking globally without considering clusters → dumb
  - Ranking DB and Architecture conferences together?
- Clustering authors into one huge cluster without distinction?
  - Dull to view thousands of objects (this is why PageRank exists!)
- RankClus: Integrates clustering with ranking
  - Conditional ranking relative to clusters
  - Uses highly ranked objects to improve clusters
- The qualities of clustering and ranking are mutually enhanced
- Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
Global Ranking vs. Cluster-Based Ranking
- A toy example: Two areas, with 10 conferences and 100 authors in each area
RankClus: A New Framework
[Figure: sub-networks induced from clusters, with ranking and clustering mutually reinforcing each other]
The RankClus Philosophy
- Why integrate ranking and clustering?
  - Ranking and clustering can be mutually improved
    - Ranking: Once a cluster becomes more accurate, ranking within it becomes more reasonable and serves as the distinguishing feature of the cluster
    - Clustering: Once the rankings become more distinct from each other, the clusters can be adjusted to obtain more accurate results
  - Not every object should be treated equally in clustering!
- Objects preserve similarity under the new measure space
  - E.g., VLDB vs. SIGMOD
RankClus: Algorithm Framework
- Step 0. Initialization
  - Randomly partition the target objects into K clusters
- Step 1. Ranking
  - Rank each sub-network induced from each cluster; the ranking serves as the feature of that cluster
- Step 2. Generating a new measure space
  - Estimate mixture-model coefficients for each target object
- Step 3. Adjusting clusters
- Step 4. Repeat Steps 1-3 until stable
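The loop above is concrete enough to sketch in code. The following is a runnable toy version (ours, not the authors' code): it uses the "simple ranking" variant (weighted degree) instead of full authority ranking, and a few EM steps to fit the mixture coefficients; W is a conference-by-author link-weight matrix.

```python
# Toy RankClus sketch: cluster conferences (target objects) in a
# bi-typed conference-author network W (n_conf x n_author).
import numpy as np

def rankclus(W, K, n_iter=15, em_steps=5, rng=np.random.default_rng(0)):
    n_conf, n_author = W.shape
    labels = rng.integers(0, K, size=n_conf)          # Step 0: random partition
    for _ in range(n_iter):
        # Step 1: conditional (simple) ranking of authors per cluster
        p = np.ones((K, n_author))
        for k in range(K):
            mask = labels == k
            if mask.any():
                p[k] = W[mask].sum(axis=0) + 1e-12    # degree-based ranking
            p[k] /= p[k].sum()                        # ranking as a distribution
        # Step 2: mixture coefficients pi (n_conf x K), a few EM steps
        pi = np.full((n_conf, K), 1.0 / K)
        for _ in range(em_steps):
            gamma = pi[:, :, None] * p[None, :, :]    # responsibilities
            gamma /= gamma.sum(axis=1, keepdims=True)
            pi = (W[:, None, :] * gamma).sum(axis=2) + 1e-12
            pi /= pi.sum(axis=1, keepdims=True)
        # Step 3: reassign each conference to the nearest center (cosine)
        centers = np.vstack([pi[labels == k].mean(axis=0) if (labels == k).any()
                             else rng.random(K) for k in range(K)])
        unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
        labels = (unit(pi) @ unit(centers).T).argmax(axis=1)
    return labels
```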
Focus on a Bi-Typed Network Case
- Conference-author network, where links can exist between
  - Conference (X) and author (Y)
  - Author (Y) and author (Y)
- Use W to denote the links and their weights
[Figure: W written as a block matrix over X and Y, with a zero X-X block, conference-author blocks W_XY and W_YX, and a co-author block W_YY]
Step 1: Ranking: Feature Extraction
- Two ranking strategies: simple ranking vs. authority ranking
- Simple ranking
  - Proportional to degree counting for objects, e.g., # of publications of an author
  - Considers only the immediate neighborhood in the network
- Authority ranking: an extension of HITS to the weighted bi-typed network
  - Rule 1: Highly ranked authors publish many papers in highly ranked conferences
  - Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
  - Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Encoding Rules in Authority Ranking
- Rule 1: Highly ranked authors publish many papers in highly ranked conferences
- Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
- Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
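In formulas (a hedged reconstruction following the EDBT'09 paper; α trades off conference-based propagation against co-author propagation, and scores are normalized to sum to 1 after each update):

  r_X(j) ∝ Σ_i W_XY(j, i) · r_Y(i)                              (Rules 1-2)
  r_Y(i) ∝ α Σ_j W_YX(i, j) · r_X(j) + (1 − α) Σ_i' W_YY(i, i') · r_Y(i')   (Rule 3)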
Example: Authority Ranking in the 2-Area Conference-Author Network
- The rankings of authors are quite distinct from each other in the two clusters
[Figure: author ranking-score curves for the two clusters]
Step 2: Generate New Measure Space: A Mixture Model Method
- Consider each target object's links as generated under a mixture of the ranking distributions of the clusters
  - Treat each ranking as a distribution: r(Y) → p(Y)
- Each target object xi is mapped into a K-vector (πi,k)
- The parameters are estimated using the EM algorithm
  - Maximize the log-likelihood given all the observed links
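Concretely (our notation, consistent with the slide): a link from target object x_i to attribute object y is generated with probability

  p(y | x_i) = Σ_{k=1..K} π_{i,k} · p_k(y),   with Σ_k π_{i,k} = 1,

and EM maximizes the log-likelihood Σ_i Σ_y W(x_i, y) · log p(y | x_i) over the coefficients π_{i,k}, which then form x_i's K-dimensional feature vector.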
Example: 2-D Coefficients in the 2-Area Conference-Author Network
- The conferences are well separated in the new measure space
[Figure: scatter plots of the two conference groups' component coefficients]
Step 3: Cluster Adjustment in New Measure Space
- Cluster center in the new measure space
  - Vector mean of the objects in the cluster (K-dimensional)
- Cluster adjustment
  - Distance measure: 1 − cosine similarity
  - Assign each object to the cluster with the nearest center
- Why does a better ranking function derive better clustering?
  - Consider the measure-space generation process
  - Highly ranked objects in a cluster play a more important role in deciding a target object's new measure
  - Intuitively, if we can find the highly ranked objects in a cluster, we have, equivalently, found the right cluster
Step-by-Step Running Case Illustration
[Figure: evolution over iterations. Initially, the ranking distributions are mixed together; the two clusters of objects are also mixed, but preserve some similarity. After one round both improve a little and the clusters are almost well separated; then both improve significantly, becoming well separated and finally stable.]
Time Complexity: Linear in the Number of Links
- At each iteration, with |E| edges in the network, m target objects, and K clusters:
  - Ranking for a sparse network: ~O(|E|)
  - Mixture model estimation: ~O(K|E| + mK)
  - Cluster adjustment: ~O(mK^2)
- In all, linear in |E|: ~O(K|E|)
- Note: SimRank will be at least quadratic at each iteration, since it evaluates the distance between every pair in the network
Case Study Dataset: DBLP
- All 2676 conferences and the 20,000 authors with the most publications, from the period 1998-2007
- Both conference-author relationships and co-author relationships are used
- K = 15 (only 5 clusters are shown here)
NetClus: Ranking & Clustering with a Star Network Schema
- Beyond bi-typed information networks: a star network schema
- Split a network into different layers, each represented by a net-cluster
StarNet: Schema & Net-Cluster
- Star network schema
  - Center type: the target type
    - E.g., a paper, a movie, a tagging event
    - A center object is a co-occurrence of a bag of objects of different types, which stands for a multi-way relation among them
  - Surrounding types: attribute (property) types
- Net-cluster
  - Given an information network G, a net-cluster C contains two pieces of information:
    - A node set and link set forming a sub-network of G
    - A membership indicator for each node x: P(x in C)
- Given an information network G and a cluster number K, a clustering of G is a set of net-clusters such that, for each node x, x's membership probabilities over the K net-clusters sum to 1
StarNet of DBLP
[Figure: star schema centered on Research Paper, with Venue (publish), Author (write), and Term (contain) as attribute types]
StarNet of Delicious.com
[Figure: star schema centered on Tagging Event, with Web Site, User, and Tag (contain) as attribute types]
StarNet of IMDB
[Figure: star schema centered on Movie, with Actor/Actress (star in), Director (direct), and Title/Plot (contain) as attribute types]
Ranking Functions
- Ranking an object x of type Tx in a network G, denoted p(x | Tx, G)
  - Gives a score to each object according to its importance
- Different rules define different ranking functions:
  - Simple ranking
    - The ranking score is assigned according to the degree of an object
  - Authority ranking
    - The ranking score is assigned according to mutual enhancement via the propagation of scores through links
    - "Highly ranked conferences accept many good papers published by many highly ranked authors, and highly ranked authors publish many good papers in highly ranked conferences"
Ranking Function (Cont.)
- Priors can be added:
  - P_P(X | Tx, Gk) = (1 − λ_P) P(X | Tx, Gk) + λ_P P_0(X | Tx, Gk)
  - P_0(X | Tx, Gk) is the prior knowledge, usually given as a distribution over only a few terms
  - λ_P is the weight reflecting how strongly we believe the prior distribution
- Ranking distribution
  - Normalize the ranking scores to sum to 1, giving them a probabilistic meaning
  - Similar to the idea of PageRank
NetClus: Algorithm Framework
- Map each target object into a new low-dimensional feature space according to the current net-clustering, and adjust the clustering in the new measure space
  - Step 0: Generate initial random clusters
  - Step 1: Build a ranking-based generative model for target objects for each net-cluster
  - Step 2: Calculate posterior probabilities for target objects (the new measure) and assign each target object to the nearest cluster
  - Step 3: Repeat Steps 1 and 2 until the clusters no longer change significantly
  - Step 4: Calculate posterior probabilities for attribute objects in each net-cluster
Generative Model for Target Objects Given a Net-Cluster
- Each target object stands for a co-occurrence of a bag of attribute objects
- Defining the probability of a target object <=> defining the probability of the co-occurrence of all its associated attribute objects
- Generative probability P(d | Gk) for target object d in cluster Ck, where P(x | Tx, Gk) is the ranking function and P(Tx | Gk) is the type probability (see the reconstruction below)
- Two independence assumptions
  - The probabilities of visiting objects of different types are independent of each other
  - The probabilities of visiting two objects of the same type are independent of each other
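The generative probability itself, reconstructed (our notation, following the NetClus paper; N_G(d) is the set of attribute objects linked to d, and W_{d,x} the link weight):

  P(d | G_k) = Π_{x ∈ N_G(d)} ( P(x | T_x, G_k) · P(T_x | G_k) )^{W_{d,x}}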
Cluster Adjustment
- Use the posterior probabilities of target objects as the new feature space
  - Each target object => a K-dimensional vector
  - Each net-cluster center => a K-dimensional vector
    - The average over the objects in the cluster
  - Assign each target object to the nearest cluster center (e.g., by cosine similarity)
- A sub-network corresponding to the new net-cluster is then built by extracting all the target objects in that cluster and all linked attribute objects
Experiments: DBLP and Beyond
- Data sets
  - DBLP "all-area" data set: all conferences + the "top" 50K authors
  - DBLP "four-area" data set: 20 conferences from DB, DM, ML, IR, plus all authors of and all papers published in these conferences
- Running case illustration
Accuracy Study: Experiments
- Paper clustering accuracy, compared with PLSA, a pure text model that uses no other types of objects or links (with the same prior as NetClus)
[Table: accuracy of paper clustering results]
- Conference clustering accuracy, compared with RankClus, a bi-typed clustering method on only one type
[Table: accuracy of conference clustering results]
NetClus: Distinguishing Conferences
(Posterior probabilities of each conference over the five net-clusters; the dominant entries apparently group the DB, DM, AI, and IR venues cleanly.)

  Conf.    Cluster 1    Cluster 2    Cluster 3    Cluster 4    Cluster 5
  AAAI     0.0022667    0.00899168   0.934024     0.0300042    0.0247133
  CIKM     0.150053     0.310172     0.00723807   0.444524     0.0880127
  CVPR     0.000163812  0.00763072   0.931496     0.0281342    0.032575
  ECIR     3.47023e-05  0.00712695   0.00657402   0.978391     0.00787288
  ECML     0.00077477   0.110922     0.814362     0.0579426    0.015999
  EDBT     0.573362     0.316033     0.00101442   0.0245591    0.0850319
  ICDE     0.529522     0.376542     0.00239152   0.0151113    0.0764334
  ICDM     0.000455028  0.778452     0.0566457    0.113184     0.0512633
  ICML     0.000309624  0.050078     0.878757     0.0622335    0.00862134
  IJCAI    0.00329816   0.0046758    0.94288      0.0303745    0.0187718
  KDD      0.00574223   0.797633     0.0617351    0.067681     0.0672086
  PAKDD    0.00111246   0.813473     0.0403105    0.0574755    0.0876289
  PKDD     5.39434e-05  0.760374     0.119608     0.052926     0.0670379
  PODS     0.78935      0.113751     0.013939     0.00277417   0.0801858
  SDM      0.000172953  0.841087     0.058316     0.0527081    0.0477156
  SIGIR    0.00600399   0.00280013   0.00275237   0.977783     0.0106604
  SIGMOD   0.689348     0.223122     0.0017703    0.00825455   0.0775055
  VLDB     0.701899     0.207428     0.00100012   0.0116966    0.0779764
  WSDM     0.00751654   0.269259     0.0260291    0.683646     0.0135497
  WWW      0.0771186    0.270635     0.029307     0.451857     0.171082
NetClus: Database System Cluster
Ranking terms:
  database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
Ranking venues:
  VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
Ranking authors in XML:
  Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Ake Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185
NetClus: StarNet-Based Ranking and Clustering
- A general framework in which ranking and clustering are successfully combined to analyze InfoNets
- Ranking and clustering can mutually reinforce each other in information network analysis
- NetClus: an extension of RankClus that integrates ranking and clustering and generates net-clusters in a star network with an arbitrary number of types
- Net-cluster: a heterogeneous information sub-network comprising multiple types of objects
- Goes well beyond DBLP and structured relational DBs
  - Flickr: the query "Raleigh" derives multiple clusters
iNextCube: Information Network-Enhanced Text Cube (VLDB'09 Demo)
- Demo: iNextCube.cs.uiuc.edu
- Dimension hierarchies are generated by NetClus
[Figure: architecture of iNextCube; author/conference/term ranking for each research area, where the research areas can be viewed at different levels of the net-cluster hierarchy: All → {DB and IS, Theory, Architecture, ...}; DB and IS → {DB, DM, IR, XML, Distributed DB, ...}]
Clustering and Ranking in Information Networks
- Integrated Clustering and Ranking of Heterogeneous Information Networks
- Clustering of Homogeneous Information Networks
  - LinkClus: Clustering with a link-based similarity measure
  - SCAN: Density-based clustering of networks
  - Others: spectral clustering, modularity-based clustering, probabilistic model-based clustering
- User-Guided Clustering of Information Networks
Link-Based Clustering: Why Useful?
[Figure: a tri-partite graph of Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03-05, vldb03-05, aaai04-05), and Conferences (sigmod, vldb, aaai)]
Questions:
- Q1: How to cluster each type of objects?
- Q2: How to define similarity between each type of objects?
SimRank: Link-Based Similarities
- Two objects are similar if they are linked with the same or similar objects (Jeh & Widom, 2002 - SimRank)
[Figure: Tom and Mary both publish in sigmod03-05, so they are similar, as are the proceedings they share]
- The similarity between two objects a and b is the average similarity between the objects linked with a and those linked with b:

  s(a, b) = C / (|I(a)| |I(b)|) · Σ_{u ∈ I(a)} Σ_{v ∈ I(b)} s(u, v)

  where I(v) is the set of in-neighbors of the vertex v and C is a decay constant.
- But it is expensive to compute: for a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities.
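A minimal sketch of the naive SimRank iteration (our own illustration, with the usual decay C = 0.8), which makes the quadratic cost visible:

```python
# Naive SimRank: O(N^2) space, roughly O(M^2) work per iteration.
from collections import defaultdict

def simrank(edges, C=0.8, n_iter=10):
    in_nbrs = defaultdict(list)
    nodes = set()
    for u, v in edges:                      # edge u -> v
        in_nbrs[v].append(u)
        nodes.update((u, v))
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(n_iter):
        new_s = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_s[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(s[(u, v)] for u in in_nbrs[a]
                                          for v in in_nbrs[b])
                    new_s[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new_s[(a, b)] = 0.0
        s = new_s
    return s

# Example: two proceedings sharing the same authors become similar.
sims = simrank([("Tom", "sigmod03"), ("Mary", "sigmod03"),
                ("Tom", "sigmod04"), ("Mary", "sigmod04")])
print(round(sims[("sigmod03", "sigmod04")], 3))  # 0.4: shared in-neighbors
```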
Observation 1: Hierarchical Structures
- Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals)
[Figures: a hierarchical structure of products in Walmart (all → grocery / electronics (TV, DVD, camera) / apparel); relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004)]
Observation 2: Distribution of Similarity
[Figure: distribution of SimRank similarities among DBLP authors; most of the mass is near zero]
- A power-law distribution exists in the similarities
  - 56% of similarity entries are in [0.005, 0.015]
  - 1.4% of similarity entries are larger than 0.1
- Our goal: Design a data structure that stores the significant similarities and compresses the insignificant ones
Our Data Structure: SimTree
[Figure: a SimTree. Each leaf node represents an object; each non-leaf node represents a group of similar lower-level nodes. Edges between sibling nodes carry similarities (e.g., s(n4, n5) = 0.9), and each child-parent edge carries an adjustment ratio (e.g., 0.8 for node n7).]
- Path-based node similarity, e.g.:
    simp(n7, n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
- The similarity between two nodes is the average similarity between the objects linked with them in other SimTrees
- Adjustment ratio for node x =
    (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes)
LinkClus: SimTree-Based Hierarchical Clustering
- Initialize a SimTree for the objects of each type
- Repeat:
  - For each SimTree, update the similarities between its nodes using the similarities in other SimTrees
    - The similarity between two nodes a and b is the average similarity between the objects linked with them
  - Adjust the structure of each SimTree
    - Assign each node to the parent node it is most similar to
Initialization of SimTrees
- Finding tight groups reduces to frequent pattern mining
  - The tightness of a group of nodes is the support of a frequent pattern
  - E.g., leaf nodes n1-n4 appearing in transactions
      1 {n1}; 2 {n1, n2}; 3 {n2}; 4 {n1, n2}; 5 {n1, n2};
      6 {n2, n3, n4}; 7 {n4}; 8 {n3, n4}; 9 {n3, n4}
    yield tight groups g1 = {n1, n2} and g2 = {n3, n4}
- Initializing a tree:
  - Start from the leaf nodes (level 0)
  - At each level l, find non-overlapping groups of similar nodes with frequent pattern mining
Complexity: LinkClus vs. SimRank
- After initialization, iteratively (1) for each SimTree, update the similarities between its nodes using similarities in other SimTrees, and (2) adjust the structure of each SimTree
- Computational complexity, for two types of objects, N of each, and M linkages between them:

                               Time            Space
  Updating similarities        O(M (logN)^2)   O(M + N)
  Adjusting tree structures    O(N)            O(N)
  LinkClus (total)             O(M (logN)^2)   O(M + N)
  SimRank                      O(M^2)          O(N^2)
Performance Comparison: Experiment Setup
- DBLP dataset: the 4170 most productive authors, and the 154 well-known conferences with the most proceedings
  - Manually labeled the research areas of the 400 most productive authors according to their home pages (or publications)
  - Manually labeled the areas of the 154 conferences according to their calls for papers
- Approaches compared:
  - SimRank (Jeh & Widom, KDD 2002): computes pairwise similarities
  - SimRank with FingerPrints (F-SimRank; Fogaras & Rácz, WWW 2005): pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity
  - ReCom (Wang et al., SIGIR 2003): iteratively clusters objects using the cluster labels of linked objects
DBLP Data Set: Accuracy and Computation Time
[Figure: accuracy vs. number of iterations for clustering authors and conferences, comparing LinkClus, SimRank, ReCom, and F-SimRank]

  Approach     Accr-Author   Accr-Conf   Average time (sec)
  LinkClus     0.957         0.723       76.7
  SimRank      0.958         0.760       1020
  ReCom        0.907         0.457       43.1
  F-SimRank    0.908         0.583       83.6
Email Dataset: Accuracy and Time
- F. Nielsen. Email dataset. http://www.imm.dtu.dk/~rem/data/Email-1431.zip
- 370 emails on conferences, 272 on jobs, and 789 spam emails
- Why is LinkClus even better than SimRank in accuracy?
  - Noise filtering due to the frequent-pattern-based preprocessing

  Approach     Accuracy   Total time (sec)
  LinkClus     0.8026     1579.6
  SimRank      0.7965     39160
  ReCom        0.5711     74.6
  F-SimRank    0.3688     479.7
  CLARANS      0.4768     8.55
Clustering and Ranking in Information Networks
- Integrated Clustering and Ranking of Heterogeneous Information Networks
- Clustering of Homogeneous Information Networks
  - LinkClus: Clustering with a link-based similarity measure
  - SCAN: Density-based clustering of networks
  - Others: spectral clustering, modularity-based clustering, probabilistic model-based clustering
- User-Guided Clustering of Information Networks
SCAN: Density-Based Network Clustering
- Networks made up of the mutual relationships of data elements usually have an underlying structure; clustering is a structure-discovery problem
- Given simply the information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
- Questions to be answered: How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?
- SCAN: an interesting density-based algorithm
  - X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007
Social Networks and Their Clustering Problem
- Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
- Individuals who are hubs know many people in different groups but belong to no single group; politicians, for example, bridge multiple groups
- Individuals who are outliers reside at the margins of society; hermits, for example, know few people and belong to no group
Structural Similarity
- Define Γ(v) as the immediate neighborhood of a vertex v
- The desired features tend to be captured by the structural similarity measure σ(v, w):

  σ(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| |Γ(w)|)

- Structural similarity is large for members of a clique and small for hubs and outliers
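A small sketch of this measure (ours; following the SCAN paper, Γ(v) includes v itself):

```python
import math

def gamma(adj, v):
    """Structural neighborhood of v: its neighbors plus v itself."""
    return adj[v] | {v}

def sigma(adj, v, w):
    """Structural similarity: large within cliques, small for hubs/outliers."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

# Example: a triangle (1, 2, 3) plus a pendant vertex 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sigma(adj, 1, 2))  # 1.0: vertices 1 and 2 share the whole triangle
print(sigma(adj, 3, 4))  # ~0.71: vertex 4 is a degree-1 outlier
```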
Structural Connectivity [1]
- ε-Neighborhood: N_ε(v) = { w ∈ Γ(v) | σ(v, w) ≥ ε }
- Core: CORE_{ε,μ}(v) ⇔ |N_ε(v)| ≥ μ
- Direct structure reachability: DirREACH_{ε,μ}(v, w) ⇔ CORE_{ε,μ}(v) ∧ w ∈ N_ε(v)
- Structure reachability: the transitive closure of direct structure reachability
- Structure connectedness: CONNECT_{ε,μ}(v, w) ⇔ ∃u ∈ V : REACH_{ε,μ}(u, v) ∧ REACH_{ε,μ}(u, w)
[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases"
Structure-Connected Clusters
- A structure-connected cluster C satisfies:
  - Connectivity: ∀v, w ∈ C : CONNECT_{ε,μ}(v, w)
  - Maximality: ∀v, w ∈ V : v ∈ C ∧ REACH_{ε,μ}(v, w) ⇒ w ∈ C
- Hubs:
  - Belong to no cluster
  - Bridge many clusters
- Outliers:
  - Belong to no cluster
  - Connect to fewer clusters
[Figure: a network with two clusters, a hub bridging them, and an outlier]
Algorithm (Running Example)
[Figures: SCAN with μ = 2 and ε = 0.7 on a 14-vertex network (vertices 0-13); structural similarities (e.g., 0.82, 0.75, 0.67) are computed around each visited vertex, cores expand clusters, and low-similarity edges (e.g., 0.51, 0.68) leave hubs and outliers unassigned]
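A compact sketch of the SCAN flow (ours, simplified from the KDD'07 pseudocode; it reuses `gamma` and `sigma` from the sketch above):

```python
def scan(adj, eps=0.7, mu=2):
    """Clusters are structure-connected components grown from cores;
    unclassified vertices end up as hubs or outliers."""
    eps_nbrs = {v: {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps}
                for v in adj}
    label, next_id = {}, 0
    for v in adj:
        if v in label or len(eps_nbrs[v]) < mu:
            continue                     # skip visited vertices and non-cores
        next_id += 1                     # start a new cluster from core v
        queue = [v]
        while queue:                     # expand via direct structure reachability
            u = queue.pop()
            if len(eps_nbrs[u]) < mu:
                continue                 # only cores extend the cluster
            for w in eps_nbrs[u]:
                if w not in label:
                    label[w] = next_id
                    queue.append(w)
    unclassified = [v for v in adj if v not in label]   # hubs and outliers
    return label, unclassified
```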
Running Time
- Running time: O(|E|)
- For sparse networks: O(|V|)
[2] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004)
Clustering and Ranking in Information Networks
- Integrated Clustering and Ranking of Heterogeneous Information Networks
- Clustering of Homogeneous Information Networks
  - LinkClus: Clustering with a link-based similarity measure
  - SCAN: Density-based clustering of networks
  - Others: spectral clustering, modularity-based clustering, probabilistic model-based clustering
- User-Guided Clustering of Information Networks
Spectral Clustering
- Spectral clustering: Find the best cut that partitions the network
  - Different criteria decide what "best" means: min cut, ratio cut, normalized cut, min-max cut
- Using min cut as an example [Wu et al. 1993]:
  - Assign each node i an indicator variable q_i
  - Represent the cut size using the indicator vector and the adjacency matrix
  - Minimize the objective function by solving an eigenvalue system
    - Relax the discrete values of q to continuous values
    - Map the continuous values of q back to discrete ones to get cluster labels
    - Use the second smallest eigenvector for q
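The cut-size formula itself, reconstructed in standard notation (ours): with q_i ∈ {+1, −1} indicating node i's side, adjacency matrix A, and degree matrix D,

  CutSize = (1/4) q^T (D − A) q = (1/8) Σ_{i,j} A_ij (q_i − q_j)^2

Relaxing q to real values (orthogonal to the all-ones vector) makes the minimizer the eigenvector of the graph Laplacian L = D − A with the second smallest eigenvalue (the Fiedler vector).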
Modularity-Based Clustering
- Modularity-based clustering: Find the clustering that maximizes the modularity function Q [Newman et al., 2004]
  - Let e_ij be one-half of the fraction of edges between group i and group j
    - e_ii is the fraction of edges within group i
  - Let a_i be the fraction of all edge ends attached to vertices in group i
  - Q is then defined as the sum, over groups, of the difference between the fraction of within-group edges and the expected fraction of within-group edges
- Maximize Q
  - One possible solution: hierarchically merge the pair of clusters yielding the greatest increase in Q [Newman et al., 2004]
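The Q-function itself, reconstructed (the standard Newman-Girvan form):

  Q = Σ_i ( e_ii − a_i^2 ),   where a_i = Σ_j e_ij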
Probabilistic Model-Based Clustering
- Probabilistic model-based clustering
  - Build generative models for links based on hidden cluster labels
  - Maximize the log-likelihood of all the links to derive the hidden cluster memberships
- An example: Airoldi et al., mixed membership stochastic block models, 2008
  - Define a group-interaction probability matrix B (K x K)
    - B(g, h) denotes the probability of link generation between group g and group h
  - Generative model for a link:
    - For each node, draw a membership probability vector from a Dirichlet prior
    - For each pair of nodes, draw cluster labels according to their membership probabilities (say, g and h), then decide whether to create a link with probability B(g, h)
  - Derive the hidden cluster labels by maximizing the likelihood given B and the prior
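A toy sampler for this generative process (our illustration; the values of the Dirichlet parameter alpha and the matrix B are made up):

```python
# Mixed-membership stochastic block model: toy link sampler.
import numpy as np

rng = np.random.default_rng(0)
K, n = 2, 6
alpha = np.ones(K) * 0.5                 # Dirichlet prior (assumed value)
B = np.array([[0.9, 0.05],               # within-group links are likely,
              [0.05, 0.8]])              # cross-group links are rare
theta = rng.dirichlet(alpha, size=n)     # one membership vector per node

A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        g = rng.choice(K, p=theta[i])    # i's cluster label for this pair
        h = rng.choice(K, p=theta[j])    # j's cluster label for this pair
        A[i, j] = A[j, i] = rng.random() < B[g, h]
print(A)                                 # a sampled adjacency matrix
```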
Clustering and Ranking in Information Networks
- Integrated Clustering and Ranking of Heterogeneous Information Networks
- Clustering of Homogeneous Information Networks
  - LinkClus: Clustering with a link-based similarity measure
  - SCAN: Density-based clustering of networks
  - Others: spectral clustering, modularity-based clustering, probabilistic model-based clustering
- User-Guided Clustering of Information Networks
User-Guided Clustering in a DB InfoNet
[Figure: a relational schema with tables Professor(name, office, position), Group(name, area), Work-In(person, group), Open-course(course, semester, instructor), Course(course-id, name, area), Advise(professor, student, degree), Publish(author, title), Publication(title, year, conf), Register(student, course, semester, unit, grade), and Student(name, office, position). The user hint marks an attribute; the Student table is the target of clustering.]
- A user usually has a goal for clustering, e.g., clustering students by research area
- The user specifies his clustering goal to a DB-InfoNet clustering system: CrossClus
Classification vs. User-Guided Clustering
- A user-specified feature (in the form of an attribute) is used as a hint, not as class labels
  - The attribute may contain too many or too few distinct values
    - E.g., a user may want to cluster students into 20 clusters instead of 3
  - Additional features need to be included in the cluster analysis
[Figure: the user hint marks one attribute; all tuples are used for clustering]
User-Guided Clustering vs. Semi-Supervised Clustering
- Semi-supervised clustering [Wagstaff et al. '01; Xing et al. '02]
  - The user provides a training set consisting of "similar" and "dissimilar" pairs of objects
- User-guided clustering
  - The user specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: semi-supervised clustering constrains pairs of tuples; user-guided clustering marks an attribute over all tuples]
Why Not Typical Semi-Supervised Clustering?
- Much information (spread over multiple relations) is needed to judge whether two tuples are similar
- A user may not be able to provide a good training set
- It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: directly comparing tuples such as (Tom Smith, SC1211, TA) and (Jane Chang, BI205, RA) is hard; the user hint is just an attribute]
CrossClus: An Overview
- CrossClus framework
  - Search for good multi-relational features for clustering
  - Measure similarity between features based on how they cluster objects into groups
  - User guidance + heuristic search for finding pertinent features
  - Clustering based on a k-medoids-based algorithm
- CrossClus: major advantages
  - User guidance, even in a very simple form, plays an important role in multi-relational clustering
  - CrossClus finds pertinent features by computing similarities between features
Selection of Multi-Relational Features
- A multi-relational feature is defined by:
  - A join path, e.g., Student → Register → OpenCourse → Course
  - An attribute, e.g., Course.area
  - (For a numerical feature) an aggregation operator, e.g., sum or average
- Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null] (areas of the courses of each student):

  Tuple   Areas of courses (counts)   Values of feature f
          DB   AI   TH                DB    AI    TH
  t1      5    5    0                 0.5   0.5   0
  t2      0    3    7                 0     0.3   0.7
  t3      1    5    4                 0.1   0.5   0.4
  t4      5    0    5                 0.5   0     0.5
  t5      3    3    4                 0.3   0.3   0.4

- Numerical feature, e.g., the average grade of each student:
  h = [Student → Register, Register.grade, average], e.g., h(t1) = 3.5
Similarity Between Features
- Values of features f (course area) and g (research group), e.g.:

  Tuple   Feature f (course)       Feature g (group)
          DB    AI    TH           Info sys   Cog sci   Theory
  t1      0.5   0.5   0            1          0         0
  t2      0     0.3   0.7          0          0         1
  t3      0.1   0.5   0.4          0          0.5       0.5
  t4      0.5   0     0.5          0.5        0         0.5
  t5      0.3   0.3   0.4          0.5        0.5       0

- For each feature, build the vector of tuple-pair similarities (V_f and V_g); the similarity between two features is the cosine similarity of these two vectors:

  Sim(f, g) = (V_f · V_g) / (|V_f| |V_g|)

[Figure: the tuple-pair similarity matrices V_f and V_g visualized as 5 x 5 heat maps over tuples t1-t5]
Similarity Between Categorical & Numerical Features
- With objects ordered by h, the inner product decomposes into parts that depend only on t_i and parts that depend on all t_j with j < i (so it can be computed incrementally):

  V_h · V_f = 2 Σ_{i=1..N} Σ_{j<i} sim_h(t_i, t_j) · sim_f(t_i, t_j)
            = 2 Σ_{i=1..N} Σ_{k=1..l} f(t_i).p_k (1 − h(t_i)) ( Σ_{j<i} f(t_j).p_k )
            + 2 Σ_{i=1..N} Σ_{k=1..l} f(t_i).p_k ( Σ_{j<i} h(t_j) · f(t_j).p_k )

[Figure: objects ordered by the numerical feature h (grades 2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9), against their categorical feature values f over DB, AI, TH]
Searching for Pertinent Features
- Different features convey different aspects of information
[Figure: features grouped by aspect — research area (research group area, conferences of papers, advisor), academic performance (GPA, GRE score, number of papers), demographic info (permanent address, nationality)]
- Features conveying the same aspect of information usually cluster objects in more similar ways
  - E.g., research group areas vs. conferences of publications
- Given the user-specified feature, find pertinent features by computing feature similarity
Heuristic Search for Pertinent Features
- Overall procedure:
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually
[Figure: the CS Dept schema again, with the user hint on Group.area and Student as the target of clustering; the search expands from the hint along join paths]
- Tuple ID propagation [Yin et al. '04] is used to create multi-relational features
  - The IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple
Clustering with Multi-Relational Features
- Given a set of L pertinent features f1, ..., fL, the similarity between two objects is

  sim(t1, t2) = Σ_{i=1..L} sim_{f_i}(t1, t2) · f_i.weight

- The weight of a feature is determined during feature search by its similarity with other pertinent features
- For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han '94]
Experiments: Compare CrossClus with Existing Methods
- Baseline: Use only the user-specified feature
- PROCLUS [Aggarwal et al. '99]: a state-of-the-art subspace clustering algorithm
  - Uses a subset of features for each cluster
  - We convert the relational database to a single table by propositionalization
  - The user-specified feature is forced to be used in every cluster
- RDBC [Kirsten and Wrobel '00]: a representative ILP clustering algorithm
  - Uses the neighbor information of objects for clustering
  - The user-specified feature is forced to be used
Measuring Clustering Accuracy
- To verify that CrossClus captures the user's clustering goal, we define the "accuracy" of a clustering
- Given a clustering task:
  - Manually find all features that contain information directly related to the task — the standard feature set
    - E.g., clustering students by research area: the standard feature set is {research group, group areas, advisors, conferences of publications, course areas}
  - The accuracy of a clustering result is how similar it is to the clustering generated by the standard feature set:

  deg(C ⊂ C') = ( Σ_{i=1..n} max_{1≤j≤n'} |c_i ∩ c'_j| ) / ( Σ_{i=1..n} |c_i| )

  sim(C, C') = ( deg(C ⊂ C') + deg(C' ⊂ C) ) / 2
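A direct transcription of this measure (ours), with each clustering given as a list of sets of object IDs:

```python
def deg(C, C_prime):
    """How well each cluster of C is covered by its best match in C'."""
    covered = sum(max(len(c & c2) for c2 in C_prime) for c in C)
    return covered / sum(len(c) for c in C)

def clustering_sim(C, C_prime):
    """Symmetrized agreement between two clusterings."""
    return (deg(C, C_prime) + deg(C_prime, C)) / 2

# Example: a 6-object clustering vs. a slightly different reference.
C  = [{1, 2, 3}, {4, 5, 6}]
C2 = [{1, 2}, {3, 4, 5, 6}]
print(clustering_sim(C, C2))  # 0.833...
```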
Clustering Professors: CS Dept Dataset
[Figure: clustering accuracy on the CS Dept dataset for CrossClus (k-medoids, k-means, agglomerative), Baseline, PROCLUS, and RDBC, using the Group feature, the Course feature, and Group+Course; the CrossClus variants score highest]
Resulting clusters:
- (Theory): J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan
- (Graphics): J. Hart, M. Garland, Y. Yu
- (Database): K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai
- (Numerical computing): M. Heath, T. Kerkhoven, E. de Sturler
- (Networking & QoS): R. Kravets, M. Caccamo, J. Hou, L. Sha
- (Artificial Intelligence): G. Dejong, M. Harandi, J. Ponce, L. Rendell
- (Architecture): D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve
- (Operating Systems): D. Mickunas, R. Campbell, Y. Zhou
DBLP Dataset
[Figure: clustering accuracy on DBLP for CrossClus (k-medoids, k-means, agglomerative), Baseline, PROCLUS, and RDBC, with hints Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and all three; the CrossClus variants again score highest]
Outline
- Motivation: Why Mining Heterogeneous Information Networks?
- Part I: Clustering, Ranking and Classification
  - Clustering and Ranking in Information Networks
  - Classification of Information Networks
- Part II: Data Quality and Search in Information Networks
  - Data Cleaning and Data Validation by InfoNet Analysis
  - Similarity Search in Information Networks
- Part III: Advanced Topics on Information Network Analysis
  - Role Discovery and OLAP in Information Networks
  - Mining Evolution and Dynamics of Information Networks
- Conclusions
Classification of Information Networks
- Classification of Heterogeneous Information Networks
  - Graph-regularization-based method (GNetMine)
  - Multi-relational-mining-based method (CrossMine)
  - Statistical relational learning-based method (SRL)
- Classification of Homogeneous Information Networks
Why Classifying Heterogeneous InfoNets?
- Sometimes we do have prior knowledge for part of the nodes/objects!
- Input: Heterogeneous information network structure + class labels for some objects/nodes
- Goal: Classify the heterogeneous networked data into classes, each of which is composed of multi-typed data objects sharing a common topic
  - A natural generalization of classification on homogeneous networked data
- Examples:
  - Email network + several suspicious users/words/emails → find the terrorists, their emails, and their frequently used words (class: terrorism)
  - Military network + the military camp several soldiers/commanders belong to → find the soldiers and commanders belonging to that camp (class: military camp)
  - ...
Classification: Knowledge Propagation
[Figure: labels propagate from labeled objects through the multi-typed links to unlabeled ones]
GNetMine: Methodology
- Classification of networked data can essentially be viewed as a process of knowledge propagation, where information is propagated from labeled objects to unlabeled ones through links until a stationary state is achieved
- A novel graph-based regularization framework addresses the classification problem on heterogeneous information networks
- Respect the differences among link types by preserving consistency over each relation graph, corresponding to each type of links, separately
- Mathematical intuition: the consistency assumption
  - The confidences (f) of two objects x_ip and x_jq belonging to class k should be similar if x_ip ↔ x_jq (R_ij,pq > 0)
  - f should be similar to the given ground truth
GNetMine: Graph-Based Regularization
- Minimize the objective function (λ_ij and α_i encode user preference: how much you value each relationship / the ground truth):

  J(f_1^(k), ..., f_m^(k))
    = Σ_{i,j=1..m} λ_ij Σ_{p=1..n_i} Σ_{q=1..n_j} R_ij,pq ( f_ip^(k) / sqrt(D_ij,pp) − f_jq^(k) / sqrt(D_ji,qq) )^2
    + Σ_{i=1..m} α_i (f_i^(k) − y_i^(k))^T (f_i^(k) − y_i^(k))

- Smoothness constraints: objects linked together should share similar estimations of confidence of belonging to class k
- The normalization term is applied to each type of link separately, to reduce the impact of the popularity of nodes
- The confidence estimations on labeled data should be similar to their pre-given labels
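A minimal iterative solver for an objective of this shape (ours, not the paper's code): alternate between propagating confidence over each symmetrically normalized relation graph and pulling labeled nodes back toward their labels.

```python
import numpy as np

def gnetmine_like(R, Y, lam, alpha, n_iter=100):
    """R[(i, j)]: relation matrix between types i and j; Y[i]: one-hot labels
    (zero rows = unlabeled); lam/alpha: trade-off weights per relation/type.
    Returns confidence matrices f[i] of shape (n_i, K)."""
    S = {}
    for (i, j), Rij in R.items():        # S = D^-1/2 R D'^-1/2 per relation
        d_row = np.maximum(Rij.sum(axis=1), 1e-12)
        d_col = np.maximum(Rij.sum(axis=0), 1e-12)
        S[(i, j)] = Rij / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]
    f = {i: y.astype(float).copy() for i, y in Y.items()}
    for _ in range(n_iter):
        new_f = {}
        for i in Y:
            acc, w = alpha[i] * Y[i], alpha[i]     # anchored ground truth
            for (a, b), Sab in S.items():          # smoothed neighbor confidence
                if a == i:
                    acc += lam[(a, b)] * Sab @ f[b]; w += lam[(a, b)]
                elif b == i:
                    acc += lam[(a, b)] * Sab.T @ f[a]; w += lam[(a, b)]
            new_f[i] = acc / w
        f = new_f
    return f
```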
Experiments on DBLP
- Classes: four research areas (communities): database, data mining, AI, information retrieval
- Four types of objects: paper (14,376), conf. (20), author (14,475), term (8,920)
- Three types of relations: paper-conf., paper-author, paper-term
- Algorithms for comparison:
  - Learning with Local and Global Consistency (LLGC) [Zhou et al., NIPS 2003] — also the homogeneous version of our method
  - Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al., JMLR 2007]
  - Network-only Link-based classification (nLB) [Lu et al., ICML 2003; Macskassy et al., JMLR 2007]
Classification Accuracy: Labeling a Very Small Portion of Authors and Papers

Comparison of classification accuracy on authors (%):
  (a%, p%)        nLB     nLB       wvRN    wvRN      LLGC    LLGC      GNetMine
                  A-A     A-C-P-T   A-A     A-C-P-T   A-A     A-C-P-T   A-C-P-T
  (0.1%, 0.1%)    25.4    26.0      40.8    34.1      41.4    61.3      82.9
  (0.2%, 0.2%)    28.3    26.0      46.0    41.2      44.7    62.2      83.4
  (0.3%, 0.3%)    28.4    27.4      48.6    42.5      48.8    65.7      86.7
  (0.4%, 0.4%)    30.7    26.7      46.3    45.6      48.7    66.0      87.2
  (0.5%, 0.5%)    29.8    27.3      49.0    51.4      50.6    68.9      87.5

Comparison of classification accuracy on papers (%):
  (a%, p%)        nLB     nLB       wvRN    wvRN      LLGC    LLGC      GNetMine
                  P-P     A-C-P-T   P-P     A-C-P-T   P-P     A-C-P-T   A-C-P-T
  (0.1%, 0.1%)    49.8    31.5      62.0    42.0      67.2    62.7      79.2
  (0.2%, 0.2%)    73.1    40.3      71.7    49.7      72.8    65.5      83.5
  (0.3%, 0.3%)    77.9    35.4      77.9    54.3      76.8    66.6      83.2
  (0.4%, 0.4%)    79.1    38.6      78.1    54.4      77.9    70.5      83.7
  (0.5%, 0.5%)    80.7    39.3      77.9    53.5      79.0    73.5      84.1

Comparison of classification accuracy on conferences (%):
  (a%, p%)        nLB A-C-P-T   wvRN A-C-P-T   LLGC A-C-P-T   GNetMine A-C-P-T
  (0.1%, 0.1%)    25.5          43.5           79.0           81.0
  (0.2%, 0.2%)    22.5          56.0           83.5           85.0
  (0.3%, 0.3%)    25.0          59.0           87.0           87.0
  (0.4%, 0.4%)    25.0          57.0           86.5           89.5
  (0.5%, 0.5%)    25.0          68.0           90.0           94.0
Knowledge Propagation: Objects with the Highest Confidence Measure Belonging to Each Class

Top-5 terms related to each area:
  No.  Database   Data Mining      Artificial Intelligence   Information Retrieval
  1    data       mining           learning                  retrieval
  2    database   data             knowledge                 information
  3    query      clustering       reinforcement             web
  4    system     learning         reasoning                 search
  5    xml        classification   model                     document

Top-5 authors concentrated in each area:
  No.  Database              Data Mining          Artificial Intelligence   Information Retrieval
  1    Surajit Chaudhuri     Jiawei Han           Sridhar Mahadevan         W. Bruce Croft
  2    H. V. Jagadish        Philip S. Yu         Takeo Kanade              Iadh Ounis
  3    Michael J. Carey      Christos Faloutsos   Andrew W. Moore           Mark Sanderson
  4    Michael Stonebraker   Wei Wang             Satinder P. Singh         ChengXiang Zhai
  5    C. Mohan              Shusaku Tsumoto      Thomas S. Huang           Gerard Salton

Top-5 conferences concentrated in each area:
  No.  Database   Data Mining   Artificial Intelligence   Information Retrieval
  1    VLDB       KDD           IJCAI                     SIGIR
  2    SIGMOD     SDM           AAAI                      ECIR
  3    PODS       PAKDD         CVPR                      WWW
  4    ICDE       ICDM          ICML                      WSDM
  5    EDBT       PKDD          ECML                      CIKM
Classification of Information Networks
- Classification of Heterogeneous Information Networks
  - Graph-regularization-based method (GNetMine)
  - Multi-relational-mining-based method (CrossMine)
  - Statistical relational learning-based method (SRL)
- Classification of Homogeneous Information Networks
Multi-Relational to Flat Relational Mining?
- Folding multiple relations into a single "flat" one for mining?
[Figure: Doctor, Patient, and Contact relations flattened into one table]
- This cannot be a solution, due to two problems:
  - It loses the information of linkages and relationships, with no preservation of semantics
  - It cannot utilize the information of database structures or schemas (e.g., E-R modeling)
One Approach: Inductive Logic Programming (ILP)
- Find a hypothesis that is consistent with the background knowledge (training data)
  - FOIL, Golem, Progol, TILDE, ...
- Background knowledge
  - Relations (predicates), tuples (ground facts)
- Hypothesis: usually a set of rules, which can predict certain attributes in certain relations
  - E.g., Daughter(X, Y) ← female(X), parent(Y, X)

  Training examples               Background knowledge
  Daughter(mary, ann)   +         Parent(ann, mary)   Female(ann)
  Daughter(eve, tom)    +         Parent(ann, tom)    Female(mary)
  Daughter(tom, ann)    –         Parent(tom, eve)    Female(eve)
  Daughter(eve, ann)    –         Parent(tom, ian)
Inductive Logic Programming Approach to Multi-Relational Classification
- ILP approaches to multi-relational classification
  - Top-down approaches (e.g., FOIL):
      while (enough examples left):
          generate a rule
          remove the examples satisfying this rule
  - Bottom-up approaches (e.g., Golem):
      use each example as a rule; generalize rules by merging them
  - Decision tree approaches (e.g., TILDE)
- ILP approach: pros and cons
  - Advantages: expressive and powerful, and the rules are understandable
  - Disadvantages: inefficient for databases with complex schemas, and inappropriate for continuous attributes
FOIL: First-Order Inductive Learner (Rule Generation)
- Finds a set of rules consistent with the training data
- A top-down, sequential covering learner
- Builds each rule by heuristics
  - Foil gain — a special type of information gain
[Figure: rules such as A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5 each cover a shrinking subset of the positive examples]
- To generate a rule:
    while (true):
        find the best predicate p
        if foil-gain(p) > threshold: add p to the current rule
        else: break
Find the Best Predicate: Predicate Evaluation
- All predicates in a relation can be evaluated based on the propagated IDs
- Use foil-gain to evaluate predicates
  - Suppose the current rule is r. For a predicate p,

    foil-gain(p) = P(r+p) × [ log( P(r+p) / (P(r+p) + N(r+p)) ) − log( P(r) / (P(r) + N(r)) ) ]

    where P(·) and N(·) count the positive and negative examples covered by a rule, and r+p is r with p appended
- Categorical attributes: compute foil-gain directly
- Numerical attributes: discretize with every possible value
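In code (our transcription of the formula above, using base-2 logarithms):

```python
import math

def foil_gain(p_r, n_r, p_rp, n_rp):
    """Foil gain of predicate p: p_r/n_r = positives/negatives covered by
    rule r; p_rp/n_rp = those covered after appending p."""
    if p_rp == 0:
        return 0.0
    return p_rp * (math.log2(p_rp / (p_rp + n_rp))
                   - math.log2(p_r / (p_r + n_r)))

# Example: r covers 10+/10-; adding p narrows coverage to 8+/2-.
print(foil_gain(10, 10, 8, 2))  # positive gain: precision rose from 0.5 to 0.8
```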
Loan Applications: Backend Database
- Target relation: Loan(loan-id, account-id, date, amount, duration, payment); each tuple has a class label indicating whether the loan is paid on time
[Figure: the rest of the financial schema — Account(account-id, district-id, frequency, date), District(district-id, dist-name, region, #people, #city, ratio-urban, avg-salary, unemploy95/96, den-enter, #crime95/96, #lt-500, #lt-2000, #lt-10000, #gt-10000), Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol), Order(order-id, account-id, bank-to, account-to, amount, type), Card(card-id, disp-id, type, issue-date), Disposition(disp-id, account-id, client-id, type), Client(client-id, birth-date, gender, district-id)]
- How do we make decisions on loan applications?
CrossMine: An Effective Multi-Relational Classifier
- Methodology
  - Tuple-ID propagation: an efficient and flexible method for virtually joining relations
  - Confine the rule-search process to promising directions
  - Look-one-ahead: a more powerful search strategy
  - Negative tuple sampling: improve efficiency while maintaining accuracy
Tuple ID Propagation

  Loan (target relation; rows correspond to Applicants #1-#4)
  Loan ID   Account ID   Amount   Duration   Decision
  1         124          1000     12         Yes
  2         124          4000     12         Yes
  3         108          10000    24         No
  4         45           12000    36         No

  Account (after propagation)
  Account ID   Frequency   Open date   Propagated ID   Labels
  124          monthly     02/27/93    1, 2            2+, 0–
  108          weekly      09/23/97    3               0+, 1–
  45           monthly     12/09/96    4               0+, 1–
  67           weekly      01/01/97    (null)          0+, 0–

- Possible predicates:
  - Frequency = 'monthly': 2+, 1–
  - Open date < 01/01/95: 2+, 0–
- Propagate the tuple IDs of the target relation to the non-target relations
- Virtually join relations, avoiding the high cost of physical joins
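A sketch of tuple-ID propagation (ours): carry the target-tuple IDs and class labels across a join key instead of materializing the join.

```python
from collections import defaultdict

def propagate_ids(target_rows, other_rows, key):
    """For each row of the non-target relation, return the propagated
    target-tuple IDs and the (positive, negative) label counts."""
    by_key = defaultdict(list)
    for t in target_rows:
        by_key[t[key]].append(t)
    result = []
    for r in other_rows:
        matched = by_key.get(r[key], [])
        ids = [t["id"] for t in matched]
        pos = sum(t["label"] == "Yes" for t in matched)
        result.append((r, ids, (pos, len(matched) - pos)))
    return result

loans = [{"id": 1, "account": 124, "label": "Yes"},
         {"id": 2, "account": 124, "label": "Yes"},
         {"id": 3, "account": 108, "label": "No"},
         {"id": 4, "account": 45,  "label": "No"}]
accounts = [{"account": 124}, {"account": 108},
            {"account": 45},  {"account": 67}]
for row, ids, (p, n) in propagate_ids(loans, accounts, "account"):
    print(row["account"], ids, f"{p}+, {n}-")   # reproduces the table above
```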
Tuple ID Propagation (Idea Outlined)
- Efficient
  - Only the tuple IDs are propagated
  - Time and space usage is low
- Flexible
  - IDs can be propagated among non-target relations
  - Many sets of IDs can be kept in one relation, propagated along different join paths
[Figure: IDs flow from the target relation to R1, R2, and R3 along join paths]
Rule Generation: Example
- Rule generation:
    Start at the target relation (Loan)
    Repeat:
      - Search in all active relations
      - Search in all relations joinable to active relations
      - Add the best predicate to the current rule
      - Set the involved relation to active
    Until:
      - The best predicate does not have enough gain, or
      - The current rule is too long
[Figure: the loan-application schema, with the first predicate found in Account (joined to Loan) and the second predicate found one more join away]
Look-One-Ahead in Rule Generation
- Two types of relations: entity relations and relationship relations
- Useful predicates often cannot be found on relations of relationship
- The CrossMine solution: when propagating IDs to a relation of relationship, propagate one more step, to the next relation of entity
[Figure: no good predicate exists on the relationship relation sitting between the target relation and the entity relation]
Negative Tuple Sampling
- Each time a rule is generated, the covered positive examples are removed
- After generating many rules, there are far fewer positive examples than negative ones
  - Good rules cannot be built (low support)
  - It is still time-consuming (large number of negative examples)
- Solution: sampling of the negative examples
  - Improves efficiency without affecting rule quality
[Figure: a few remaining positive examples among many negatives; only a sample of the negatives is kept]
Real Datasets
- PKDD Cup 99 dataset — loan application:

  Approach    Accuracy   Time (per fold)
  FOIL        74.0%      3338 sec
  TILDE       81.3%      2429 sec
  CrossMine   90.7%      15.3 sec

- Mutagenesis dataset (4 relations): with only 4 relations, TILDE does a good job, though it is slow:

  Approach    Accuracy   Time (per fold)
  FOIL        79.7%      1.65 sec
  TILDE       89.4%      25.6 sec
  CrossMine   87.7%      0.83 sec
Classification of Information Networks
- Classification of Heterogeneous Information Networks
  - Graph-regularization-based method (GNetMine)
  - Multi-relational-mining-based method (CrossMine)
  - Statistical relational learning-based method (SRL)
- Classification of Homogeneous Information Networks
Probabilistic Relational Models in Statistical Relational Learning
- Goal: model the distribution of data in relational databases
  - Treat both entities and relations as classes
  - Intuition: objects are no longer independent of each other
  - Build statistical networks according to the dependency relationships between attributes of different classes
- A probabilistic relational model (PRM) consists of
  - A relational schema (from the databases)
  - A dependency structure (between attributes)
  - A local probability model (conditional probability distributions)
- Three major families of probabilistic relational models
  - Relational Bayesian Networks (RBN; Lise Getoor et al.)
  - Relational Markov Networks (RMN; Ben Taskar et al.)
  - Relational Dependency Networks (RDN; Jennifer Neville et al.)
Relational Bayesian Networks (RBN)
- Extend Bayesian networks to consider entities, properties, and relationships in a DB scenario
- Three different kinds of uncertainty
  - Attribute uncertainty: model the conditional probability of an attribute given its parent variables
  - Structural uncertainty: model the conditional probability of a reference or link's existence given its parent variables
  - Class uncertainty: refine the conditional probability by considering sub-classes or a hierarchy of classes
Relational Markov Networks & Relational Dependency Networks
- Similar ideas to Relational Bayesian Networks
- Relational Markov Networks
  - Extend Markov networks
  - Undirected links model the dependency relations, instead of the directed links used in Bayesian networks
- Relational Dependency Networks
  - Extend dependency networks
  - Undirected links model the dependency relations
  - Use pseudo-likelihood instead of exact likelihood
  - Efficient in learning
Classification of Information Networks
- Classification of Heterogeneous Information Networks
  - Graph-regularization-based method (GNetMine)
  - Multi-relational-mining-based method (CrossMine)
  - Statistical relational learning-based method (SRL)
- Classification of Homogeneous Information Networks
Transductive Learning in a Graph
- Problem: Given a graph where class labels are provided for part of the nodes, learn the labels of the unlabeled nodes
- Methods
  - Label propagation algorithms [Zhu et al. 2002; Zhou et al. 2004; Szummer et al. 2001]
    - Iteratively propagate labels to neighbors, according to the transition probabilities defined by the network
  - Graph regularization-based algorithms [Zhou et al. 2004]
    - Intuition: trade off between (1) consistency with the labeled data and (2) consistency between linked objects
    - A quadratic optimization problem
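A minimal label-propagation sketch (ours), in the style of Zhou et al. 2004:

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, n_iter=50):
    """A: adjacency matrix; Y: one-hot label matrix (zero rows = unlabeled).
    Iterates F <- alpha*S*F + (1-alpha)*Y, then returns hard labels."""
    d = np.maximum(A.sum(axis=1), 1e-12)
    S = A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]   # D^-1/2 A D^-1/2
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y             # propagate, then anchor
    return F.argmax(axis=1)

# Example: a 4-node path graph with only the two endpoints labeled.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(label_propagation(A, Y))   # [0 0 1 1]: labels spread to neighbors
```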
Outline
- Motivation: Why Mining Heterogeneous Information Networks?
- Part I: Clustering, Ranking and Classification
  - Clustering and Ranking in Information Networks
  - Classification of Information Networks
- Part II: Data Quality and Search in Information Networks
  - Data Cleaning and Data Validation by InfoNet Analysis
  - Similarity Search in Information Networks
- Part III: Advanced Topics on Information Network Analysis
  - Role Discovery and OLAP in Information Networks
  - Mining Evolution and Dynamics of Information Networks
- Conclusions
Data Cleaning by Link Analysis
- Object reconciliation vs. object distinction as data cleaning tasks
- Link analysis may take advantage of redundancy and facilitate entity cross-checking and validation
- Object distinction: different people/objects do share names
  - In AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
  - In DBLP, 141 papers are written by at least 14 different "Wei Wang"s
- New challenge of object distinction: textual similarity cannot be used
- DISTINCT: object distinction by information network analysis
  - X. Yin, J. Han, and P. S. Yu, "Object Distinction: Distinguishing Objects with Identical Names by Link Analysis", ICDE'07
Entity Distinction: The "Wei Wang" Challenge in DBLP
[Figure: publication records grouped into four real people:
 (1) Wei Wang at UNC — e.g., with Jiong Yang, Richard Muntz (VLDB 1997); Haixun Wang, Jiong Yang, Philip S. Yu (SIGMOD 2002); Jiong Yang, Hwanjo Yu, Jiawei Han (CSB 2003); Jinze Liu (KDD 2004, ICDM 2004)
 (2) Wei Wang at UNSW, Australia — e.g., with Hongjun Lu, Yidong Yuan, Xuemin Lin (VLDB 2004); Haifeng Jiang, Hongjun Lu, Jeffrey Yu (ICDE 2005); Xuemin Lin (ADMA 2005)
 (3) Wei Wang at Fudan University, China — e.g., with Jian Pei, Jiawei Han (CIKM 2002); Haixun Wang, Baile Shi, Peng Wang (ICDM 2005); Yongtai Zhu, Jian Pei, Baile Shi, Chen Wang (KDD 2004)
 (4) Wei Wang at SUNY Buffalo — e.g., with Aidong Zhang, Yuqing Song (WWW 2003)
 plus nearby distractor records by collaborators, e.g., Jian Pei, Jiawei Han, Hongjun Lu, et al. (ICDM 2001) and Jian Pei, Daxin Jiang, Aidong Zhang (ICDE 2005)]
The DISTINCT Methodology
- Measure similarity between references
  - Link-based similarity: linkages between references
    - References to the same object are more likely to be connected (using random walk probability)
  - Neighborhood similarity
    - The neighbor tuples of each reference can indicate the similarity between their contexts
- Self-boosting: train using the "same" bulky data set
- Reference-based clustering
  - Group references according to their similarities
Training with the "Same" Data Set
- Build a training set automatically
  - Select distinct names, e.g., Johannes Gehrke
  - The collaboration behavior within the same community shares some similarity
  - Train the parameters using a typical and large set of "unambiguous" examples
- Use an SVM to learn a model for combining different join paths
  - Each join path is used as two attributes (with link-based similarity and neighborhood similarity)
  - The model is a weighted sum of all attributes
Clustering: Measure Similarity between Clusters
ƒ Single‐link (highest similarity between points in two clusters)?
  ƒ No, because references to different objects can be connected
ƒ Complete‐link (minimum similarity between points in two clusters)?
  ƒ No, because references to the same object may be weakly connected
ƒ Average‐link (average similarity between points in two clusters)?
  ƒ A better measure
  ƒ Refinement: average neighborhood similarity and collective random walk probability
121
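A minimal Python sketch of average‐link grouping of references, assuming a precomputed pairwise similarity matrix (the toy matrix and merge threshold are hypothetical; DISTINCT further refines the measure as noted above):

import numpy as np

def average_link(sim, c1, c2):
    # Average pairwise similarity between two clusters of references
    return float(np.mean([sim[i, j] for i in c1 for j in c2]))

def cluster_references(sim, threshold=0.5):
    # Greedy agglomerative clustering: keep merging the two clusters with
    # the highest average-link similarity until it falls below threshold.
    clusters = [[i] for i in range(sim.shape[0])]
    while len(clusters) > 1:
        pairs = [(average_link(sim, clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        best, a, b = max(pairs)
        if best < threshold:
            break
        clusters[a] += clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return clusters

sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.1],
                [0.2, 0.1, 1.0]])
print(cluster_references(sim))       # -> [[0, 1], [2]]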
Real Cases: DBLP Popular Names
Name                Num_authors  Num_refs  Accuracy  Precision  Recall  F-measure
Hui Fang                 3           9       1.0       1.0       1.0     1.0
Ajay Gupta               4          16       1.0       1.0       1.0     1.0
Joseph Hellerstein       2         151       0.81      1.0       0.81    0.895
Rakesh Kumar             2          36       1.0       1.0       1.0     1.0
Michael Wagner           5          29       0.395     1.0       0.395   0.566
Bing Liu                 6          89       0.825     1.0       0.825   0.904
Jim Smith                3          19       0.829     0.888     0.926   0.906
Lei Wang                13          55       0.863     0.92      0.932   0.926
Wei Wang                14         141       0.716     0.855     0.814   0.834
Bin Yu                   5          44       0.658     1.0       0.658   0.794
Average                                      0.81      0.966     0.836   0.883
122
Distinguishing Different “Wei Wang”s
[Figure: reference clusters produced by DISTINCT for “Wei Wang”, with reference counts; links between clusters mark confusable references]
UNC‐CH (57); UNSW, Australia (19); Fudan U, China (31); SUNY Buffalo (5); Zhejiang U, China (3); Nanjing Normal, China (3); SUNY Binghamton (2); Ningbo Tech, China (2); Purdue (2); Chongqing U, China (2); Harbin U, China (5); Beijing U Com, China; Beijing Polytech (3); NU Singapore (5)
123
Outline
ƒ Motivation: Why Mining Heterogeneous Information Networks?
ƒ Part I: Clustering, Ranking and Classification
  ƒ Clustering and Ranking in Information Networks
  ƒ Classification of Information Networks
ƒ Part II: Data Quality and Search in Information Networks
  ƒ Data Cleaning and Data Validation by InfoNet Analysis
  ƒ Similarity Search in Information Networks
ƒ Part III: Advanced Topics on Information Network Analysis
  ƒ Role Discovery and OLAP in Information Networks
  ƒ Mining Evolution and Dynamics of Information Networks
ƒ Conclusions
124
Truth Validation by Info. Network Analysis
ƒ The trustworthiness problem of the web (according to a survey):
  ƒ 54% of Internet users trust news web sites most of the time
  ƒ 26% for web sites that sell products
  ƒ 12% for blogs
ƒ Veracity (conformity to truth): among multiple conflicting results, can we automatically identify which one is likely the true fact?
ƒ TruthFinder: Truth discovery on the Web by link analysis
  ƒ Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?
  ƒ Our work: Xiaoxin Yin, Jiawei Han, and Philip S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08
125
Conflicting Information on the Web
ƒ Different web sites often provide conflicting info on a subject, e.g., the authors of “Rapid Contextual Design”:
Online Store      Authors
Powell’s books    Holtzblatt, Karen
Barnes & Noble    Karen Holtzblatt, Jessamyn Wendell, Shelley Wood
A1 Books          Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood
Cornwall books    Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood
Mellon’s books    Wendell, Jessamyn
Lakeside books    WENDELL, JESSAMYNHOLTZBLATT, KARENWOOD, SHELLEY
Blackwell online  Wendell, Jessamyn, Holtzblatt, Karen, Wood, Shelley
126
Our Setting: Info. Network Analysis
ƒ Each object has a set of conflicting facts
  ƒ E.g., different author names for a book
ƒ Each web site provides some facts
ƒ How to find the true fact for each object?
[Figure: a bipartite network linking web sites w1–w4 to the facts f1–f5 they provide; the facts group under objects o1 and o2]
127
Basic Heuristics for Problem Solving
1. There is usually only one true fact for a property of an object
2. This true fact appears to be the same or similar on different web sites
   ƒ E.g., “Jennifer Widom” vs. “J. Widom”
3. The false facts on different web sites are less likely to be the same or similar
   ƒ False facts are often introduced by random factors
4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects
128
Overview of the TruthFinder Method
ƒ Confidence of facts ↔ Trustworthiness of web sites
  ƒ A fact has high confidence if it is provided by (many) trustworthy web sites
  ƒ A web site is trustworthy if it provides many facts with high confidence
ƒ The TruthFinder mechanism, an overview:
  ƒ Initially, each web site is equally trustworthy
  ƒ Based on the four heuristics above, infer fact confidence from web site trustworthiness, and then the other way around
  ƒ Repeat until reaching a stable state
129
Analogy to Authority‐Hub Analysis
ƒ Facts ↔ Authorities, Web sites ↔ Hubs
[Figure: a web site w1 (hub, high trustworthiness) linked to a fact f1 (authority, high confidence)]
ƒ Differences from authority‐hub analysis
  ƒ Linear summation cannot be used
    ƒ A web site is trustable if it provides accurate facts, not simply many facts
    ƒ Confidence is the probability of being true
  ƒ Different facts of the same object influence each other
130
Inference on Trustworthiness
ƒ Iterative inference of web site trustworthiness and fact confidence
[Figure: the bipartite network of web sites w1–w4 and facts f1–f4 under objects o1 and o2, with trustworthiness and confidence scores propagating between the two sides]
ƒ True facts and trustable web sites become apparent after some iterations
131
Computational Model: t(w) and s(f)
ƒ The trustworthiness of a web site w, t(w): the average confidence of the facts it provides
  t(w) = Σ_{f∈F(w)} s(f) / |F(w)|,  where F(w) is the set of facts provided by w
ƒ The confidence of a fact f, s(f): one minus the probability that all web sites providing f are wrong
  s(f) = 1 − Π_{w∈W(f)} (1 − t(w)),  where W(f) is the set of web sites providing f, and 1 − t(w) is the probability that w is wrong
132
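A minimal Python sketch of this iteration on hypothetical data (the published TruthFinder additionally dampens the scores and accounts for similarity between facts, both omitted here):

facts_by_site = {"w1": ["f1"], "w2": ["f1", "f2"], "w3": ["f2"], "w4": ["f3"]}
sites_by_fact = {}
for w, facts in facts_by_site.items():
    for f in facts:
        sites_by_fact.setdefault(f, []).append(w)

t = {w: 0.9 for w in facts_by_site}           # all sites start equally trustworthy
for _ in range(20):                           # iterate toward a stable state
    # s(f) = 1 - prod over providers w of (1 - t(w))
    s = {}
    for f, sites in sites_by_fact.items():
        p_all_wrong = 1.0
        for w in sites:
            p_all_wrong *= 1.0 - t[w]
        s[f] = 1.0 - p_all_wrong
    # t(w) = average confidence of the facts that w provides
    t = {w: sum(s[f] for f in fs) / len(fs) for w, fs in facts_by_site.items()}

print(sorted(t.items()), sorted(s.items()))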
Experiments: Finding Truth of Facts
ƒ Determining authors of books
  ƒ Dataset contains 1265 books listed on abebooks.com
  ƒ We analyze 100 random books (using book images)
Case                      Voting  TruthFinder  Barnes & Noble
Correct                     71        85            64
Miss author(s)              12         2             4
Incomplete names            18         5             6
Wrong first/middle names     1         1             3
Has redundant names          0         2            23
Add incorrect names          1         5             5
No information               0         0             2
133
Experiments: Trustable Info Providers
ƒ Finding trustworthy information sources
  ƒ Most trustworthy bookstores found by TruthFinder vs. top‐ranked bookstores by Google (query “bookstore”)
TruthFinder:
Bookstore          Trustworthiness  #Books  Accuracy
TheSaintBookstore       0.971          28     0.959
MildredsBooks           0.969          10     1.0
Alphacraze.com          0.968          13     0.947
Google:
Bookstore          Google rank  #Books  Accuracy
Barnes & Noble          1           97     0.865
Powell’s books          3           42     0.654
134
Outline
ƒ Motivation: Why Mining Heterogeneous Information Networks?
ƒ Part I: Clustering, Ranking and Classification
  ƒ Clustering and Ranking in Information Networks
  ƒ Classification of Information Networks
ƒ Part II: Data Quality and Search in Information Networks
  ƒ Data Cleaning and Data Validation by InfoNet Analysis
  ƒ Similarity Search in Information Networks
ƒ Part III: Advanced Topics on Information Network Analysis
  ƒ Role Discovery and OLAP in Information Networks
  ƒ Mining Evolution and Dynamics of Information Networks
ƒ Conclusions
135
Similarity Search in Information Networks
ƒ Structural similarity vs. semantic similarity
  ƒ Structural similarity: based on structural/isomorphic similarity of sub‐graph/sub‐network structures
  ƒ Semantic similarity: influenced by similar network structures
ƒ Graph‐structure‐based indexing and similarity search
  ƒ Structure‐based indexing, e.g., gIndex, SPath, …
  ƒ Use the index to search for similar graph/network structures
ƒ Substructure indexing methods
  ƒ Key problem: What substructures are good indexing features?
  ƒ gIndex [Yan, Yu & Han, SIGMOD’04]: find frequent and discriminative subgraphs (by graph‐pattern mining)
  ƒ SPath [Zhao & Han, VLDB’10]: use decomposed shortest paths as basic indexing features
136
Why SPath as Indexing Features?
ƒ Shortest paths as neighborhood signatures of vertices (indexing features): scalable, and prunes the search space effectively
ƒ Query processing (by query decomposition): decompose the query graph into a set of indexed shortest paths in SPath
[Figure: a query graph matched against a network via a global lookup table and the neighborhood signature of vertex v3]
137
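A simplified Python sketch of the signature idea: record which vertex labels appear at each BFS distance, and use containment to prune candidate matches (the real SPath index stores richer shortest‐path information; the toy graph and labels are hypothetical):

from collections import deque

def neighborhood_signature(adj, labels, v, k=2):
    # BFS out to distance k; note which labels appear at each distance
    sig = {d: set() for d in range(1, k + 1)}
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == k:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                sig[d + 1].add(labels[w])
                frontier.append((w, d + 1))
    return sig

def may_match(data_sig, query_sig):
    # Prune: a data vertex can match a query vertex only if its neighborhood
    # contains every label the query vertex requires at each distance
    return all(query_sig[d] <= data_sig[d] for d in query_sig)

adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
labels = {"v1": "A", "v2": "B", "v3": "A"}
print(neighborhood_signature(adj, labels, "v1"))  # {1: {'B'}, 2: {'A'}}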
Semantics‐Based Similarity Search in InfoNet
ƒ Search top‐k similar objects of the same type for a query
  ƒ E.g., find the researchers most similar to “Christos Faloutsos”
ƒ Two critical concepts to define a similarity
  ƒ Feature space
    ƒ Traditional data: attributes denoted as numerical values/vectors, sets, etc.
    ƒ Network data: a relation sequence called a “path schema”
    ƒ Existing homogeneous‐network‐based similarity measures do not deal with this problem
  ƒ Measure defined on the feature space
    ƒ Cosine, Euclidean distance, Jaccard coefficient, etc.
    ƒ PathSim
138
Path Schema for DBLP Queries
ƒ Path schema: a path over the InfoNet schema, e.g., APC (author‐paper‐conference) or APA (author‐paper‐author)
ƒ Who are most similar to Christos Faloutsos?
139
Flickr: Which Pictures Are Most Similar?
ƒ Some path schemas lead to similarity that is closer to human intuition
ƒ But other path schemas do not
140
Not All Similarity Measures Are Good
[Figure: sample top‐k results under different measures; some measures favor highly visible objects and yield unreasonable rankings]
141
Long Path Schema May Not Be Good
ƒ Repeat the path schema 2, 4, and infinitely many times for a conference similarity query
142
PathSim: Definition & Properties
ƒ Commuting matrix M_P corresponding to a path schema P
  ƒ Product of the adjacency matrices of the relations in the path schema
  ƒ The element M_P(i,j) denotes the strength between object i and object j under the semantics of path schema P
  ƒ If the adjacency matrices are unweighted, M_P(i,j) is the number of path instances between i and j following path schema P
ƒ PathSim
  ƒ s(i,j) = 2 M_P(i,j) / (M_P(i,i) + M_P(j,j))
  ƒ Properties: symmetric, and self‐maximum (s(i,i) = 1)
143
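A minimal Python sketch of PathSim under the APA path schema, on a hypothetical author–paper incidence matrix:

import numpy as np

# Rows: authors; columns: papers (1 = authorship)
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])

# Commuting matrix for A-P-A: the product of the relation matrices along
# the schema; here M[i, j] counts co-authored path instances.
M = A @ A.T

def pathsim(M, i, j):
    # s(i,j) = 2 M(i,j) / (M(i,i) + M(j,j))
    return 2.0 * M[i, j] / (M[i, i] + M[j, j])

print(pathsim(M, 0, 1))  # similarity of authors 0 and 1 under APA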
Co‐Clustering‐Based Pruning Algorithm
ƒ Store commuting matrices for short path schemas and compute top‐k queries online
ƒ Framework
  ƒ Generate co‐clusters of the materialized commuting matrices, for feature objects and target objects
  ƒ Derive upper bounds for the similarity between an object and a target cluster, and between two objects: safely prune target clusters and objects whose upper‐bound similarity is lower than the current threshold
  ƒ Dynamically update the top‐k threshold
ƒ Performance: baseline vs. pruning
144
Outline
ƒ Motivation: Why Mining Heterogeneous Information Networks?
ƒ Part I: Clustering, Ranking and Classification
  ƒ Clustering and Ranking in Information Networks
  ƒ Classification of Information Networks
ƒ Part II: Data Quality and Search in Information Networks
  ƒ Data Cleaning and Data Validation by InfoNet Analysis
  ƒ Similarity Search in Information Networks
ƒ Part III: Advanced Topics on Information Network Analysis
  ƒ Role Discovery and OLAP in Information Networks
  ƒ Mining Evolution and Dynamics of Information Networks
ƒ Conclusions
145
Role Discovery in Networks: Why It Matters
ƒ Example: an (imaginary) army communication network; from the plain communication links, automatically infer each node’s role: commander, captain, or soldier
146
Role Discovery: Extracting Semantic Information from Links
ƒ Objective: extract semantic meaning from plain links to finely model and better organize information networks
ƒ Challenges
  ƒ Latent semantic knowledge
  ƒ Interdependency
  ƒ Scalability
ƒ Opportunities
  ƒ Human intuition
  ƒ Realistic constraints
  ƒ Crosscheck with collective intelligence
ƒ Methodology: propagate simple intuitive rules and constraints over the whole network
147
Discovery of Advisor‐Advisee Relationships in DBLP Network
ƒ Input: the DBLP research publication network (a temporal collaboration network)
ƒ Output: potential advising relationships and their ranking (r, [st, ed]), i.e., a ranking score and an estimated advising period
ƒ Ref: C. Wang, J. Han, et al., “Mining Advisor‐Advisee Relationships from Research Publication Networks”, SIGKDD 2010
[Figure: a toy temporal collaboration network among Ada, Bob, Jerry, Ying, and Smith (1999–2004), and the visualized chronological hierarchies of inferred advising relationships, each annotated with a score and period such as (0.9, [/, 1998]) or (0.65, [2002, 2004])]
148
Overall Framework
[Figure: the mining framework built on the publication network; notation used below]
ƒ ai: author i
ƒ pj: paper j
ƒ py: paper year
ƒ pn: paper count
ƒ sti,yi: starting time
ƒ edi,yi: ending time
ƒ ri,yi: ranking score
149
Time‐Constrained Probabilistic Factor Graph (TPFG)
ƒ yx: ax’s advisor; stx,yx: starting time; edx,yx: ending time
ƒ g(yx, stx,yx, edx,yx) is a predefined local feature
ƒ fx(yx, Zx) = max g(yx, stx,yx, edx,yx) under the time constraint
ƒ Objective function: P({yx}) = ∏x fx(yx, Zx)
ƒ Zx = {z | x ∈ Yz}, where Yx is the set of potential advisors of ax
150
Experiment Results
ƒ DBLP data: 654,628 authors, 1,076,946 publications, years provided
ƒ Labeled data: Math Genealogy Project, AI Genealogy Project, homepages
Datasets  RULE    SVM    IndMAX          TPFG
TEST1     69.9%   73.4%  75.2% / 78.9%   80.2% / 84.4%
TEST2     69.8%   74.6%  74.6% / 79.0%   81.5% / 84.3%
TEST3     80.6%   86.7%  83.1% / 90.9%   88.8% / 91.3%
(RULE: heuristics; SVM: supervised learning; IndMAX and TPFG: with empirical / optimized parameters)
151
Case Study & Scalability
Advisee         Top Ranked Advisor     Time   Note
David M. Blei   1. Michael I. Jordan   01–03  PhD advisor, 2004 grad
                2. John D. Lafferty    05–06  Postdoc, 2006
Hong Cheng      1. Qiang Yang          02–03  MS advisor, 2003
                2. Jiawei Han          04–08  PhD advisor, 2008
Sergey Brin     1. Rajeev Motwani      97–98  “Unofficial advisor”
152
Graph/Network Summarization: Graph Compression
ƒ Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes
153
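A small Python/NetworkX sketch of the condensation step only, collapsing one occurrence of a common subgraph into a supernode (mining the common subgraphs in the first place, e.g., by frequent‐subgraph pattern mining, is omitted; the toy graph is hypothetical):

import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"),   # a triangle motif
              ("c", "d"), ("d", "e")])

def condense(G, nodes, label):
    # Replace `nodes` by one supernode; edges to the outside are preserved
    H = G.copy()
    keep = list(nodes)
    merged = keep[0]
    for n in keep[1:]:
        H = nx.contracted_nodes(H, merged, n, self_loops=False)
    return nx.relabel_nodes(H, {merged: label})

H = condense(G, {"a", "b", "c"}, "triangle#1")
print(list(H.edges()))   # the triangle is now one node linked to d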
OLAP on Information Networks
ƒ Why OLAP information networks?
  ƒ Advantages of OLAP: interactive exploration of a multi‐dimensional and multi‐level space in a data cube
    ƒ Multi‐dimensional: different perspectives
    ƒ Multi‐level: different granularities
ƒ InfoNet OLAP: roll‐up/drill‐down and slice/dice on information network data
  ƒ Traditional OLAP cannot handle this, because it ignores links among data objects
ƒ Handling two kinds of InfoNet OLAP
  ƒ Informational OLAP
  ƒ Topological OLAP
154
Informational OLAP
ƒ In the DBLP network, study the collaboration patterns among researchers
ƒ Dimensions come from informational attributes attached at the whole‐snapshot level, so‐called Info‐Dims
ƒ I‐OLAP characteristics:
  ƒ Overlay multiple pieces of information
  ƒ No change on the objects whose interactions are being examined
  ƒ In the underlying snapshots, each node is a researcher
  ƒ In the summarized view, each node is still a researcher
155
Topological OLAP
ƒ Dimensions come from the node/edge attributes inside individual networks, so‐called Topo‐Dims
ƒ T‐OLAP characteristics
  ƒ Zoom in / zoom out
  ƒ Network topology changed: “generalized” nodes and “generalized” edges
  ƒ In the underlying network, each node is a researcher
  ƒ In the summarized view, each node becomes an institute that comprises multiple researchers
156
InfoNet OLAP: Operations & Framework
ƒ Roll‐up
  ƒ I‐OLAP: overlay multiple snapshots to form a higher‐level summary via an I‐aggregated network
  ƒ T‐OLAP: shrink the topology and obtain a T‐aggregated network that represents a compressed view, with topological elements (i.e., nodes and/or edges) merged and replaced by corresponding higher‐level ones
ƒ Drill‐down
  ƒ I‐OLAP: return to the set of lower‐level snapshots from the higher‐level overlaid (aggregated) network
  ƒ T‐OLAP: a reverse operation of roll‐up
ƒ Slice/dice
  ƒ I‐OLAP: select a subset of qualifying snapshots based on Info‐Dims
  ƒ T‐OLAP: select a subnetwork based on Topo‐Dims
ƒ The measure is an aggregated graph; other measures like node count, average degree, etc. can be treated as derived
ƒ Graph plays a dual role: (1) data source, and (2) aggregate measure
ƒ Measures could be complex, e.g., maximum flow, shortest path, centrality
ƒ It is possible to combine I‐OLAP and T‐OLAP into a hybrid case
157
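A minimal Python/NetworkX sketch of an I‐OLAP roll‐up: overlay snapshot networks by summing edge weights while leaving the node set unchanged (the toy yearly snapshots are hypothetical; a real cube would first group snapshots by their Info‐Dims):

import networkx as nx

g2009 = nx.Graph(); g2009.add_edge("A", "B", weight=2)
g2010 = nx.Graph(); g2010.add_edge("A", "B", weight=1); g2010.add_edge("B", "C", weight=3)

def i_olap_rollup(snapshots):
    agg = nx.Graph()
    for g in snapshots:                       # nodes stay the same objects;
        for u, v, d in g.edges(data=True):    # only link information is overlaid
            w = d.get("weight", 1)
            if agg.has_edge(u, v):
                agg[u][v]["weight"] += w
            else:
                agg.add_edge(u, v, weight=w)
    return agg

print(i_olap_rollup([g2009, g2010]).edges(data=True))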
Outline
ƒ Motivation: Why Mining Heterogeneous Information Networks?
ƒ Part I: Clustering, Ranking and Classification
  ƒ Clustering and Ranking in Information Networks
  ƒ Classification of Information Networks
ƒ Part II: Data Quality and Search in Information Networks
  ƒ Data Cleaning and Data Validation by InfoNet Analysis
  ƒ Similarity Search in Information Networks
ƒ Part III: Advanced Topics on Information Network Analysis
  ƒ Role Discovery and OLAP in Information Networks
  ƒ Mining Evolution and Dynamics of Information Networks
ƒ Conclusions
158
Mining Evolution and Dynamics of InfoNet
ƒ Many networks come with time information
  ƒ E.g., according to paper publication year, DBLP networks can form network sequences
ƒ Motivation: model the evolution of communities in a heterogeneous network
  ƒ Automatically detect the best number of communities at each timestamp
  ƒ Model the smoothness between communities of adjacent timestamps
  ƒ Model the evolution structure explicitly: birth, death, split
159
Evolution: Idea Illustration
ƒ From network sequences to evolutionary communities
160
Graphical Model: A Generative Model
ƒ A Dirichlet Process Mixture Model‐based generative model
  ƒ At each timestamp, a community depends on the historical communities and a background community distribution
161
Generative Model & Model Inference
ƒ To generate a new paper oi:
  ƒ Decide whether to join an existing community or a new one
    ƒ Join an existing community k with probability nk/(i−1+α)
    ƒ Join a new community with probability α/(i−1+α); decide its prior, either from a background distribution (with probability λ) or from historical communities (with probability (1−λ)πk), and draw the attribute distribution from the prior
  ƒ Generate oi according to the attribute distribution
ƒ Greedy inference for each timestamp: collapsed Gibbs sampling, which samples a cluster label for each target object (e.g., paper)
162
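A minimal Python sketch of the community‐choice step alone, i.e., the Chinese‐restaurant‐process probabilities nk/(i−1+α) for an existing community and α/(i−1+α) for a new one (drawing the attribute distribution from the background vs. historical priors is omitted):

import random

def crp_assignments(n_objects, alpha=1.0, seed=0):
    rng = random.Random(seed)
    counts, labels = [], []          # counts[k] = n_k, community sizes so far
    for i in range(1, n_objects + 1):
        weights = counts + [alpha]   # existing communities, then a new one
        r = rng.uniform(0, i - 1 + alpha)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(0)         # open a new community
        counts[k] += 1
        labels.append(k)
    return labels

print(crp_assignments(10))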
Accuracy Study
ƒ The more types of objects used, the better the accuracy
ƒ The historical prior results in better accuracy
163
Case Study on DBLP
ƒ Tracking database community evolution
[Figure: the DBLP schema (1993) and the database community tracked at 1991, 1994, 1997, and 2000; each snapshot lists its top venues (e.g., VLDB, SIGMOD Conf., ICDE, CIKM, DEXA, TKDE) and representative terms, which drift from “object oriented / database systems” toward “query / web”; a related community gathers IR and mining venues (e.g., SIGIR, TREC, KDD, PKDD, SIGKDD Explor.) with terms such as retrieval, information, knowledge, mining, text]
164
Case Study on Delicious.com
[Figure: the Delicious schema, and the weekly event counts of three tracked communities (C1, C2, C3) over four weeks]
165
Outline
ƒ Motivation: Why Mining Heterogeneous Information Networks?
ƒ Part I: Clustering, Ranking and Classification
  ƒ Clustering and Ranking in Information Networks
  ƒ Classification of Information Networks
ƒ Part II: Data Quality and Search in Information Networks
  ƒ Data Cleaning and Data Validation by InfoNet Analysis
  ƒ Similarity Search in Information Networks
ƒ Part III: Advanced Topics on Information Network Analysis
  ƒ Role Discovery and OLAP in Information Networks
  ƒ Mining Evolution and Dynamics of Information Networks
ƒ Conclusions
166
Conclusions
ƒ Rich knowledge can be mined from information networks
ƒ What is the magic? Heterogeneous, structured information networks!
  ƒ Clustering, ranking and classification, integrated: RankClus, NetClus, GNetMine, …
  ƒ Data cleaning, validation, and similarity search
  ƒ Role discovery, OLAP, and evolutionary analysis
ƒ Knowledge is power, but knowledge is hidden in massive links!
ƒ Mining heterogeneous information networks: much more to be explored!
167
Future Research
ƒ From mining the current single‐star network schema to ranking, clustering, …, in multi‐star, multi‐relational databases
ƒ Mining information networks formed by structured data linked with unstructured data (text, multimedia, and the Web)
ƒ Mining cyber‐physical networks (networks formed by dynamic sensors and image/video cameras, together with information networks)
ƒ Enhancing the power of knowledge discovery by transforming massive unstructured data: incremental information extraction, role discovery, … ⇒ multi‐dimensional structured info‐nets
ƒ Mining noisy, uncertain, untrustworthy massive datasets by the information network analysis approach
ƒ Turning Wikipedia and/or the Web into structured or semi‐structured databases by heterogeneous information network analysis
168
References: Books on Network Analysis
ƒ A.‐L. Barabasi. Linked: How Everything Is Connected to Everything Else and What It Means. Plume, 2003.
ƒ M. Buchanan. Nexus: Small Worlds and the Groundbreaking Theory of Networks. W. W. Norton & Company, 2003.
ƒ D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons, 2007.
ƒ S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003.
ƒ A. Degenne and M. Forse. Introducing Social Networks. Sage Publications, 1999.
ƒ P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, 2005.
ƒ J. Davies, D. Fensel, and F. van Harmelen. Towards the Semantic Web: Ontology‐Driven Knowledge Management. John Wiley & Sons, 2003.
ƒ D. Fensel, W. Wahlster, H. Lieberman, and J. Hendler. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2002.
ƒ L. Getoor and B. Taskar (eds.). Introduction to Statistical Relational Learning. MIT Press, 2007.
ƒ B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
ƒ J. P. Scott. Social Network Analysis: A Handbook. Sage Publications, 2005.
ƒ D. J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton & Company, 2003.
ƒ D. J. Watts. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, 2003.
ƒ S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
169
References: Some Overview Papers
ƒ T. Berners‐Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
ƒ C. Cooper and A. Frieze. A general model of web graphs. Random Structures and Algorithms, 22, 2003.
ƒ S. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38, 2006.
ƒ T. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Structured machine learning: The next ten years. Machine Learning, 73, 2008.
ƒ S. Dumais and H. Chen. Hierarchical classification of web content. SIGIR'00.
ƒ S. Dzeroski. Multi‐relational data mining: An introduction. ACM SIGKDD Explorations, July 2003.
ƒ L. Getoor. Link mining: A new data mining challenge. SIGKDD Explorations, 5:84–89, 2003.
ƒ L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. ICML'01.
ƒ D. Jensen and J. Neville. Data mining in networks. In Papers of the Symp. Dynamic Social Network Modeling and Analysis, National Academy Press, 2002.
ƒ T. Washio and H. Motoda. State of the art of graph‐based data mining. SIGKDD Explorations, 5, 2003.
170
References: Some Influential Papers
ƒ A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks, 33, 2000.
ƒ S. Brin and L. Page. The anatomy of a large‐scale hypertextual web search engine. WWW'98.
ƒ S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32, 1999.
ƒ M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power‐law relationships of the internet topology. ACM SIGCOMM'99.
ƒ M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99, 2002.
ƒ B. A. Huberman and L. A. Adamic. Growth dynamics of the world‐wide web. Nature, 399:131, 1999.
ƒ G. Jeh and J. Widom. SimRank: A measure of structural‐context similarity. KDD'02.
ƒ J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. COCOON'99.
ƒ D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD'03.
ƒ J. M. Kleinberg. Small world phenomena and the dynamics of information. NIPS'01.
ƒ R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. FOCS'00.
ƒ M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45, 2003.
171
References: Clustering and Ranking (1)
ƒ E. Airoldi, D. Blei, S. Fienberg, and E. Xing, “Mixed Membership Stochastic Blockmodels”, JMLR’08
ƒ Liangliang Cao, Andrey Del Pozo, Xin Jin, Jiebo Luo, Jiawei Han, and Thomas S. Huang, “RankCompete: Simultaneous Ranking and Clustering of Web Photos”, WWW’10
ƒ G. Jeh and J. Widom, “SimRank: a measure of structural‐context similarity”, KDD’02
ƒ Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han, “Community Outliers and their Efficient Detection in Information Networks”, KDD’10
ƒ M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Physical Review E, 2004
ƒ M. E. J. Newman, “Fast algorithm for detecting community structure in networks”, Physical Review E, 2004
ƒ J. Shi and J. Malik, “Normalized cuts and image segmentation”, CVPR’97
ƒ Yizhou Sun, Yintao Yu, and Jiawei Han, “Ranking‐Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09
ƒ Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09
172
References: Clustering and Ranking (2)
ƒ Yizhou Sun, Jiawei Han, Jing Gao, and Yintao Yu, “iTopicModel: Information Network‐Integrated Topic Modeling”, ICDM’09
ƒ Xiaoxin Yin, Jiawei Han, and Philip S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB’06
ƒ Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang Zhai, Duo Zhang, and Bo Zhao, “iNextCube: Information Network‐Enhanced Text Cube”, VLDB’09 (demo)
ƒ A. Wu, M. Garland, and J. Han, “Mining scale‐free networks using geodesic clustering”, KDD’04
ƒ Z. Wu and R. Leahy, “An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation”, IEEE Trans. Pattern Anal. Mach. Intell., 1993
ƒ X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks”, KDD’07
ƒ X. Yin, J. Han, and P. S. Yu, “Cross‐relational clustering with user's guidance”, KDD’05
173
References: Network Classification (1)
ƒ A. Appice, M. Ceci, and D. Malerba, “Mining model trees: A multi‐relational approach”, ILP’03
ƒ Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, and Jiawei Han, “Bipartite Graph‐based Consensus Maximization among Supervised and Unsupervised Models”, NIPS’09
ƒ L. Getoor, N. Friedman, D. Koller, and B. Taskar, “Learning Probabilistic Models of Link Structure”, JMLR’02
ƒ L. Getoor, E. Segal, B. Taskar, and D. Koller, “Probabilistic Models of Text and Link Structure for Hypertext Classification”, IJCAI Workshop ‘Text Learning: Beyond Classification’, 2001
ƒ L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, “Learning Probabilistic Relational Models”, chapter in Relational Data Mining, eds. S. Dzeroski and N. Lavrac, 2001
ƒ M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph‐based classification on heterogeneous information networks”, ECMLPKDD’10
ƒ Q. Lu and L. Getoor, “Link‐based classification”, ICML’03
ƒ D. Liben‐Nowell and J. Kleinberg, “The link prediction problem for social networks”, CIKM’03
174
References: Network Classification (2)
ƒ J. Neville, B. Gallagher, and T. Eliassi‐Rad, “Evaluating statistical tests for within‐network classifiers of relational data”, ICDM’09
ƒ J. Neville, D. Jensen, L. Friedland, and M. Hay, “Learning relational probability trees”, KDD’03
ƒ Jennifer Neville and David Jensen, “Relational Dependency Networks”, JMLR’07
ƒ M. Szummer and T. Jaakkola, “Partially labeled classification with Markov random walks”, NIPS 14, 2001
ƒ M. J. Rattigan, M. Maier, and D. Jensen, “Graph clustering with network structure indices”, ICML’07
ƒ P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi‐Rad, “Collective classification in network data”, AI Magazine, 29, 2008
ƒ B. Taskar, E. Segal, and D. Koller, “Probabilistic classification and clustering in relational data”, IJCAI’01
ƒ B. Taskar, P. Abbeel, M. F. Wong, and D. Koller, “Relational Markov Networks”, chapter in L. Getoor and B. Taskar (eds.), Introduction to Statistical Relational Learning, 2007
ƒ X. Yin, J. Han, J. Yang, and P. S. Yu, “CrossMine: Efficient Classification across Multiple Database Relations”, ICDE’04
ƒ D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency”, NIPS 16, 2004
ƒ X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation”, Technical Report, 2002
175
References: Social Network Analysis
ƒ B. Aleman‐Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A. P. Sheth, I. B. Arpinar, A. Joshi, and T. Finin, “Semantic analytics on social networks: Experiences in addressing the problem of conflict of interest detection”, WWW’06
ƒ R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, “Mining newsgroups using networks arising from social behavior”, WWW’03
ƒ P. Boldi and S. Vigna, “The WebGraph framework I: Compression techniques”, WWW’04
ƒ D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community mining from multi‐relational networks”, PKDD’05
ƒ P. Domingos, “Mining social networks for viral marketing”, IEEE Intelligent Systems, 20, 2005
ƒ P. Domingos and M. Richardson, “Mining the network value of customers”, KDD’01
ƒ P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan, “Building structured web community portals: A top‐down, compositional, and incremental approach”, VLDB’07
ƒ G. Flake, S. Lawrence, C. L. Giles, and F. Coetzee, “Self‐organization and identification of web communities”, IEEE Computer, 35, 2002
ƒ J. Kubica, A. Moore, and J. Schneider, “Tractable group detection on large link data sets”, ICDM’03
176
References: Data Quality & Search in Networks
ƒ I. Bhattacharya and L. Getoor, “Iterative record linkage for cleaning and integration”, Proc. SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD’04)
ƒ Xin Luna Dong, Laure Berti‐Equille, and Divesh Srivastava, “Integrating conflicting data: The role of source dependence”, PVLDB, 2(1):550–561, 2009
ƒ Xin Luna Dong, Laure Berti‐Equille, and Divesh Srivastava, “Truth discovery and copying detection in a dynamic world”, PVLDB, 2(1):562–573, 2009
ƒ H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis, “Two supervised learning approaches for name disambiguation in author citations”, ICDL’04
ƒ Y. Sun, J. Han, T. Wu, X. Yan, and Philip S. Yu, “PathSim: Path Schema‐Based Top‐K Similarity Search in Heterogeneous Information Networks”, Technical Report, CS, UIUC, July 2010
ƒ X. Yin, J. Han, and P. S. Yu, “Object Distinction: Distinguishing Objects with Identical Names by Link Analysis”, ICDE’07
ƒ X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6):796–808, 2008
ƒ P. Zhao and J. Han, “On Graph Query Optimization in Large Networks”, VLDB’10
177
References: Role Discovery, Summarization and OLAP
ƒ D. Archambault, T. Munzner, and D. Auber, “TopoLayout: Multilevel graph layout by topological features”, IEEE Trans. Vis. Comput. Graph., 2007
ƒ Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, “Graph OLAP: Towards Online Analytical Processing on Graphs”, ICDM 2008
ƒ Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, “Graph OLAP: A Multi‐Dimensional Framework for Graph Data Analysis”, KAIS 2009
ƒ Xin Jin, Jiebo Luo, Jie Yu, Gang Wang, Dhiraj Joshi, and Jiawei Han, “iRIN: Image Retrieval in Image Rich Information Networks”, WWW’10 (demo paper)
ƒ Lu Liu, Feida Zhu, Chen Chen, Xifeng Yan, Jiawei Han, Philip Yu, and Shiqiang Yang, “Mining Diversity on Networks”, DASFAA’10
ƒ Y. Tian, R. A. Hankins, and J. M. Patel, “Efficient aggregation for graph summarization”, SIGMOD’08
ƒ Chi Wang, Jiawei Han, Yuntao Jia, Jie Tang, Duo Zhang, Yintao Yu, and Jingyi Guo, “Mining Advisor‐Advisee Relationships from Research Publication Networks”, KDD’10
ƒ Zhijun Yin, Manish Gupta, Tim Weninger, and Jiawei Han, “LINKREC: A Unified Framework for Link Recommendation with User Attributes and Graph Structure”, WWW’10
178
References: Network Evolution
ƒ L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Group formation in large social networks: Membership, growth, and evolution”, KDD’06
ƒ M.‐S. Kim and J. Han, “A particle‐and‐density based evolutionary clustering method for dynamic networks”, VLDB’09
ƒ J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: Densification laws, shrinking diameters and possible explanations”, KDD’05
ƒ Yizhou Sun, Jie Tang, Jiawei Han, Manish Gupta, and Bo Zhao, “Community Evolution Detection in Dynamic Heterogeneous Information Networks”, KDD‐MLG’10
ƒ
179