Transcript
On Community Outliers and their Efficient Detection in Information Networks
Jing Gao¹, Feng Liang¹, Wei Fan², Chi Wang¹, Yizhou Sun¹, Jiawei Han¹
¹University of Illinois, ²IBM T.J. Watson
Presented by Debapriya Basu

• Determine outliers in information networks
• Compare various algorithms that do the same

• E.g. the Internet, social networking sites
• Nodes: characterized by feature values
• Links: represent relations between nodes

• Outliers: anomalies, novelties
• Different kinds of outliers:
◦ Global
◦ Contextual

Global Outlier
[Figure: a small network of nodes v1 to v10, each annotated with a salary (in $1000); the plotted values range from 10 to 160.]
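A global outlier of this kind can be spotted from the feature values alone, without looking at any links. Below is a minimal sketch in Python; the salary values and the 2-sigma cutoff are illustrative assumptions, not figures taken from the slide.

```python
import numpy as np

# Hypothetical salaries (in $1000) for ten nodes; one value is far from the rest.
salaries = np.array([28, 30, 35, 40, 45, 110, 115, 120, 125, 300], dtype=float)

# A global outlier is judged against all values at once, ignoring network structure.
z_scores = np.abs(salaries - salaries.mean()) / salaries.std()
print(np.where(z_scores > 2.0)[0])  # -> [9]: the 300k salary is a global outlier
```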

• Unified model considering both nodes and links
• Community discovery and outlier detection are related processes

• Treat each object as a multivariate data point
• Use K components to describe normal community behavior and one component to denote outliers
• Introduce a hidden variable zi at each object indicating its community
• Treat the network information as a graph
• Model the graph as a Hidden Markov Random Field over the zi
• Find a local minimum of the posterior energy of the model (equivalently, a local maximum of the posterior probability)

[Diagram: graphical view of the model, showing the community label Z (with an extra label for outliers), the node features X, the link structure W, K (the number of communities), and the model parameters, illustrated by a high-income community (mean 116k, std 35k) and a low-income community (mean 20k, std 12k).]

Symbol: Definition
I = {1, 2, ..., i, ..., M}: indices of the objects
V = {v1, v2, ..., vM}: set of objects
S = {s1, s2, ..., sM}: given attributes of the objects
W = {wij} (M x M): adjacency matrix containing the weights of the links
Z = {z1, ..., zM}: random variables for the hidden labels of the objects
X = {x1, ..., xM}: random variables for the observed data
Ni (i ∈ I): neighborhood of object vi
1, ..., k, ..., K: indices of the normal communities
Θ = {Θ1, Θ2, ..., ΘK}: random variables for the model parameters

◦ The random variables X are conditionally independent given their labels:
  P(X=S|Z) = Πi P(xi=si|zi)
◦ The k-th normal community is characterized by a set of parameters Θk:
  P(xi=si|zi=k) = P(xi=si|Θk)
◦ Outliers are characterized by a uniform distribution:
  P(xi=si|zi=0) = ρ0
◦ A Markov random field is defined over the hidden variables Z:
  P(zi|zI-{i}) = P(zi|zNi)
◦ The equivalent Gibbs distribution is P(Z) = (1/H1) exp(-U(Z)), where H1 is a normalizing constant and U(Z) is the sum of the clique potentials.
◦ The goal is to find the configuration of Z that maximizes P(X=S|Z) P(Z) for a given Θ.
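Putting the pieces together, the quantity being maximized is log P(X=S|Z) + log P(Z), up to the constant log H1. Below is a minimal Python sketch under simplifying assumptions: Gaussian community models, a Potts-style pairwise clique potential, and parameter names (rho0, lam) that are not from the slides.

```python
import numpy as np
from scipy.stats import norm

def log_score(s, z, means, stds, W, rho0=1e-3, lam=1.0):
    """log P(X=S|Z) + log P(Z), dropping the constant log H1.

    s: node attributes; z: labels (0 = outlier, 1..K = community);
    means/stds: Gaussian parameters of community k stored at index k-1;
    W: symmetric matrix of link weights; lam: confidence in the links.
    """
    # Conditional independence: log P(X=S|Z) = sum_i log P(xi=si|zi)
    log_px = 0.0
    for si, zi in zip(s, z):
        if zi == 0:
            log_px += np.log(rho0)                              # outliers: uniform density rho0
        else:
            log_px += norm.logpdf(si, means[zi - 1], stds[zi - 1])

    # Gibbs prior: -U(Z), rewarding linked nodes that share a normal community label.
    log_pz = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            if W[i][j] > 0 and z[i] == z[j] != 0:
                log_pz += lam * W[i][j]
    return log_px + log_pz
```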

• Continuous data
◦ Modeled as a Gaussian distribution
◦ Model parameters: mean, standard deviation
• Text data
◦ Modeled as a multinomial distribution
◦ Model parameters: probability of each word appearing in a community
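A minimal sketch of the two per-community likelihoods; the function names, the bag-of-words representation for text, and dropping the multinomial coefficient are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def gaussian_loglik(value, mean, std):
    # Continuous attribute: community k is modeled as a Gaussian N(mean_k, std_k^2).
    return norm.logpdf(value, mean, std)

def multinomial_loglik(word_counts, word_probs):
    # Text attribute: bag-of-words counts scored under the community's word distribution.
    # The multinomial coefficient is dropped since it is the same for every community.
    word_counts = np.asarray(word_counts, dtype=float)
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.sum(word_counts * np.log(word_probs)))
```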

Θ: model parameters, Z: community labels
• Initialize Z
• Parameter estimation: given Z, find the Θ that maximizes P(X|Z)
• Inference: given Θ, find the Z that maximizes P(Z|X)
• The two steps alternate (a sketch of this loop follows below)
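A minimal sketch of that alternation in Python. The helper names estimate_parameters and infer_labels, their signatures, and the convergence test are assumptions; they stand in for the two steps detailed below.

```python
import numpy as np

def fit(s, W, K, estimate_parameters, infer_labels, n_iters=20, seed=0):
    """Alternate parameter estimation and inference until the labeling stabilizes.

    estimate_parameters(s, z, K) -> theta   # given Z, find Theta maximizing P(X|Z)
    infer_labels(s, W, theta, K) -> z       # given Theta, find Z maximizing P(Z|X)
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(1, K + 1, size=len(s))   # initialize Z with random community labels
    theta = None
    for _ in range(n_iters):
        theta = estimate_parameters(s, z, K)  # parameter estimation step
        z_new = infer_labels(s, W, theta, K)  # inference step
        if np.array_equal(z_new, z):          # labels unchanged: stop
            break
        z = z_new
    return z, theta
```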

• Calculate the model parameters
◦ By maximum likelihood estimation
• Continuous data
◦ Mean: sample mean of the community
◦ Standard deviation: square root of the sample variance of the community
• Text data
◦ Probability of a word appearing in the community: its empirical probability
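A minimal sketch of these estimates, assuming labels use 0 for outliers and 1..K for communities; the fallback for an empty community, the small variance jitter, and the add-one smoothing are assumptions.

```python
import numpy as np

def estimate_parameters(s, z, K):
    """Maximum likelihood estimates of each community's Gaussian parameters."""
    s, z = np.asarray(s, dtype=float), np.asarray(z)
    means, stds = [], []
    for k in range(1, K + 1):
        members = s[z == k]
        if members.size == 0:               # empty community: fall back to all data
            members = s
        means.append(members.mean())        # sample mean of the community
        stds.append(members.std() + 1e-6)   # sqrt of the sample variance (plus jitter)
    return np.array(means), np.array(stds)

def estimate_word_probs(counts, z, K):
    """Empirical per-community word probabilities for text data (add-one smoothing)."""
    counts, z = np.asarray(counts, dtype=float), np.asarray(z)  # counts: (n_nodes, vocab)
    probs = np.ones((K, counts.shape[1]))
    for k in range(1, K + 1):
        probs[k - 1] += counts[z == k].sum(axis=0)
    return probs / probs.sum(axis=1, keepdims=True)
```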

• Calculate the zi values
◦ Given the model parameters, iteratively update the community labels of the nodes at each time step
◦ Select the label that maximizes P(Z|X, ZN)
• Calculate the P(Z|X, ZN) values
◦ Use both the node features and the community labels of the neighbors when zi indicates a normal community
◦ If the probability of the node belonging to any community is low enough, label it as an outlier
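A minimal ICM-style sketch of this update. The parameters lam and a0 play the roles of the confidence and threshold hyperparameters mentioned on the practical-issues slide; their default values, the all-ones starting labels, and the exact scoring form are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

def infer_labels(s, W, theta, K, lam=1.0, a0=-10.0, n_sweeps=5):
    """Sweep the nodes, giving each the label that maximizes its local score."""
    means, stds = theta
    s = np.asarray(s, dtype=float)
    n = len(s)
    z = np.ones(n, dtype=int)          # start with every node in community 1
    for _ in range(n_sweeps):
        for i in range(n):
            # Score each normal community: data likelihood plus neighbor agreement.
            scores = np.array([
                norm.logpdf(s[i], means[k - 1], stds[k - 1])
                + lam * sum(W[i][j] for j in range(n) if z[j] == k)
                for k in range(1, K + 1)
            ])
            best = int(np.argmax(scores))
            # If no community explains the node well enough, mark it as an outlier.
            z[i] = 0 if scores[best] < a0 else best + 1
    return z
```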

• Setting hyperparameters
◦ a0: outlier threshold
◦ λ: confidence in the network (link) structure
◦ K: number of communities
• Initialization
◦ A poor start may group outliers into clusters of their own; this is eventually corrected over the iterations

• Data generation
◦ Generate continuous data from Gaussian distributions and generate labels according to the model (a sketch follows after this list)
◦ Define r: percentage of outliers, K: number of communities
• Baseline models
◦ GLODA: global outlier detection (based on node features only)
◦ DNODA: local outlier detection (check the feature values of direct neighbors)
◦ CNA: partition the data into communities based on links, then conduct outlier detection within each community
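A minimal sketch of such a generator. The community means, spreads, and the uniform range for outliers are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def generate_synthetic(n=1000, K=5, r=0.01, seed=0):
    """Draw data from K Gaussian communities plus a fraction r of uniform outliers."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, K + 1, size=n)       # ground-truth community of each node
    means = rng.uniform(0, 100, size=K)           # one mean per community
    x = rng.normal(means[labels - 1], 5.0)        # community members cluster around the mean
    n_out = int(r * n)                            # r: percentage of outliers
    out_idx = rng.choice(n, size=n_out, replace=False)
    labels[out_idx] = 0                           # label 0 marks an outlier
    x[out_idx] = rng.uniform(0, 100, size=n_out)  # outliers scattered uniformly
    return x, labels
```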

[Chart: comparison of GLODA, DNODA, CNA, and CODA on the synthetic data (scores between 0 and 0.8) under the settings r=1% K=5, r=5% K=5, r=1% K=8, and r=5% K=8.]

• Communities
◦ Data mining, artificial intelligence, database, information analysis
• Sub-network of conferences
◦ Links: percentage of common authors between two conferences (see the sketch below)
◦ Node features: titles of the publications in the conference
• Sub-network of authors
◦ Links: co-authorship relationships
◦ Node features: titles of the publications by an author
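A minimal sketch of how the conference link weights could be computed; the toy author sets are invented, and the overlap is measured Jaccard-style, which is one reasonable reading of "percentage of common authors".

```python
# Toy author sets per conference (invented for illustration).
conf_authors = {
    "KDD":  {"a1", "a2", "a3", "a4"},
    "ICDM": {"a2", "a3", "a5"},
    "CVPR": {"a6", "a7"},
}

def link_weight(c1, c2):
    """Fraction of authors shared by two conferences (Jaccard-style overlap)."""
    shared = conf_authors[c1] & conf_authors[c2]
    return len(shared) / len(conf_authors[c1] | conf_authors[c2])

print(link_weight("KDD", "ICDM"))  # 0.4
print(link_weight("KDD", "CVPR"))  # 0.0
```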

Community outliers detected: CVPR, CIKM

• Community outliers
• Community outlier detection

QUESTIONS

References
• Gao, Liang, Fan, Wang, Sun, Han. On Community Outliers and their Efficient Detection in Information Networks.
• Irad Ben-Gal. Outlier Detection.
• Last, Kandel. Automated Detection of Outliers in Real-World Data.
• Aggarwal, Yu. Outlier Detection for High Dimensional Data.