Jing Gao¹, Feng Liang¹, Wei Fan², Chi Wang¹, Yizhou Sun¹, Jiawei Han¹
¹University of Illinois, ²IBM TJ Watson
Debapriya Basu

Determine outliers in information networks
Compare various algorithms that do the same

E.g. the Internet, social networking sites
Nodes: characterized by feature values
Links: represent relations between nodes

Outliers: anomalies, novelties
Different kinds of outliers
◦ Global
◦ Contextual

[Figure: global outlier example; nodes V1 to V10 placed along a salary axis (in $1000) with the values 10, 30, 40, 70, 100, 110, 140, 160]

Unified model considering both nodes and links
Community discovery and outlier detection are related processes

Treat each object as a multivariate data point
Use K components to describe normal community behavior and one component to denote outliers
Induce a hidden variable z_i at each object indicating its community
Treat network information as a graph
Model the graph as a Hidden Markov Random Field on z_i
Find the local minimum of the posterior potential energy of the model
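The last step can be spelled out as a short derivation. This is a sketch, assuming the two properties the deck states for the model: the likelihood factorizes over objects, and the prior on Z is a Gibbs distribution with normalizing constant H_1 and clique-potential sum U(Z). The name E(Z) for the posterior energy is introduced here only for illustration.

```latex
\begin{align*}
\hat{Z} &= \arg\max_{Z}\; P(X = S \mid Z)\, P(Z) \\
        &= \arg\max_{Z}\; \Big[\prod_{i \in I} P(x_i = s_i \mid z_i)\Big]\,
           \frac{1}{H_1} \exp\big(-U(Z)\big) \\
        &= \arg\min_{Z}\; \underbrace{-\sum_{i \in I} \log P(x_i = s_i \mid z_i) + U(Z)}_{E(Z)}
\end{align*}
```

Taking the negative logarithm turns the product into a sum and drops the constant H_1, so maximizing the posterior and minimizing the energy E(Z) select the same labeling.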
[Figure: example network annotated with the model components: outlier, community label Z, node feature X, link structure W, K: number of communities; model parameters: high-income (mean 116k, std 35k), low-income (mean 20k, std 12k)]

Symbol                        Definition
I = {1, 2, ..., i, ..., M}    Indices of the objects
V = {v_1, v_2, ..., v_M}      Set of objects
S = {s_1, s_2, ..., s_M}      Given attributes of the objects
W (M x M) = {w_ij}            Adjacency matrix containing the weights of the links
Z = {z_1, ..., z_M}           Random variables for the hidden labels of the objects
X = {x_1, ..., x_M}           Random variables for the observed data
N_i                           Neighborhood of object v_i (i ∈ I)
1, ..., k, ..., K             Indices of the normal communities
Θ = {Θ_1, Θ_2, ..., Θ_K}      Random variables for the model parameters

◦ The random variables X are conditionally independent given their labels
  P(X = S | Z) = Π_{i ∈ I} P(x_i = s_i | z_i)
◦ The k-th normal community is characterized by a set of parameters Θ_k
  P(x_i = s_i | z_i = k) = P(x_i = s_i | Θ_k)
◦ Outliers are characterized by a uniform distribution
  P(x_i = s_i | z_i = 0) = ρ_0
◦ A Markov random field is defined over the hidden variables Z
  P(z_i | z_{I−{i}}) = P(z_i | z_{N_i})
◦ The equivalent Gibbs distribution is
  P(Z) = (1/H_1) exp(−U(Z)), where H_1 is a normalizing constant and U(Z) is the sum of clique potentials
◦ The goal is to find the configuration of Z that maximizes P(X = S | Z) P(Z) for a given Θ

Continuous data
◦ Modeled as a Gaussian distribution
◦ Model parameters: mean, standard deviation
Text data
◦ Modeled as a multinomial distribution
◦ Model parameters: probability of a word appearing in a community

Initialize Z (Θ: model parameters, Z: community labels), then alternate:
◦ Parameter estimation: given Z, find the Θ that maximizes P(X | Z)
◦ Inference: given Θ, find the Z that maximizes P(Z | X)

Calculate model parameters
◦ Maximum likelihood estimation
Continuous
◦ Mean: sample mean of the community
◦ Standard deviation: square root of the sample variance of the community
Text
◦ Probability of a word appearing in the community: empirical probability

Calculate z_i values
◦ Given the model parameters, iteratively update the community labels of the nodes at each time step
◦ Select the label that maximizes P(z_i | x_i, z_{N_i})
Calculate P(z_i | x_i, z_{N_i}) values
◦ Depends on both the node features and the community labels of the neighbors if z_i indicates a normal community
◦ If the probability of a node belonging to any community is low enough, label it as an outlier

Setting hyperparameters
◦ a0 = threshold
◦ Λ = confidence in the network
◦ K = number of communities
Initialization
◦ Outliers are initially grouped into clusters
◦ This assignment is eventually corrected by later iterations
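The alternation between parameter estimation and inference can be sketched in a few lines of Python for one-dimensional continuous features. This is a minimal illustration, not the authors' implementation. Several parts are assumptions that the slides do not specify: the per-node energy (Gaussian negative log-likelihood minus `lam` times the link weight to neighbors currently in the community), the use of `a0` as an energy threshold for the outlier label 0, the initialization by sorted-value chunks, and all function names.

```python
import numpy as np

LOG_SQRT_2PI = 0.5 * np.log(2.0 * np.pi)


def estimate_params(x, z, K):
    """Parameter estimation: maximum-likelihood mean and std per community."""
    # Fallback for an empty community (assumption): use the global statistics.
    mu = np.full(K, x.mean())
    sigma = np.full(K, max(x.std(), 1e-6))
    for k in range(1, K + 1):
        members = x[z == k]
        if members.size > 0:
            mu[k - 1] = members.mean()
            sigma[k - 1] = max(members.std(), 1e-6)  # guard against zero std
    return mu, sigma


def update_labels(x, W, z, mu, sigma, lam, a0):
    """Inference: one sequential sweep that relabels each node in turn."""
    z = z.copy()
    K = mu.size
    for i in range(x.size):
        # Feature term: Gaussian negative log-likelihood under each community.
        nll = 0.5 * ((x[i] - mu) / sigma) ** 2 + np.log(sigma) + LOG_SQRT_2PI
        # Link term (assumed form): weight of links to neighbors labeled k.
        support = np.array([W[i, z == k].sum() for k in range(1, K + 1)])
        energy = nll - lam * support
        best = int(np.argmin(energy))
        # Label 0 marks an outlier: no community explains the node well enough.
        z[i] = best + 1 if energy[best] < a0 else 0
    return z


def detect_community_outliers(x, W, K, lam, a0, max_iter=50):
    """Alternate estimation and inference until the labels stop changing."""
    # Initialization (assumption): split nodes into K chunks by sorted value,
    # so any outliers start inside some community and are corrected later.
    order = np.argsort(x, kind="stable")
    z = np.zeros(x.size, dtype=int)
    for k, chunk in enumerate(np.array_split(order, K), start=1):
        z[chunk] = k
    for _ in range(max_iter):
        mu, sigma = estimate_params(x, z, K)
        new_z = update_labels(x, W, z, mu, sigma, lam, a0)
        if np.array_equal(new_z, z):
            break
        z = new_z
    return z


# Hypothetical toy network: salaries in $1000, two cliques, and one node
# (index 10, salary 70) that is linked only to the low-income clique.
salary = np.array([10, 20, 30, 20, 30, 100, 110, 120, 110, 120, 70], dtype=float)
W = np.zeros((11, 11))
W[:5, :5] = 1.0
W[5:10, 5:10] = 1.0
W[10, :5] = W[:5, 10] = 1.0
np.fill_diagonal(W, 0.0)

labels = detect_community_outliers(salary, W, K=2, lam=0.5, a0=3.0)
print(labels)
```

In this toy setup the salary-70 node receives label 0: its value is unremarkable globally, but it deviates from the low-income community it is linked to. The values of `lam` and `a0` were picked by hand for this example, and other settings can change the outcome.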
Data generation
◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
◦ Define r: percentage of outliers, K: number of communities
Baseline models
◦ GLODA: global outlier detection (based on node features only)
◦ DNODA: local outlier detection (checks the feature values of direct neighbors)
◦ CNA: partition the data into communities based on links, then conduct outlier detection in each community

[Figure: bar chart comparing GLODA, DNODA, CNA, and CODA for r = 1%, K = 5; r = 5%, K = 5; r = 1%, K = 8; r = 5%, K = 8; vertical axis from 0 to 0.8]

Communities
◦ Data mining, artificial intelligence, database, information analysis
Sub-network of conferences
◦ Links: percentage of common authors between two conferences
◦ Node features: publication titles in the conference
Sub-network of authors
◦ Links: co-authorship relationships
◦ Node features: titles of publications by an author

Community outliers: CVPR, CIKM

Community Outliers
Community Outlier Detection
Questions

◦ Gao, Liang, Fan, Wang, Sun, Han. On Community Outliers and their Efficient Detection in Information Networks.
◦ Irad Ben-Gal. Outlier Detection.
◦ Last, Kandel. Automated Detection of Outliers in Real-World Data.
◦ Aggarwal, Yu. Outlier Detection for High Dimensional Data.