Download community outlier

Jing Gao¹, Feng Liang¹, Wei Fan², Chi Wang¹, Yizhou Sun¹, Jiawei Han¹
¹University of Illinois, ²IBM TJ Watson
Debapriya Basu

• Determine outliers in information networks
• Compare various algorithms that do the same

• E.g., the Internet, social networking sites
• Nodes: characterized by feature values
• Links: represent the relations between nodes

• Outliers: anomalies, novelties
• Different kinds of outliers
  ◦ Global
  ◦ Contextual

[Figure: network of nodes V1 to V10 labeled with salary (in $1000), with values 10, 30, 40, 70, 100, 110, 140, 160; one node is marked as the global outlier]

• Unified model considering both nodes and links
• Community discovery and outlier detection are related processes

• Treat each object as a multivariate data point
• Use K components to describe normal community behavior and one component to denote outliers
• Induce a hidden variable z_i at each object indicating its community
• Treat network information as a graph
• Model the graph as a Hidden Markov Random Field on z_i
• Find the local minimum of the posterior probability potential energy of the model
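The idea of K normal components plus one outlier component can be sketched for a single 1-D feature as follows. This is only an illustration, not the authors' implementation: it ignores the links entirely, and the function names, the two (mean, std) pairs, and the constant `outlier_density` are hypothetical values chosen for the example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def label_without_links(x, components, outlier_density):
    """Return the 1-based index of the best-fitting component, or 0 for outlier.

    components: list of (mean, std) pairs, one per normal community.
    outlier_density: constant density assumed for the outlier component.
    """
    best_label, best_density = 0, outlier_density
    for k, (mean, std) in enumerate(components, start=1):
        density = gaussian_pdf(x, mean, std)
        if density > best_density:
            best_label, best_density = k, density
    return best_label


# Hypothetical parameters (salary in $1000) for two normal communities.
components = [(20.0, 12.0), (116.0, 35.0)]
outlier_density = 1e-4  # assumed constant density of the outlier component

for salary in (10, 30, 110, 400):
    print(salary, label_without_links(salary, components, outlier_density))
```

Because the links are ignored, a rule like this can only flag values that are unusual with respect to every component. Catching a node whose value is ordinary overall but unusual for the community it is linked to requires the random field on z_i.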
[Figure: graphical model relating the community label Z (which includes the outlier label), the node features X, and the link structure W; K is the number of communities. Example model parameters: high-income community, mean 116k, std 35k; low-income community, mean 20k, std 12k]

Symbol                         Definition
I = {1, 2, ..., i, ..., M}     Indices of the objects
V = {v_1, v_2, ..., v_M}       Set of objects
S = {s_1, s_2, ..., s_M}       Given attributes of objects
W_{M×M} = {w_ij}               Adjacency matrix containing the weights of the links
Z = {z_1, ..., z_M}            RVs for hidden labels of objects
X = {x_1, ..., x_M}            RVs for observed data
N_i                            Neighborhood of object v_i (i ∈ I)
1, ..., k, ..., K              Indices of normal communities
Θ = {θ_1, θ_2, ..., θ_K}       RVs for model parameters

◦ The RVs in X are conditionally independent given their labels
    P(X = S | Z) = Π_i P(x_i = s_i | z_i)
◦ The k-th normal community is characterized by a set of parameters θ_k
    P(x_i = s_i | z_i = k) = P(x_i = s_i | θ_k)
◦ Outliers are characterized by a uniform distribution
    P(x_i = s_i | z_i = 0) = ρ_0
◦ A Markov random field is defined over the hidden variables Z
    P(z_i | z_{I−{i}}) = P(z_i | z_{N_i})
◦ The equivalent Gibbs distribution is
    P(Z) = (1/H_1) exp(−U(Z)), where H_1 is a normalizing constant and U(Z) is the sum of clique potentials
◦ The goal is to find the configuration of Z that maximizes P(X = S | Z) P(Z) for a given Θ

• Continuous data
  ◦ Modeled as a Gaussian distribution
  ◦ Model parameters: mean, standard deviation
• Text data
  ◦ Modeled as a multinomial distribution
  ◦ Model parameters: probability of a word appearing in a community

[Diagram: initialize Z, then alternate between the two steps below. Θ: model parameters; Z: community labels]
• Parameter estimation: given Z, find Θ that maximizes P(X | Z)
• Inference: given Θ, find Z that maximizes P(Z | X)

• Calculate model parameters
  ◦ Maximum likelihood estimation
• Continuous
  ◦ Mean: sample mean of the community
  ◦ Standard deviation: square root of the sample variance of the community
• Text
  ◦ Probability of a word appearing in the community: empirical probability

• Calculate z_i values
  ◦ Given the model parameters, iteratively update the community labels of the nodes at each timestep
  ◦ Select the label that maximizes P(Z | X, Z_N)
• Calculate P(Z | X, Z_N) values
  ◦ Use both the node features and the community labels of the neighbors if Z indicates a normal community
  ◦ If the probability of a node belonging to any community is low enough, label it as an outlier

• Setting hyperparameters
  ◦ a_0 = threshold
  ◦ Λ = confidence in the network
  ◦ K = number of communities
• Initialization
  ◦ Outliers are initially grouped into clusters.
  ◦ This is eventually corrected by the later updates.
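The alternation between parameter estimation and inference can be sketched for one continuous feature as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the energy form (negative Gaussian log-likelihood minus `lam` times the weight of same-label neighbors, with a flat cost `a0` for the outlier label), all function names, and the seven-node toy network are assumptions made for illustration.

```python
import math


def log_gaussian(x, mean, std):
    """Log-density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def estimate_parameters(x, z, K, min_std=1e-3):
    """Per community: sample mean and square root of the sample variance."""
    params = []
    for k in range(1, K + 1):
        members = [x[i] for i in range(len(x)) if z[i] == k]
        if not members:
            params.append(None)  # empty community: nothing to estimate
            continue
        mean = sum(members) / len(members)
        var = sum((v - mean) ** 2 for v in members) / len(members)
        params.append((mean, max(math.sqrt(var), min_std)))
    return params


def update_labels(x, z, w, params, a0, lam):
    """One sweep: each node takes the label with the lowest assumed energy."""
    z = list(z)
    for i in range(len(x)):
        best_label, best_energy = 0, a0  # label 0 (outlier) costs the threshold a0
        for k, p in enumerate(params, start=1):
            if p is None:
                continue
            # Linked nodes that currently carry label k lower the energy of k.
            support = sum(w[i][j] for j in range(len(x)) if j != i and z[j] == k)
            energy = -log_gaussian(x[i], p[0], p[1]) - lam * support
            if energy < best_energy:
                best_label, best_energy = k, energy
        z[i] = best_label
    return z


def fit(x, w, z_init, K, a0, lam, max_iter=20):
    """Alternate parameter estimation and label updates until labels stop changing."""
    z = list(z_init)
    params = estimate_parameters(x, z, K)
    for _ in range(max_iter):
        params = estimate_parameters(x, z, K)
        new_z = update_labels(x, z, w, params, a0, lam)
        if new_z == z:
            break
        z = new_z
    return z, params


# Hypothetical toy network: salaries in $1000; node 6 earns 25 but links to nodes 3, 4, 5.
x = [10, 20, 30, 100, 110, 120, 25]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (6, 3), (6, 4), (6, 5)]
w = [[0.0] * len(x) for _ in x]
for i, j in edges:
    w[i][j] = w[j][i] = 1.0

z_init = [1, 1, 1, 2, 2, 2, 2]  # node 6 starts inside the cluster it is linked to
z_final, params_final = fit(x, w, z_init, K=2, a0=2.5, lam=1.0)
print(z_final)
```

With these made-up values, node 6 ends up with label 0: its salary fits the low-income component but it has no links there, and it is linked to the high-income community whose feature values it does not match. The outcome is sensitive to `a0` and `lam`, which is why the slides list them as hyperparameters to set.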
• Data generation
  ◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
  ◦ Define r: percentage of outliers, K: number of communities
• Baseline models
  ◦ GLODA: global outlier detection (based on node features only)
  ◦ DNODA: local outlier detection (check the feature values of direct neighbors)
  ◦ CNA: partition data into communities based on links and then conduct outlier detection in each community

[Figure: bar chart comparing GLODA, DNODA, CNA, and CODA under four settings (r = 1%, K = 5; r = 5%, K = 5; r = 1%, K = 8; r = 5%, K = 8); vertical axis from 0 to 0.8]

• Communities
  ◦ Data mining, artificial intelligence, database, information analysis
• Subnetwork of conferences
  ◦ Links: percentage of common authors between two conferences
  ◦ Node features: publication titles in the conference
• Subnetwork of authors
  ◦ Links: co-authorship relationship
  ◦ Node features: titles of publications by an author

Community outliers: CVPR, CIKM

• Community Outliers
• Community Outlier Detection
QUESTIONS

• Gao, Liang, Fan, Wang, Sun, Han. On Community Outliers and their Efficient Detection in Information Networks.
• Irad Ben-Gal. Outlier detection.
• Last, Kandel. Automated detection of outliers in real-world data.
• Aggarwal, Yu. Outlier Detection for High Dimensional Data.
