Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining: Concepts and Techniques — Chapter 9 — 9.2. Social Network Analysis Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen 5/9/2017 1 5/9/2017 2 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 5/9/2017 3 Social Networks Social network: A social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies like friendship, kinship, like, ... Graphical representation Nodes = members Edges = relationships Examples of typical social networks on the Web Social bookmarking (Del.icio.us) Friendship networks (Facebook, Myspace, LinkedIn) Blogosphere Media Sharing (Flickr, Youtube) Folksonomies Web 2.0 Examples Blogs Blogspot Wordpress Wikis Wikipedia Wikiversity Social Networking Sites Facebook Myspace Orkut Digital media sharing websites Youtube Flickr Social Tagging Del.icio.us Others Twitter Yelp Adapted from H. Liu & N. Agarwal, KDD’08 tutorial Society Nodes: individuals Links: social relationship (family/work/friendship/etc.) S. Milgram (1967) John Guare Six Degrees of Separation Social networks: Many individuals with diverse social interactions between them 7 Communication Networks The Earth is developing an electronic nervous system, a network with diverse nodes and links are -computers -phone lines -routers -TV cables -satellites -EM waves Communication networks: Many non-identical components with diverse connections between them 8 Complex systems Made of many non-identical elements connected by diverse interactions. NETWORK 9 “Natural” Networks and Universality Consider many kinds of networks: social, technological, business, economic, content,… These networks tend to share certain informal properties: large scale; continual growth distributed, organic growth: vertices “decide” who to link to interaction restricted to links mixture of local and long-distance connections abstract notions of distance: geographical, content, social,… Social network theory and link analysis Do natural networks share more quantitative universals? What would these “universals” be? How can we make them precise and measure them? How can we explain their universality? 10 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 11 Networks and Their Representations A network (or a graph): G = (V, E), where V: vertices (or nodes), and E: edges (or links) Adjacency matrix: Aij = 1 if there is an edge between vertices i and j; 0 otherwise Weighted networks: Self-edge (self-loop): if an edge connects vertex to itself Simple network/graph if a network has neither self-edges nor multiedges Multi-edge: if more than one edge between the same pair of vertices Edges having weight (strength), usually a real number Directed network (directed graph): if each edge has a direction Aij = 1 if there is an edge from i to j; 0 otherwise Cocitation and Bibliographic Coupling Cocitation of vertices i and j: # of vertices having outgoing edges pointing to both i and j Cocitation of i and j: Cociation matrix: It is a symmetric matrix Diagonal matrix (Cii): total # papers citing i Bibliographic coupling of vertices i and j: # of other vertices to which both point Vertices i and j are co-cited by 3 papers i Bibliographic coupling of i and j: Cociation matrix: Diagonal matrix (Bii): total # papers cited by i Vertices i and j cite 3 same papers Cocitation & Bibliographic Coupling: Comparison Two measures are affected by the number of incoming and outgoing edges that vertices have For strong cocitation: must have a lot of incoming edges Must be well-cited (influential) papers, surveys, or books Takes time to accumulate citations Strong bib-coupling if two papers have similar citations A more uniform indicator of similarity between papers Can be computed as soon as a paper is published Not change over time Recent analysis algorithms HITS explores both cocitation and bibliographic coupling Bipartite Networks Bipartite Network: two kinds of vertices, and edges linking only vertices of unlike types Incidence matrix: Bij = 1 if vertex j links to group i 0 otherwise One can create a one-mode project from the two-mode partite form (but with info loss) The projection to one-mode can be written in terms of the incidence matrix B is follows Tom SIGMOD Mary VLDB Alice EDBT Bob KDD Cindy ICDM Tracy SDM Jack AAAI ICML Mike Lucy Jim Degree and Network Density Degree of a vertex i: # of edges m = 1/2 of sum of degrees of all the vertices: The mean degree c of a vertex in an undirected graph: Density ρ of a graph: A network is dense if density ρ tends to be a constant as n → ∞ A network is sparse if density ρ → 0 as n → ∞. The fraction of nonzero element in the adjacency matrix tends to zero Internet, WWW and friendship networks are usually regarded as sparse 16 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 5/9/2017 17 Some Interesting Network Quantities Connected components: how many, and how large? Network diameter: maximum (worst-case) or average? exclude infinite distances? (disconnected components) the small-world phenomenon Clustering: to what extent that links tend to cluster “locally”? what is the balance between local and long-distance connections? what roles do the two types of links play? Degree distribution: what is the typical degree in the network? what is the overall distribution? 18 A “Canonical” Natural Network has… Few connected components: often only 1 or a small number, indep. of network size Small diameter: often a constant independent of network size (like 6) or perhaps growing only logarithmically with network size or even shrink? typically exclude infinite distances A high degree of clustering: considerably more so than for a random network in tension with small diameter A heavy-tailed degree distribution: a small but reliable number of high-degree vertices often of power law form 19 Probabilistic Models of Networks All of the network generation models we will study are probabilistic or statistical in nature They can generate networks of any size They often have various parameters that can be set: size of network generated average degree of a vertex fraction of long-distance connections The models generate a distribution over networks Statements are always statistical in nature: with high probability, diameter is small on average, degree distribution has heavy tail Thus, we’re going to need some basic statistics and probability theory 20 Probability and Random Variables A random variable X is simply a variable that probabilistically assumes values in some set set of possible values sometimes called the sample space S of X sample space may be small and simple or large and complex S = {Heads, Tails}, X is outcome of a coin flip S = {0,1,…,U.S. population size}, X is number voting democratic S = all networks of size N, X is generated by preferential attachment Behavior of X determined by its distribution (or density) for each value x in S, specify Pr[X = x] these probabilities sum to exactly 1 (mutually exclusive outcomes) complex sample spaces (such as large networks): distribution often defined implicitly by simpler components might specify the probability that each edge appears independently this induces a probability distribution over networks may be difficult to compute induced distribution 21 Some Basic Notions and Laws Independence: let X and Y be random variables independence: for any x and y, Pr[X = x & Y = y] = Pr[X=x]Pr[Y=y] intuition: value of X does not influence value of Y, vice-versa dependence: e.g. X, Y coin flips, but Y is always opposite of X Expected (mean) value of X: only makes sense for numeric random variables “average” value of X according to its distribution formally, E[X] = S (Pr[X = x] X), sum is over all x in S often denoted by m always true: E[X + Y] = E[X] + E[Y] true only for independent random variables: E[XY] = E[X]E[Y] Variance of X: Var(X) = E[(X – m)^2]; often denoted by s^2 standard deviation is sqrt(Var(X)) = s Union bound: for any X, Y, Pr[X=x & Y=y] <= Pr[X=x] + Pr[Y=y] 22 Convergence to Expectations Let X1, X2,…, Xn be: independent random variables with the same distribution Pr[X=x] expectation m = E[X] and variance s2 independent and identically distributed (i.i.d.) essentially n repeated “trials” of the same experiment natural to examine r.v. Z = (1/n) S Xi, where sum is over i=1,…,n example: number of heads in a sequence of coin flips example: degree of a vertex in the random graph model E[Z] = E[X]; what can we say about the distribution of Z? Central Limit Theorem: as n becomes large, Z becomes normally distributed with expectation m and variance s2/n 23 The Normal Distribution The normal or Gaussian density: applies to continuous, real-valued random variables characterized by mean m and standard deviation s density at x is defined as 2 2 (1/(s sqrt(2p))) exp(-(x-m) /2s ) 2 special case m = 0, s = 1: a exp(-x /b) for some constants a,b > 0 peaks at x = m, then dies off exponentially rapidly the classic “bell-shaped curve” exam scores, human body temperature, remarks: can control mean and standard deviation independently can make as “broad” as we like, but always have finite variance 24 The Binomial Distribution Coin with Pr[heads] = p, flip n times, probability of getting exactly k heads: choose (n, k) = pk(1-p)n-k For large n and p fixed: approximated well by a normal with m = np, s = sqrt(np(1-p)) s/m 0 as n grows leads to strong large deviation www.professionalgambler.com/ binomial.html bounds 25 The Poisson Distribution Like binomial, applies to variables taken on integer values > 0 Often used to model counts of events number of phone calls placed in a given time period number of times a neuron fires in a given time period Single free parameter l, probability of exactly x events: exp(-l) lx/x! mean and variance are both l single photoelectron distribution Binomial distribution with n large, p = l/n (l fixed) converges to Poisson with mean l 26 Power Law (or Pareto) Distributions Heavy-tailed, pareto, or power law distributions: For variables assuming integer values > 0 a probability of value x ~ 1/x Typically 0 < a < 2; smaller a gives heavier tail sometimes also referred to as being scale-free For binomial, normal, and Poisson distributions the tail probabilities approach 0 exponentially fast What kind of phenomena does this distribution model? What kind of process would generate it? 27 Distinguishing Distributions in Data All these distributions are idealized models In practice, we do not see distributions, but data Typical procedure to distinguish between Poisson, power law, … might restrict our attention to a range of values of interest accumulate counts of observed data into equal-sized bins look at counts on a log-log plot power law: a log(Pr[X = x]) = log(1/x ) = -a log(x) linear, slope –a Normal: 2 2 log(Pr[X = x]) = log(a exp(-x /b)) = log(a) – x /b non-linear, concave near mean Poisson: x log(Pr[X = x]) = log(exp(-l) l /x!) also non-linear 28 Zipf’s Law Pareto distribution vs. Zipf’s Law Pareto distributions are continuous probability distributions Zipf's law: a discrete counterpart of the Pareto distribution Zipf's law: Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. General theme: rank events by their frequency of occurrence resulting distribution often is a power law! Other examples: North America city sizes personal income file sizes genus sizes (number of species) 29 Zipf Distribution The same data plotted on linear and logarithmic scales. Both plots show a Zipf distribution with 300 data points Linear scales on both axes 5/9/2017 Logarithmic scales on both axes 30 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 31 Some Models of Network Generation Random graphs (Erdös-Rényi models): gives few components and small diameter does not give high clustering and heavy-tailed degree distributions is the mathematically most well-studied and understood model Watts-Strogatz models: give few components, small diameter and high clustering does not give heavy-tailed degree distributions Scale-free Networks: gives few components, small diameter and heavy-tailed distribution does not give high clustering Hierarchical networks: few components, small diameter, high clustering, heavy-tailed Affiliation networks: models group-actor formation 32 Models of Social Network Generation Random Graphs (Erdös-Rényi models) Watts-Strogatz models Scale-free Networks 5/9/2017 33 Basic Network Measures: Degrees and Clustering Coefficients Let a network G = (V, E), degree of a vertex Undirected network: d(vi): Directed network In-degree of a vertex din(vi): Out-degree of a vertex dout(vi): Clustering coefficients Let Nv be the set of adjacent vertices of v, kv be the number of adjacent vertices to node v Local clustering coefficient for directed network Local clustering coefficient for undirected network For the whole network: Averaging the local clustering coefficient of all the vertices (Watts & Strogatz): 34 The Erdös-Rényi (ER) Model: A Random Graph Model A random graph is obtained by starting with a set of N vertices and adding edges between them at random Different random graph models produce different probability distributions on graphs Most commonly studied is the Erdős–Rényi model, denoted G(N,p), in which every possible edge occurs independently with probability p Random graphs were first defined by Paul Erdős and Alfréd Rényi in their 1959 paper "On Random Graphs” The usual regime of interest is when p ~ 1/N, N is large e.g., p = 1/2N, p = 1/N, p = 2/N, p=10/N, p = log(N)/N, etc. in expectation, each vertex will have a “small” number of neighbors will then examine what happens when N infinity can thus study properties of large networks with bounded degree Sharply concentrated; not heavy-tailed 35 Erdös-Rényi Model (1959) Connect with probability p Pál Erdös p=1/6 N=10 k~1.5 Poisson distribution (1913-1996) - Democratic - Random 36 The a-model The a-model has the following parameters or “knobs”: N: size of the network to be generated k: the average degree of a vertex in the network to be generated p: the default probability two vertices are connected a: adjustable parameter dictating bias towards local connections For any vertices u and v: define m(u,v) to be the number of common neighbors (so far) Key quantity: the propensity R(u,v) of u to connect to v if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect) if m(u,v) = 0, R(u,v) = p (no mutual friends no bias to connect) else, R(u,v) = p + (m(u,v)/k)^a (1-p) Generate new edges incrementally using R(u,v) as the edge probability; details omitted Note: a = infinity is “like” Erdos-Renyi (but not exactly) 37 The Watts and Strogatz Model Proposed by Duncan J. Watts and Steven Strogatz in their joint 1998 Nature paper A random graph generation model that produces graphs with small-world properties, including short average path lengths and high clustering The model also became known as the (Watts) beta model after Watts used β to formulate it in his popular science book Six Degrees The ER graphs fail to explain two important properties observed in realworld networks: By assuming a constant and independent probability of two nodes being connected, they do not account for local clustering, i.e., having a low clustering coefficient Do not account for the formation of hubs. Formally, the degree distribution of ER graphs converges to a Poisson distribution, rather than a power law observed in most real-world, scale-free networks Watts-Strogatz Model C(p) : clustering coeff. L(p) : average path length (Watts and Strogatz, Nature 393, 440 (1998)) 39 Small Worlds and Occam’s Razor For small a, should generate large clustering coefficients we “programmed” the model to do so Watts claims that proving precise statements is hard… But we do not want a new model for every little property Erdos-Renyi small diameter a-model high clustering coefficient In the interests of Occam’s Razor, we would like to find a single, simple model of network generation… … that simultaneously captures many properties Watt’s small world: small diameter and high clustering 40 Discovered by Examining the Real World… Watts examines three real networks as case studies: the Kevin Bacon graph the Western states power grid the C. elegans nervous system For each of these networks, he: computes its size, diameter, and clustering coefficient compares diameter and clustering to best Erdos-Renyi approx. shows that the best a-model approximation is better important to be “fair” to each model by finding best fit Overall, if we care only about diameter and clustering, a is better than p 41 Case 1: Kevin Bacon Graph Vertices: actors and actresses Edge between u and v if they appeared in a film together Kevin Bacon No. of movies : 46 No. of actors : 1811 Average separation: 2.79 Is Kevin Bacon the most connected actor? NO! 5/9/2017 Rod Steiger Donald Pleasence Martin Sheen Christopher Lee Robert Mitchum Charlton Heston Eddie Albert Robert Vaughn Donald Sutherland John Gielgud Anthony Quinn James Earl Jones Average distance 2.537527 2.542376 2.551210 2.552497 2.557181 2.566284 2.567036 2.570193 2.577880 2.578980 2.579750 2.584440 # of movies 112 180 136 201 136 104 112 126 107 122 146 112 # of links 2562 2874 3501 2993 2905 2552 3333 2761 2865 2942 2978 3787 KevinBacon Bacon Kevin 2.786981 2.786981 46 46 1811 1811 Rank Name 1 2 3 4 5 6 7 8 9 10 11 12 … 876 876 … 42 Bacon -map #1 Rod Steiger #876 Kevin Bacon Donald #2 Pleasence #3 Martin Sheen 43 Case 2: New York State Power Grid Vertices: generators and substations Edges: high-voltage power transmission lines and transformers Line thickness and color indicate the voltage level Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV 44 Case 3: C. Elegans Nervous System Vertices: neurons in the C. elegans worm Edges: axons/synapses between neurons 45 Two More Examples M. Newman on scientific collaboration networks coauthorship networks in several distinct communities differences in degrees (papers per author) empirical verification of giant components small diameter (mean distance) high clustering coefficient Alberich et al. on the Marvel Universe purely fictional social network two characters linked if they appeared together in an issue “empirical” verification of heavy-tailed distribution of degrees (issues and characters) giant component rather small clustering coefficient 46 Towards Scale-Free Networks Major limitation of the Watts-Strogatz model It produces graphs that are homogeneous in degree In contrast, real networks are often scale-free networks inhomogeneous in degree, having hubs and a scale-free degree distribution. Such networks are better described by the preferential attachment family of models, such as the Barabási–Albert (BA) model The Watts-Strogatz model also implies a fixed number of nodes and thus cannot be used to model network growth The leads to the proposal of a new model: scale-free network, a network whose degree distribution follows a power law, at least asymptotically 47 Scale-free Networks: World Wide Web Nodes: WWW documents Links: URL links 800 million documents (S. Lawrence, 1999) ROBOT: collects all URL’s found in a document and follows them recursively R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999) 48 World Wide Web Expected Result Real Result out= 2.45 in = 2.1 k ~ 6 P(k=500) ~ 10-99 NWWW ~ 109 N(k=500)~10-90 Pout(k) ~ k-out P(k=500) ~ 10-6 Pin(k) ~ k- in NWWW ~ 109 N(k=500) ~ 103 J. Kleinberg, et. al, Proceedings of the ICCC (1999) 49 World Wide Web 3 l15=2 [125] 6 1 l17=4 [1346 7] 4 5 2 7 … < l > = ?? Finite size scaling: create a network with N nodes with Pin(k) and Pout(k) < l > = 0.35 + 2.06 log(N) 19 degrees of separation R. Albert et al, Nature (99) <l> nd.edu based on 800 million webpages [S. Lawrence et al Nature (99)] IBM A. Broder et al WWW9 (00) 50 What Does that Mean? Poisson distribution Exponential Network Power-law distribution Scale-free Network 51 Scale-Free Networks The number of nodes (N) is not fixed Networks continuously expand by additional new nodes WWW: addition of new nodes Citation: publication of new papers The attachment is not uniform A node is linked with higher probability to a node that already has a large number of links WWW: new documents link to well known sites (CNN, Yahoo, Google) Citation: Well cited papers are more likely to be cited again 52 Scale-Free Networks Start with (say) two vertices connected by an edge For i = 3 to N: for each 1 <= j < i, d(j) = degree of vertex j so far let Z = S d(j) (sum of all degrees so far) add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z Vertices j with high degree are likely to get more links! —“Rich get richer” Natural model for many processes: hyperlinks on the web new business and social contacts transportation networks Generates a power law distribution of degrees exponent depends on value of k Preferential attachment explains heavy-tailed degree distributions small diameter (~log(N), via “hubs”) Will not generate high clustering coefficient no bias towards local connectivity, but towards hubs 53 Case1: Internet Backbone Nodes: computers, routers Links: physical lines (Faloutsos, Faloutsos and Faloutsos, 1999) 54 Internet-Map 5/9/2017 55 Case2: Actor Connectivity Days of Thunder (1990) Far and Away (1992) Eyes Wide Shut (1999) Nodes: actors Links: cast jointly N = 212,250 actors k = 28.78 P(k) ~k- =2.3 56 Case 3: Science Citation Index 25 Nodes: papers Links: citations Witten-Sander PRL 1981 1736 PRL papers (1988) 2212 P(k) ~k- ( = 3) (S. Redner, 1998) 5/9/2017 57 Case 4: Science Coauthorship Nodes: scientist (authors) Links: write paper together (Newman, 2000, H. Jeong et al 2001) 5/9/2017 58 Case 5: Food Web Nodes: trophic species Links: trophic interactions R. Sole (cond-mat/0011195) R.J. Williams, N.D. Martinez Nature (2000) 59 Robustness of Random vs. Scale-Free Networks 5/9/2017 The accidental failure of a number of nodes in a random network can fracture the system into noncommunicating islands. Scale-free networks are more robust in the face of such failures Scale-free networks are highly vulnerable to a coordinated attack against their hubs 60 Bio-Map GENOME protein-gene interactions PROTEOME protein-protein interactions METABOLISM Bio-chemical reactions Citrate Cycle 5/9/2017 61 Boehring-Mennheim 5/9/2017 62 Prot Interaction Map: Yeast Protein Network Nodes: proteins Links: physical interactions (binding) P. Uetz, et al., Nature 403, 623-7 (2000) 63 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 64 Information on Social Network Heterogeneous, multi-relational data represented as a graph or network Nodes are objects May have different kinds of objects Objects have attributes Objects may have labels or classes Edges are links May have different kinds of links Links may have attributes Links may be directed, are not required to be binary Links represent relationships and interactions between objects - rich content for mining 65 Metrics (Measures) in Social Network Analysis (I) Betweenness: The extent to which a node lies between other nodes in the network. This measure takes into account the connectivity of the node's neighbors, giving a higher value for nodes which bridge clusters. The measure reflects the number of people who a person is connecting indirectly through their direct links Bridge: An edge is a bridge if deleting it would cause its endpoints to lie in different components of a graph. Centrality: This measure gives a rough indication of the social power of a node based on how well they "connect" the network. "Betweenness", "Closeness", and "Degree" are all measures of centrality. Centralization: The difference between the number of links for each node divided by maximum possible sum of differences. A centralized network will have many of its links dispersed around one or a few nodes, while a decentralized network is one in which there is little variation between the number of links each node possesses. 66 Metrics (Measures) in Social Network Analysis (II) Closeness: The degree an individual is near all other individuals in a network (directly or indirectly). It reflects the ability to access information through the "grapevine" of network members. Thus, closeness is the inverse of the sum of the shortest distances between each individual and every other person in the network Clustering coefficient: A measure of the likelihood that two associates of a node are associates themselves. A higher clustering coefficient indicates a greater 'cliquishness'. Cohesion: The degree to which actors are connected directly to each other by cohesive bonds. Groups are identified as ‘cliques’ if every individual is directly tied to every other individual, ‘social circles’ if there is less stringency of direct contact, which is imprecise, or as structurally cohesive blocks if precision is wanted. Degree (or geodesic distance): The count of the number of ties to other actors in the network. 67 Metrics (Measures) in Social Network Analysis (III) (Individual-level) Density: The degree a respondent's ties know one another/ proportion of ties among an individual's nominees. Network or global-level density is the proportion of ties in a network relative to the total number possible (sparse versus dense networks). Flow betweenness centrality: The degree that a node contributes to sum of maximum flow between all pairs of nodes (not that node). Eigenvector centrality: A measure of the importance of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to nodes having a high score contribute more to the score of the node in question. Local Bridge: An edge is a local bridge if its endpoints share no common neighbors. Unlike a bridge, a local bridge is contained in a cycle. Path Length: The distances between pairs of nodes in the network. Average path-length is the average of these distances between all pairs of nodes. 68 Metrics (Measures) in Social Network Analysis (IV) Prestige: In a directed graph prestige is the term used to describe a node's centrality. "Degree Prestige", "Proximity Prestige", and "Status Prestige" are all measures of Prestige. Radiality Degree: an individual’s network reaches out into the network and provides novel information and influence. Reach: The degree any member of a network can reach other members of the network. Structural cohesion: The minimum number of members who, if removed from a group, would disconnect the group Structural equivalence: Refers to the extent to which nodes have a common set of linkages to other nodes in the system. The nodes don’t need to have any ties to each other to be structurally equivalent. Structural hole: Static holes that can be strategically filled by connecting one or more links to link together other points. Linked to ideas of social capital: if you link to two people who are not linked you can control their communication 69 A Taxonomy of Common Link Mining Tasks Object-Related Tasks Link-based object ranking Link-based object classification Object clustering (group detection) Object identification (entity resolution) Link-Related Tasks Link prediction Graph-Related Tasks Subgraph discovery Graph classification Generative model for graphs 70 Link-Based Object Ranking (LBR) Exploit the link structure of a graph to order or prioritize the set of objects within the graph Focused on graphs with single object type and single link type A primary focus of link analysis community Web information analysis PageRank and Hits are typical LBR approaches In social network analysis (SNA), LBR is a core analysis task Objective: rank individuals in terms of “centrality” Degree centrality vs. eigen vector/power centrality Rank objects relative to one or more relevant objects in the graph vs. ranks object over time in dynamic graphs 71 PageRank: Capturing Page Popularity (Brin & Page’98) Intuitions Links are like citations in literature A page that is cited often can be expected to be more useful in general PageRank is essentially “citation counting”, but improves over simple counting Consider “indirect citations” (being cited by a highly cited paper counts a lot…) Smoothing of citations (every page is assumed to have a non-zero citation count) PageRank can also be interpreted as random surfing (thus capturing popularity) 72 The PageRank Algorithm (Brin & Page’98) Random surfing model: At any page, With prob. a, randomly jumping to a page With prob. (1 – a), randomly picking a link to follow d1 d3 d2 0 1 M 0 1/ 2 0 0 1 1/ 2 1/ 2 1/ 2 0 0 0 0 0 0 pt 1 (di ) (1 a ) d4 d j IN ( di ) p(di ) [ k m ji pt (d j ) a k Same as a/N (why?) 1 pt (d k ) N 1 a (1 a )mki ] p (d k ) N p (a I (1 a ) M )T p Initial value p(d)=1/N “Transition matrix” Iij = 1/N Stationary (“stable”) distribution, so we ignore time Iterate until converge Essentially an eigenvector problem…. 73 HITS: Capturing Authorities & Hubs (Kleinberg’98) Intuitions Pages that are widely cited are good authorities Pages that cite many other pages are good hubs The key idea of HITS Good authorities are cited by good hubs Good hubs point to good authorities Iterative reinforcement … 74 The HITS Algorithm (Kleinberg 98) d1 d3 d2 d4 0 1 A 0 1 h( d i ) a(di ) 0 0 1 1 1 0 0 0 1 0 0 0 d j OUT ( di ) d j IN ( di ) h Aa ; “Adjacency matrix” Initial values: a=h=1 a(d j ) h( d j ) a AT h h AAT h ; a AT Aa Iterate Normalize: a(di ) h(di ) 1 2 i 2 i Again eigenvector problems… 5/9/2017 75 Block-level Link Analysis (Cai et al. 04) Most of the existing link analysis algorithms, e.g. PageRank and HITS, treat a web page as a single node in the web graph However, in most cases, a web page contains multiple semantics and hence it might not be considered as an atomic and homogeneous node Web page is partitioned into blocks using the visionbased page segmentation algorithm extract page-to-block, block-to-page relationships Block-level PageRank and Block-level HITS 76 Link-Based Object Classification (LBC) Predicting the category of an object based on its attributes, its links and the attributes of linked objects Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc. Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations Epidemics: Predict disease type based on characteristics of the patients infected by the disease Communication: Predict whether a communication contact is by email, phone call or mail 77 Challenges in Link-Based Classification Labels of related objects tend to be correlated Collective classification: Explore such correlations and jointly infer the categorical values associated with the objects in the graph Ex: Classify related news items in Reuter data sets (Chak’98) Simply incorp. words from neighboring documents: not helpful Multi-relational classification is another solution for link-based classification 78 Group Detection Cluster the nodes in the graph into groups that share common characteristics Web: identifying communities Citation: identifying research communities Methods Hierarchical clustering Blockmodeling of SNA Spectral graph partitioning Stochastic blockmodeling Multi-relational clustering 79 Entity Resolution Predicting when two objects are the same, based on their attributes and their links Also known as: deduplication, reference reconciliation, coreference resolution, object consolidation Applications Web: predict when two sites are mirrors of each other Citation: predicting when two citations are referring to the same paper Epidemics: predicting when two disease strains are the same Biology: learning when two names refer to the same protein 80 Entity Resolution Methods Earlier viewed as pair-wise resolution problem: resolved based on the similarity of their attributes Importance at considering links Coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents Use of links in resolution Collective entity resolution: one resolution decision affects another if they are linked Propagating evidence over links in a depen. graph Probabilistic models interact with different entity recognition decisions 81 Link Prediction Predict whether a link exists between two entities, based on attributes and other observed links Applications Web: predict if there will be a link between two pages Citation: predicting if a paper will cite another paper Epidemics: predicting who a patient’s contacts are Methods Often viewed as a binary classification problem Local conditional probability model, based on structural and attribute features Difficulty: sparseness of existing links Collective prediction, e.g., Markov random field model 82 Link Cardinality Estimation Predicting the number of links to an object Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of outlinks Citation: predicting the impact of a paper based on the number of citations Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease Predicting the number of objects reached along a path from an object Web: predicting number of pages retrieved by crawling a site Citation: predicting the number of citations of a particular author in a specific journal 83 Subgraph Discovery Find characteristic subgraphs Focus of graph-based data mining Applications Biology: protein structure discovery Communications: legitimate vs. illegitimate groups Chemistry: chemical substructure discovery Methods Subgraph pattern mining Graph classification Classification based on subgraph pattern analysis 84 Metadata Mining Schema mapping, schema discovery, schema reformulation cite – matching between two bibliographic sources web - discovering schema from unstructured or semistructured data bio – mapping between two medical ontologies 85 Link Mining Challenges Logical vs. statistical dependencies Feature construction: Aggregation vs. selection Instances vs. classes Collective classification and collective consolidation Effective use of labeled & unlabeled data Link prediction Closed vs. open world Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming to name a few) 86 Social Network Analysis Social Networks: An Introduction Primitives for Network Analysis Different Network Distributions Models of Social Network Generation Mining on Social Network Summary 87 Ref: Mining on Social Networks D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03 P. Domingos and M. Richardson, Mining the Network Value of Customers. KDD’01 M. Richardson and P. Domingos, Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02 D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the Spread of Influence through a Social Network. KDD’03. P. Domingos, Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005. S. Brin and L. Page, The anatomy of a large scale hypertextual Web search engine. WWW7. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99 D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004. Lecture notes from Lise Getoor’s website: www.cs.umd.edu/~getoor/ 88 5/9/2017 89