Download Slide 1

I L M I K N I N Jerry Scripps G N Overview    What is link mining? Motivation Preliminaries     definitions metrics network types Link mining techniques What is Link Mining? Graph Theory Statistics Link Mining Data Machine Learning Social Network Analysis Database What is Link Mining? Examples:  Discovering communities within collaboration networks  Finding authoritative web pages on a given topic  Selecting the most influential people in a social network Link Mining – Motivation Emerging Data Sets     World wide web Social networking Collaboration databases etc. Link Mining – Motivation Direct Applications      What is the community around msu.edu? What are the authoritative pages? Who has the most influence? Who is the likely member of terrorist cell? Is this a news story about crime, politics or business? Link Mining – Motivation Indirect Applications   Convert ordinary data sets into networks Integrate link mining techniques into other techniques Preliminaries    Definitions Metrics Network Types Definitions Community Node (vertex, point, object) Link (edge, arc) Metrics Node     Degree Closeness Betweenness Clustering coefficient Node Pair            Graph distance Min-cut Common neighbors Jaccard’s coef Adamic/adar Pref. attachment Katz Hitting time Rooted pageRank simRank Bibliographic metrics Network    Characteristic path length Clustering coefficient Min-cut Network Types Watts & Strogatz Regular Small World Random Networks – Scale-free  GVSU FaceBook GVSU FaceBook (log scale) 1000 1000 800 600 400 100 10 200 0 600 500 400 300 Degree 200 100 0 1 1000 100 10 Degree 1 Counts  Barabasi & Bonabeau Degree follows a power law ~ 1/kn Can be found in a wide variety of real-world networks Counts  Network recap Network Type Clustering coefficient Random Low Characteristic Power Law path length Low No Regular High High No Small world High Low ? Scale-free ? ? Yes Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Link-Based Classification ? Include features from linked objects:  building a single model on all features  Fusion of link and attribute models Link-Based Classification Chakrabarti, et al.   Copying data from neighboring web pages actually reduced accuracy Using the label from neighboring page improved accuracy 111011 ? 101011 B 010010 A 011110 A Link-Based Classification Lu & Getoor  Define vectors for attributes and links   Attribute data OA(X) Link data LD(X) constructed using    mode (single feature – class of plurality) count (feature for each class – count for neighbors) binary (feature for each class – 0/1 if exists) 111011 ? 101011 B 010010 A 011110 A OA (attr) LD (link) 111011 210 1 A … … Model 1Model Model 2 Link-Based Classification Lu & Getoor  Define probabilities for both 1 P(c | w , OA( X ))   Attribute exp(  w OA( X )c)  1 o   Link P(c | wl , LD( X ))  T o 1 exp(  wlT LD( X )c)  1 Class estimation: Cˆ ( X )  P(c | w0 , OA( X )) P(c | w1 , LD( X )) Link-Based Classification Summary    Using class of neighbors improves accuracy Using separate models for attribute and link data further improves accuracy Other considerations:   improvements are possible by using community information knowledge of network type could also benefit classifier Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Link Prediction Link Prediction Liben-Nowell and Kleinberg Tested node-pair metrics:  Graph distance  Common neighbors  Jaccards coefficient  Adamic/adar  Preferential attachment  Katz  Hitting time  Rooted PageRank  SimRank Neighborhood Ensemble of paths Link Prediction - results Link Prediction – summary     There is room for growth – best predictor has accuracy of only around 9% Predicting collaborations is difficult Finding communities could help if most collaborations are intra-community New problem could be to predict the direction of the link Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Ranking Ranking – Markov Chain Based    Random-surfer analogy Problem with cycles PageRank uses random vector Ranking – summary   Other methods such as HITS and SALSA also based on Markov chain Ranking has been applied in other areas:   text summarization anomaly detection Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Influence Maximizing influence model-based   Problem – finding the k best nodes to activate to maximize the number of nodes activated Models:   independent cascade – when activated a node has a one-time change to activate neighbors with prob. pij linear threshold – node becomes activated when the percent of its neighbors crosses a threshold Maximizing influence model-based     Models: independent cascade & linear threshold A function f:S→S*, can be created using either model Functions use monte-carlo, hill-climbing solution Submodular functions, f (S  {v})  f (S )  f (T  {v})  f (T ) where ST are proven in another work to be NP-C but by using a hill-climbing solution can get to within 1-1/e of optimum. Maximizing influence – cost/benefit  Assumptions:    If customer purchases profit is:    product x sells for $100 a discount of 10% can be offered to various prospective customers 90 if discount is offered 100 if discount is not offered Expected lift in profit (ELP) from offering discount is:  90*P(buy|discount) - 100*P(buy|no discount) Maximizing influence – cost/benefit n ELP( X , Y , M )   r1 P( X i  1 | X k , Y , f i1 ( M ))  r0 P( X i  1 | X k , Y , f i 0 ( M ))  c k i 1   Goal is to find M that maximizes global ELP Three approximations used:    single pass greedy hill-climbing       Xi is the decision of customer i to buy Y is vector of product attributes M is vector of marketing decision f is a function to set the ith element of M r0 and r1 are revenue gained c is the cost of marketing Comparison of approaches Size of starting set uses attributes probabilities   Cost/benefit variable - based on max. lift yes extracted from data set Model-based fixed no assigned to links An extension would be to spread influence to the most number of communities Improvements can be made in speed Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Communities Gibson, Kleinberg and Raghavan Query Search Engine Root Set Base Set: add forward and back links Use HITS to find top 10 hubs and authorities Reddy and Kitsuregawa    Bipartite graph Given an initial set of nodes T build I from the nodes pointed to from T Repeat:   use relax_cocite to expand T and I prune T and I using dense bipartite graph function DBPG(T,I,α,β) T u v w I Flake, Lawrence and Giles        Uses Min-cut Start with seed set Add linked nodes Find nodes from outgoing links Create virtual source node Add virtual sink linking it to all nodes Find the min-cut of the virtual source and sink Neville, Adler and Jensen A     0110 Distance based on links and attributes If link exists score is number of B common attributes zero otherwise 1100 score(A,B)=2, score(A,C)=1, score(B,C)=0 Used with 3 partitioning algorithms:  Karger’s Min-Cut  MajorClust  Spectral partitioning by Shi & Malik C 1101 Communities - summary   There are many options for building communities around a small group of nodes Possible future directions   finding communities in networks having different link types impact of network type on community finding techniques Techniques       Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion Link Completion Goldenberg, Kubica and Komarek Problem: given a network and n-1 members of a community find the nth         random counting popular NB NN cGraph BayesNet EBS and LR Conclusions   Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve Opportunities for improvements exist in   using community knowledge using network knowledge We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars. Alan Chadwick Ranking    Based on Markov Chain Rank is sum of node weights from incoming links Breaks down when cycles exist 3 2 4 5 9 6 14 15 9 Ranking - continued  General approach     PageRank  HITS approach     ap = authority score for p Bp = backlinks of p a p    aq / N p  (1   ) E ( p) qB p ap = authority score for p hp = hub score for p Bp = backlinks of p Normalize between iterations ap  hp  h qB p a qB p q q

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Slide 1