Using Graphs in Unstructured and Semistructured Data Mining
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen
ACFOCS 2004

Acknowledgments
• C. Faloutsos, CMU
• W. Cohen, CMU
• IBM Almaden (many colleagues)
• IIT Bombay (many students)
• S. Sarawagi, IIT Bombay
• S. Sudarshan, IIT Bombay

Graphs are everywhere
• Phone network, Internet, Web
• Databases, XML, email, blogs
• Web of trust (epinion)
• Text and language artifacts (WordNet)
• Commodity distribution networks
[Figures: Internet Map [lumeta.com]; Food Web [Martinez1991]; Protein Interactions [genomebiology.com]]

Why analyze graphs?
• What properties do real-life graphs have?
• How important is a node? What is importance?
• Who is the best customer to target in a social network?
• Who spread a raging rumor?
• How similar are two nodes?
• How do nodes influence each other?
• Can I predict some property of a node based on its neighborhood?

Outline, some more detail
• Part 1 (Modeling graphs)
  • What do real-life graphs look like?
  • What laws govern their formation, evolution and properties?
  • What structural analyses are useful?
• Part 2 (Analyzing graphs)
  • Modeling data analysis problems using graphs
  • Proposing parametric models
  • Estimating parameters
  • Applications from Web search and text mining

Modeling and generating realistic graphs

Questions
• What do real graphs look like?
  • Edges, communities, clustering effects
• What properties of nodes, edges are important to model?
  • Degree, paths, cycles, …
• What local and global properties are important to measure?
• How to artificially generate realistic graphs?

Modeling: why care?
• Algorithm design
  • Can skewed degree distribution make our algorithm faster?
• Extrapolation
  • How well will Pagerank work on the Web 10 years from now?
• Sampling
  • Make sure scaled-down algorithm shows same performance/behavior on large-scale data
• Deviation detection
  • Is this page trying to spam the search engine?

Laws – degree distributions
• Q: avg degree is ~10 - what is the most probable degree?
[Figure: count vs. degree, with degree 10 marked]

Power-law: outdegree O
• The plot is linear in log-log scale [FFF'99]
• freq = degree^(-2.15)
[Figure: frequency vs. outdegree, Nov '97; exponent = slope, O = -2.15]

Power-law: rank R
• The plot is a line in log-log scale
[Figure: outdegree vs. rank (nodes in decreasing outdegree order), Dec '98; exponent = slope, R = -0.74]

Eigenvalues
• Let A be the adjacency matrix of graph
• The eigenvalue λ satisfies A v = λ v, where v is some vector
• Eigenvalues are strongly related to graph topology
[Figure: example graph on nodes A, B, C, D with its adjacency matrix]

Power-law: eigenvalues of E
• Eigenvalues in decreasing order
[Figure: eigenvalue vs. rank of decreasing eigenvalue, Dec '98; exponent = slope, E = -0.48]

The Node Neighborhood
• N(h) = # of pairs of nodes within h hops
• Let average degree = 3
• How many neighbors should I expect within 1,2,… h hops?
• Potential answer:
  • 1 hop → 3 neighbors
  • 2 hops → 3 * 3
  • …
  • h hops → 3^h
• WE HAVE DUPLICATES!
• 'avg' degree: meaningless!

Power-law: hop-plot H
• Pairs of nodes as a function of hops: N(h) = h^H
[Figures: # of pairs vs. hops; Dec '98, H = 4.86; router level '95, H = 2.83]

Observation
• Q: Intuition behind 'hop exponent'?
• A: 'intrinsic=fractal dimensionality' of the network
[Figure: examples with N(h) ~ h^1 and N(h) ~ h^2]

Any other 'laws'?
• The Web looks like a "bow-tie" [Kumar+1999]
• IN, SCC, OUT, 'tendrils'
• Disconnected components

Generators
• How to generate graphs from a realistic distribution?
• Difficulty: simultaneously preserving many local and global properties seen in realistic graphs
• Erdos-Renyi: switch on each edge independently with some probability
  • Problem: degree distribution not power-law
• Degree-based
• Process-based ("preferential attachment")

Degree-based generator
• Fix the degree distribution (e.g., 'Zipf')
• Assign degrees to nodes
• Add matching edges to satisfy degrees
• No direct control over other properties
• "ACL model" [AielloCL2000]

Process-based: Preferential attachment
• Start with a clique with m nodes
• Add one node v at every time step
• v makes m links to old nodes
• Suppose old node u has degree d(u)
• Let $p_u = d(u) / \sum_w d(w)$
• v invokes a multinomial distribution defined by the set of p's
• And links to whichever u's show up
• At time t, there are m+t nodes, mt links
• What is the degree distribution?
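
The preferential attachment process above can be sketched in a few lines of Python. This is a minimal illustration, not the generator used for any result in these slides: the function name `preferential_attachment`, the `seed` parameter, and the choice to allow repeated targets (multi-edges) are assumptions made for the example. It also assumes m ≥ 2 so that the starting clique has at least one edge.

```python
import random


def preferential_attachment(m, steps, seed=0):
    """Grow a graph by preferential attachment; return a list of undirected edges.

    Hypothetical helper for illustration. Assumes m >= 2.
    """
    if m < 2:
        raise ValueError("need m >= 2 so every starting node has nonzero degree")
    rng = random.Random(seed)
    # Start with a clique on nodes 0..m-1.
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]
    # Each node appears in 'endpoints' once per incident edge, so a uniform
    # draw picks node u with probability d(u) / sum_w d(w).
    endpoints = [x for e in edges for x in e]
    for t in range(steps):
        v = m + t  # the new node added at this time step
        # Draw all m targets before adding v, so v never links to itself.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges
```

Counting how often each node occurs among the edge endpoints (for example with `collections.Counter`) gives the empirical degree distribution, which can then be plotted on log-log axes.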
Preferential attachment: analysis
• $k_i(t)$ = degree of node i at time t
  • Discrete random variable
  • Approximate as continuous random variable
• Let $\kappa_i(t) = E(k_i(t))$, expectation over random linking choices
• At time t, the infinitesimal expected growth rate of $\kappa_i(t)$ is, by linearity of expectation,
  $$\frac{\partial \kappa_i(t)}{\partial t} = m \cdot \frac{\kappa_i(t)}{2mt} = \frac{\kappa_i(t)}{2t} \quad\Rightarrow\quad \kappa_i(t) = m \sqrt{t / t_i}$$
  where m is the number of degrees to add, 2mt is the total degree at t, and $t_i$ is the time at which node i was born

Preferential attachment, continued
• Expected degree of each node grows as square-root of age: $\kappa_i(t) = m \sqrt{t / t_i}$
• Let the current time be t
• A node must be old enough for its degree to be large; for $\kappa_i(t) > k$, we need $t_i < m^2 t / k^2$
• Therefore, the fraction of nodes with degree larger than k is $\frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}$
• Pr(degree = k) ≈ const/k^3 (data closer to 2)

Bipartite cores
• Basic preferential attachment does not
  • Explain dense/complete bipartite cores (100,000's in a O(20 million)-page crawl)
  • Account for influence of search engines
• The story isn't over yet
[Figures: a 2:3 core (n:m core); number of cores vs. log m, for n = 2 to n = 7]

Other process-based generators
• "Copying model" [KumarRRTU2000]
  • New node v picks old reference node r u.a.r.
  • v adds k new links to old nodes; for ith link:
  • W.p. a add a link to an old node picked u.a.r.
  • W.p. 1–a copy the ith link from r
• More difficult to analyze
• Reference node → compression techniques!
• H.O.T.: connect to closest, high-connectivity neighbor [Fabrikant+2002]
• Winner does not take all [Pennock+2002]

Reference-based graph compression
• Well-motivated: pack graph into limited fast memory for, e.g., query-time Web analysis
• Standard approach
  • Assign integer IDs to URLs, lexicographic order
  • Delta or Gamma encoding of outlink IDs
• If link-copying is rampant, should be able to compress Outlinks(u) by recording
  • A reference node r
  • The difference between Outlinks(r) and Outlinks(u) …the "correction"
• Finding r: what's optimal? practical? [Adler, Mitzenmacher, Boldi, Vigna 2002–2004]

Reference-based compression, cont'd
• r is a candidate reference for u if the overlap of Outlinks(r) and Outlinks(u) is "large enough"
• Given G, construct G' in which…
  • Directed edge from r to u with edge cost = number of bits needed to write down the difference between Outlinks(u) and Outlinks(r)
  • Dummy node z, z has no outlinks in G
  • z connected to each u in G'
  • cost(z,u) = #bits to write Outlinks(u) w/o ref
• Shortest path tree rooted at z
• In practice, pick "recent" r… 2.58 bits/link

Summary: Power laws are everywhere
• Bible: rank vs. word frequency
• Length of file transfers [Bestavros+]
• Web hit counts [Huberman]
• Click-stream data [Montgomery+01]
• Lotka's law of publication count (CiteSeer data)
[Figures: log(freq) vs. log(rank), with "a" and "the" marked; log(count) vs. log(#citations), with J. Ullman marked]

Resources
• Generators
  • R-MAT [email protected]
  • BRITE www.cs.bu.edu/brite/
  • INET topology.eecs.umich.edu/inet
• Visualization tools
  • Graphviz www.graphviz.org
  • Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
• Kevin Bacon web site www.cs.virginia.edu/oracle
• Erdös numbers etc.
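
Going back to the "standard approach" in the reference-based compression slides (integer IDs, then delta or gamma encoding of outlink IDs), the idea can be sketched as follows. This is an illustrative sketch only: it assumes "gamma" refers to the Elias gamma code, that outlink IDs are non-negative integers, and that gaps between sorted IDs are what gets coded. The function names are hypothetical, and real systems pack bits rather than building strings.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]
    # (length - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def encode_outlinks(ids):
    """Sort the outlink IDs and gamma-code the gaps between successive IDs."""
    bits = []
    prev = -1  # so that the first gap is id + 1, which is always >= 1
    for x in sorted(set(ids)):
        bits.append(gamma_encode(x - prev))
        prev = x
    return "".join(bits)


def decode_outlinks(bits):
    """Invert encode_outlinks, returning the sorted list of outlink IDs."""
    ids, prev, i = [], -1, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        ids.append(prev)
    return ids
```

Small gaps get short codes, which is why lexicographic ID assignment helps: pages on the same site tend to receive nearby IDs, and links are often to nearby pages.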
R-MAT: Recursive MATrix generator
• Goals
  • Power-law in- and outdegrees
  • Power-law eigenvalues
  • Small diameter ("six degrees of separation")
  • Simple, few parameters
• Approach
  • Subdivide the adjacency matrix
  • Choose a quadrant with probability (a,b,c,d)
[Figure: 2^n × 2^n adjacency matrix (From × To) split into quadrants a (0.5), b (0.1), c (0.15), d (0.25)]

R-MAT algorithm, cont'd
• Subdivide the adjacency matrix
• Choose a quadrant with probability (a,b,c,d)
• Recurse till we reach a 1×1 cell
• By construction
  • Rich gets richer for in- and out-degree
  • Self-similar (communities within communities)
  • Small diameter
[Figure: 2^n × 2^n matrix recursively subdivided into quadrants a, b, c, d]

Evaluation on clickstream data
• R-MAT matches it well
[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Singular value vs Rank; Left "Network value"; Right "Network value"]

Topic structure of the Web
• Measure correlations between link proximity and content similarity
• How to characterize "topics"?
  • Started with http://dmoz.org
  • Keep pruning until all leaf topics have enough (>300) samples
  • Approx 120k sample URLs
  • Flatten to approx 482 topics
  • Train a text classifier
  • Characterize new document d as a vector of probabilities $p_d = (\Pr(c|d)\ \forall c)$
[Figure: test doc → classifier → topic probabilities, e.g. Arts 0.1, Computers 0.3, Science 0.6]

Sampling the background topic distrib.
• What fraction of Web pages are about /Health/Diabetes?
• How to sample the Web?
• Invoke the "random surfer" model (Pagerank)
  • Walk from node to node
  • Sample trail adjusting for Pagerank
• Modify Web graph to do better sampling
  • Self loops
  • Bidirectional edges

Convergence
• Start from pairs of diverse topics
• Two random walks, sample from each walk
• Measure distance between topic distributions
  • L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in [0,2]
  • Below .05–.2 within 300–400 physical pages
[Figures: distribution difference vs. #hops (0 to 1000); background distribution over topics (Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports); legend: Stride=30k, Stride=75k]

Biases in topic directories
• Use Dmoz to train a classifier
• Sample the Web
• Classify samples
• Diff Dmoz topic distribution from Web sample topic distribution
• Report maximum deviation in fractions
• NOTE: Not exactly Dmoz
• Dmoz over-represents
  • Games.Video_Games
  • Society.People
  • Arts.Celebrities
  • ...Education.Colleges
  • ...Travel.Reservations
• Dmoz under-represents
  • …WWW…Directories!
  • Sports.Hockey
  • Society.Philosophy
  • Education…K12…
  • Recreation…Camping

Topic-specific degree distribution
• Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
• More realistic: u has a topic, and links to v with related topics
• Unclear if power-law should be upheld
[Figures: intra-topic linkage; inter-topic linkage]

Random forward walk without jumps
[Figures: L_1 distance vs. wander hops (0 to 20) for /Arts/Music and /Sports/Soccer; curves "From background" and "From hop0"]
• Sampling walk is designed to mix topics well
• How about walking forward without jumping?
  • Start from a page $u_0$ on a specific topic
  • Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  • Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with the background distribution

Observations and implications
• Forward walks wander away from starting topic slowly
• But do not converge to the background distribution
• Global PageRank ok also for topic-specific queries
  • Jump parameter d=.1–.2
  • Topic drift not too bad within path length of 5–10
  • Prestige conferred mostly by same-topic neighbors
• Also explains why focused crawling works
[Figure: random surfer; w.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.; high-prestige node]

Citation matrix
• Given a page is about topic i, how likely is it to link to topic j?
  • Matrix C[i,j] = probability that page about topic i links to page about topic j
  • Soft counting: for each link from u to v, C[i,j] += Pr(i|u)Pr(j|v)
• Applications
  • Classifying Web pages into topics
  • Focused crawling for topic-specific pages
  • Finding relations between topics in a directory

Citation, confusion, correction
• Classifier's confusion on held-out documents can be used to correct the citation matrix
[Figures: citation matrix (from topic × to topic) and confusion matrix (true topic × guessed topic) over Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]

Fine-grained views of citation
• Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers
• Clear block-structure derived from coarse-grain topics
• Strong diagonals reflect tightly-knit topic communities

Outline, some more detail
• Part 1 (Modeling graphs)
  • What do real-life graphs look like?
  • What laws govern their formation, evolution and properties?
  • What structural analyses are useful?
• Part 2 (Analyzing graphs)
  • Modeling data analysis problems using graphs
  • Proposing parametric models
  • Estimating parameters
  • Applications from Web search and text mining

Centrality and prestige

How important is a node?
• Degree, min-max radius, …
• Pagerank
• Maximum entropy network flows
• HITS and stochastic variants
• Stability and susceptibility to spamming
• Hypergraphs and nonlinear systems
• Using other hypertext properties
• Applications: Ranking, crawling, clustering, detecting obsolete pages

Importance/prestige as Pagerank
• A node is important if it is connected to important nodes
• "Random surfer" walks along links for ever
• If current node has 3 outlinks, take each with probability 1/3
• Importance = steady-state probability (ssp) of visit
• "Maxwell's equation for the Web":
  $$PR(v) = \sum_{(u,v) \in E} \frac{PR(u)}{\mathrm{OutDegree}(u)}$$
[Figure: node u with OutDegree(u)=3, one of its outlinks pointing to v]

(Simplified) Pagerank algorithm
• Let A be the node adjacency matrix
• Column normalize $A^T$
• Want vector p such that $A^T p = p$
• I.e. p is the eigenvector corresponding to the largest eigenvalue, which is 1
[Figure: five-node example graph, its From/To matrix $A^T$ (entries 1 and 1/2), and the system $A^T p = p$ for $p = (p_1, \ldots, p_5)$]

Intuition
• A as vector transformation: $x' = A x$, e.g. $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$, $x = (1, 0)^T$, $x' = (2, 1)^T$

Intuition
• By defn., eigenvectors remain parallel to themselves ('fixed points'): $A v_1 = \lambda_1 v_1$, with $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$, $v_1 = (0.52, 0.85)^T$, $\lambda_1 = 3.62$

Convergence
• Usually, fast:
• depends on ratio $\lambda_1 : \lambda_2$

Eigenvalues and epidemic networks
• Will a virus spreading across an arbitrary network create an epidemic?
• Susceptible-Infected-Susceptible (SIS)
  • Was healthy, did not get infected
  • Was infected, got cured without further attack
  • Was infected, got cured immediately after attack
  • Cured nodes immediately become susceptible
[Figure: state diagram; susceptible/healthy → infected & infectious (infected by neighbor); infected & infectious → susceptible/healthy (cured internally)]

The infection model
• (virus) Birth rate b (β in the figures): probability that an infected neighbor attacks
• (virus) Death rate d (δ in the figures): probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; healthy → infected with prob. β, infected → healthy with prob. δ]

Working parameters
• $p_{i,t}$ = prob. node i is infected at time t
• Prob. that node i does NOT receive an infection from neighbors:
  $$\zeta_{i,t} = \prod_{(i,j) \in E} \left[ p_{j,t-1}(1 - b) + (1 - p_{j,t-1}) \right] = \prod_{(i,j) \in E} \left( 1 - p_{j,t-1} b \right)$$
• Probability that node i is healthy at time t:
  $$1 - p_{i,t} = (1 - p_{i,t-1})\,\zeta_{i,t} + p_{i,t-1}\, d\, \zeta_{i,t} + \tfrac{1}{2}\, p_{i,t-1}\, d\, (1 - \zeta_{i,t})$$
  • First term: was healthy, did not get infected
  • Second term: was infected, got cured without further attack
  • Third term: was infected, got attacks from neighbors and yet cured itself in this time step (the factor 1/2 is somewhat arbitrary)

Time recurrence for $p_{i,t}$
• Assuming various probabilities are "suitably small":
  $$\prod_{(i,j) \in E} (1 - p_{j,t-1} b) \approx 1 - b \sum_{(i,j) \in E} p_{j,t-1}$$
• Recurrence of the probability of infection can be approximated by this linear form:
  $$p_{i,t} \approx (1 - d)\, p_{i,t-1} + b \sum_{(i,j) \in E} p_{j,t-1}$$
• In other words, the (symmetric) transition matrix is $S = (1 - d) I + bA$

Epidemic threshold
• Virus 'strength' s = b/d
• Epidemic threshold of a graph is defined as the value of τ, such that if strength s = b/d < τ, then an epidemic cannot happen
• Problem: compute epidemic threshold
[Figure: node N with neighbors N1, N2, N3; healthy → infected with prob. β, infected → healthy with prob. δ]

Epidemic threshold τ
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or third moment of degree?
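
To get a feel for the linearized recurrence $p_{i,t} \approx (1 - d)\, p_{i,t-1} + b \sum_{(i,j) \in E} p_{j,t-1}$, here is a small sketch that iterates it on an adjacency-list graph. The function names and the toy triangle graph are assumptions made for illustration. Since this is the linear approximation, the values are only meaningful as probabilities while they stay small.

```python
def sis_linear_step(p, adj, b, d):
    """One step of p_i <- (1 - d) * p_i + b * sum of p_j over neighbors j of i."""
    return [
        (1 - d) * p[i] + b * sum(p[j] for j in adj[i])
        for i in range(len(p))
    ]


def simulate(p0, adj, b, d, steps):
    """Iterate the recurrence; return the final vector and the total after each step."""
    p = list(p0)
    totals = [sum(p)]
    for _ in range(steps):
        p = sis_linear_step(p, adj, b, d)
        totals.append(sum(p))
    return p, totals


# Toy example (assumed): a triangle, with node 0 initially infected w.p. 0.1.
triangle = [[1, 2], [0, 2], [0, 1]]
_, weak = simulate([0.1, 0.0, 0.0], triangle, b=0.1, d=0.5, steps=20)
_, strong = simulate([0.1, 0.0, 0.0], triangle, b=0.3, d=0.5, steps=20)
```

On this regular graph every node has two neighbors, so each step multiplies the total by (1 - d) + 2b: the `weak` totals shrink by a factor 0.7 per step, while the `strong` totals grow by a factor 1.1. Which regime a general graph falls into is exactly the epidemic threshold question posed above.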
Analysis
• Eigenvectors of S and A are the same
• Eigenvalues are shifted and scaled: $\lambda_{i,S} = 1 - d + b\,\lambda_{i,A}$
• From spectral decomposition:
  $$p^{(n)} = S^n p^{(0)} = \sum_i \lambda_{i,S}^n\, u_{i,S}\, u_{i,S}^T\, p^{(0)}$$
• A sufficient condition for infection dying down is that $\lambda_{1,S} = 1 - d + b\,\lambda_{1,A} < 1$, or $b\,\lambda_{1,A} < d$, or $b/d < 1/\lambda_{1,A}$

Epidemic threshold
• An epidemic must die down if $\beta/\delta < \tau = 1/\lambda_{1,A}$
  • β = attack prob.
  • δ = recovery prob.
  • τ = epidemic threshold
  • $\lambda_{1,A}$ = largest eigenvalue of adj. matrix A
• Proof: [Wang+03]

Experiments
[Figure: number of infected nodes (0 to 500) vs. time (0 to 1000) on the Oregon graph, β = 0.001, δ = 0.05, 0.06, 0.07; curves for b/d > τ (above threshold), b/d = τ (at the threshold), b/d < τ (below threshold)]

Remarks
• "Primal" problem: topology design
  • Design a network resilient to infection
• "Dual" problem: viral marketing
  • Selectively convert important customers who can influence many others
• Will come back to this later

Back to the random surfer
• Practical issues with Pagerank [BrinP1997]
  • PR converges only if E is aperiodic and irreducible; make it so:
    $$PR(v) = \frac{d}{N} + (1 - d) \sum_{(u,v) \in E} \frac{PR(u)}{\mathrm{OutDegree}(u)}$$
  • d is the (tuned) probability of "teleporting" to one of N pages uniformly at random (0.1–0.2 in practice)
  • (Possibly) unintended consequences: topic sensitivity, stability

Prestige as network flow
• $y_{ij}$ = #surfers clicking from i to j per unit time
• Hits per unit time on page j is $H_j = \sum_{(i,j) \in E} y_{ij}$
• Flow is conserved at j: $\sum_{(i,j) \in E} y_{ij} = \sum_{(j,k) \in E} y_{jk}$
• The total traffic is $Y = \sum_j H_j = \sum_{i,j} y_{ij}$
• Normalize: $p_{ij} = y_{ij} / Y$
• Can interpret $p_{ij}$ as a probability
• Standard Pagerank corresponds to one solution: $p_{ij} = H_i / (Y \cdot \mathrm{OutDegree}(i))$
• Many other solutions possible

Maximum entropy flow [Tomlin2003]
• Flow conservation modeled using feature
  $$\forall r = 1, \ldots, N: \quad f_r(x_{ij}) = \begin{cases} 1 & j = r,\ (i,r) \in E \\ -1 & i = r,\ (r,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
• And the constraints $0 = E(f_r(x_{ij})) = \sum_{i,j} p_{ij}\, f_r(x_{ij})$
• Goal is to maximize $-\sum_{i,j} p_{ij} \log p_{ij}$ subject to $\sum_{i,j} p_{ij} = 1$
• Solution has form $p_{ij} = \exp(-\lambda_0 - \lambda_i + \lambda_j)$, up to the sign convention chosen for the multipliers
• $\lambda_i$ is the "hotness" of page i

Maxent flow results
• $\lambda_i$ ranking is better than Pagerank; $H_i$ ranking is worse
• Two IBM intranet data sets with known top URLs
[Figures: average rank (10^6 and 10^8) of known top URLs when sorted by Pagerank, $H_i$ and $\lambda_i$, vs. depth up to which dmoz.org URLs are used as ground truth; smaller rank is better]

HITS [Kleinberg1997]
• Two kinds of prestige
  • Good hubs link to good authorities
  • Good authorities are linked to by good hubs
  $$a(v) = \sum_{(u,v) \in E} h(u); \qquad h(u) = \sum_{(u,v) \in E} a(v)$$
• In matrix notation, iterations amount to $a \leftarrow E^T h$, $h \leftarrow E a$, i.e., $a \leftarrow E^T E a$, with interleaved normalization of h and a
• Note that scores are copied not divided

HITS graph acquisition
• Steps
  • Get root set via keyword query
  • Expand by one move forward and backward
  • Drop same-site links (why?)
• Each node assigned both a hub and an auth score
• Graph tends to be "topic-specific"
• Whereas Pagerank is usually run on "the entire Web" (but need not be)
[Figure: root set with IN and OUT neighbors]

HITS vs. Pagerank, more comments
• Dominant eigenvectors of different matrices
  • HITS: Eigensystems of $E E^T$ (h) and $E^T E$ (a)
  • Pagerank: eigensystem of $dJ + (1 - d) L^T$, where J is the N×N matrix with every entry 1/N and
    $$L(i,j) = \begin{cases} 1/\mathrm{OutDegree}(i) & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
• HITS copies scores, Pagerank distributes
• HITS drops same-site links, Pagerank does (can?) not
• Implications?
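
The "copies vs. distributes" contrast is easiest to see side by side. The sketch below implements the HITS iteration ($a \leftarrow E^T h$, $h \leftarrow E a$) and the teleporting Pagerank recurrence on an edge list. It is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, scores are normalized to sum to 1 (the slides only say "interleaved normalization"), and Pagerank mass at nodes without outlinks is spread uniformly, which the slides do not specify.

```python
def hits(edges, n, iters=50):
    """Return (hub, authority) scores; each hub COPIES its full score to its targets."""
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a = [0.0] * n
        for u, v in edges:          # a <- E^T h
            a[v] += h[u]
        total = sum(a) or 1.0
        a = [x / total for x in a]
        new_h = [0.0] * n
        for u, v in edges:          # h <- E a
            new_h[u] += a[v]
        total = sum(new_h) or 1.0
        h = [x / total for x in new_h]
    return h, a


def pagerank(edges, n, d=0.15, iters=100):
    """Return Pagerank scores; each node DIVIDES its score among its outlinks."""
    out = [0] * n
    for u, _ in edges:
        out[u] += 1
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [d / n] * n           # teleport with probability d
        for u, v in edges:
            nxt[v] += (1 - d) * pr[u] / out[u]
        # Assumption: mass at nodes with no outlinks is spread uniformly.
        dangling = sum(pr[u] for u in range(n) if out[u] == 0)
        pr = [x + (1 - d) * dangling / n for x in nxt]
    return pr


# Toy graph (assumed): 0 -> 2, 1 -> 2, 2 -> 3, 3 -> 0.
toy_edges = [(0, 2), (1, 2), (2, 3), (3, 0)]
hub, auth = hits(toy_edges, 4)
ranks = pagerank(toy_edges, 4)
```

On the toy graph, node 2 is pointed to by two hubs and ends up with the largest authority score, while node 1, which has no in-links, keeps only its teleport share d/N under Pagerank.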
HITS: Dyadic interpretation [CohnC2000]
• Graph includes many communities z
  • Query="Jaguar" gets auto, game, animal links
• Each URL is represented as two things
  • A document d
  • A citation c
• Max
  $$\prod_{(d,c) \in E} \Pr(d,c) = \prod_{(d,c) \in E} \Pr(d) \Pr(c|d) = \prod_{(d,c) \in E} \Pr(d) \sum_z \Pr(z|d) \Pr(c|z)$$
• Guess number of aspects zs and use [Hofmann 1999] to estimate Pr(c|z)
• These are the most authoritative URLs

Dyadic results for "Machine learning"
• Clustering based on citations + ranking within clusters

Spamming link-based ranking
• Recipe for spamming HITS
  • Create a hub linking to genuine authorities
  • Then mix in links to your customers' sites
  • Highly susceptible to adversarial behavior
• Recipe for spamming Pagerank
  • Buy a bunch of domains, cloak IP addresses
  • Host a site at each domain
  • Sprinkle a few links at random per page to other sites you own
  • Takes more work than spamming HITS

Stability of link analysis [NgZJ2001]
• Compute HITS authority scores and Pagerank
• Delete 30% of nodes/links at random
• Recompute and compare ranks; repeat
• Pagerank ranks more stable than HITS authority ranks
  • Why?
  • How to design more stable algorithms?
[Table: rank of each top-ranked page under HITS authority and under Pagerank, before and after each random-deletion trial]

Stability depends on graph and params
• Auth score is eigenvector for $E^T E = S$, say
• Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
• There exists an S' such that
  • S and S' are close: $\|S - S'\|_F = O(\lambda_1 - \lambda_2)$
  • But $\|u_1 - u'_1\|_2 = \Omega(1)$
• Pagerank p is eigenvector of $(\varepsilon U + (1 - \varepsilon) E)^T$
  • U is a matrix full of 1/N and ε is the jump prob
  • If set C of nodes are changed in any way, the new Pagerank vector p' satisfies
    $$\|p' - p\|_2 \le \frac{2 \sum_{u \in C} p_u}{\varepsilon}$$

Randomized HITS
• Each half-step, with probability ε, teleport to a node chosen uniformly at random
  $$a^{(t+1)} = \varepsilon \vec{1} + (1 - \varepsilon)\, E_{row}^T\, h^{(t)}$$
  $$h^{(t+1)} = \varepsilon \vec{1} + (1 - \varepsilon)\, E_{col}\, a^{(t+1)}$$
• Much more stable than HITS
• Results more meaningful
• ε near 1 will always stabilize
• Here ε was 0.2
[Table: rank of each top-ranked page under Randomized HITS and under Pagerank, before and after each random-deletion trial]

Another random walk variation of HITS
• SALSA: Stochastic HITS [Lempel+2000]
• Two separate random walks
  • From authority to authority via hub
  • From hub to hub via authority
• Transition probability
  $$\Pr(a_i \to a_j) = \sum_{h : (h,a_i),(h,a_j) \in E} \frac{1}{\mathrm{InDegree}(a_i)} \cdot \frac{1}{\mathrm{OutDegree}(h)}$$
• If transition graph is irreducible, $\pi_a \propto \mathrm{InDegree}(a)$
• For disconnected components, depends on relative size of bipartite cores
  • Avoids dominance of larger cores
[Figure: authorities a1, a2 with transition probabilities 1/3 and 1/2]

SALSA sample result ("movies")
• HITS: The Tightly-Knit Community (TKC) effect
• SALSA: Less TKC
  influence (but no reinforcement!)

Links in relational data [GibsonKR1998]
• (Attribute, value) pair is a node
  • Each node v has weight $w_v$
• Each tuple is a hyperedge
  • Tuple r has weight $x_r$
• HITS-like iterations to update weight $w_v$
  • For each tuple $r = (v, u_1, \ldots, u_k)$: $x_r = \oplus(w_{u_1}, \ldots, w_{u_k})$
  • Update weight $w_v = \sum_r x_r$
• Combining operator ⊕ can be sum, max, product, Lp avg, etc.

Distilling links in relational data
[Figure: distilled values for Database and Theory; columns Author, Author, Forum, Year]

Searching and annotating graph data

Searching graph data
• Nodes in graph contain text
  • Random → Intelligent surfer [RichardsonD2001]
  • Topic-sensitive Pagerank [Haveliwala2002]
  • Assigning image captions using random walks [PanYFD2004]
  • Rotting pages and links [BarYossefBKT2004]
• Query is a set of keywords
• All keywords may not match a single node
  • Implicit joins [Hulgeri+2001, Agrawal+2002]
  • Or rank aggregation [Balmin+2004] required

Intelligent Web surfer
• For keyword q:
  $$\Pr_q(j) = (1 - b) \Pr{}'_q(j) + b \sum_{(i,j) \in E} \Pr_q(i) \Pr_q(i \to j)$$
• Probability of teleporting to node j, where $R_q(k)$ is the relevance of node k wrt q:
  $$\Pr{}'_q(j) = \frac{R_q(j)}{\sum_{k \in V} R_q(k)}$$
• Probability of walking from i to j wrt q (pick out-link to walk on in proportion to relevance of target out-neighbor):
  $$\Pr_q(i \to j) = \frac{R_q(j)}{\sum_{(i,k) \in E} R_q(k)}$$
• Query = set of words:
  $$PR_Q(j) = \sum_{q \in Q} \Pr(q) \Pr_q(j)$$
• Pick a query word per some distribution, e.g.
  IDF

Implementing the intelligent surfer
• $PR_Q(j)$ approximates a walk that picks a query keyword using Pr(q) at every step
• Precompute and store $\Pr_q(j)$ for each keyword q in lexicon: space blowup = avg doc length
• Query-dependent PR rated better by volunteers

Topic-sensitive Pagerank
• High overhead for per-word Pagerank
• Instead, compute Pageranks for some collection of broad topics $PR_c(j)$
  • Topic c has sample page set $S_c$
  • Walk as in Pagerank
  • Jump to a node in $S_c$ uniformly at random
• "Project" query onto set of topics: $\Pr(c|Q) \propto \Pr(c) \prod_{q \in Q} \Pr(q|c)$
• Rank responses by projection-weighted Pageranks: $\mathrm{Score}(Q, j) = \sum_c \Pr(c|Q)\, PR_c(j)$

Topic-sensitive Pagerank results
• Users prefer topic-sensitive Pagerank on most queries to global Pagerank + keyword filter

Image captioning
• Segment images into regions
• Image has caption words
• Three-layer graph: image, regions, caption words
• Threshold on region similarity to connect regions (dotted)

Random walks with restarts
• Find regions in test image
• Connect regions to other nodes in the region layer using region similarity
• Random walk, restarting at test image node
• Pick words with largest visit probability
[Figure: three-layer graph of regions, images and words, with the test image attached]

More random walks: "Link rot"
• How stale is a page?
  • Last-mod unreliable
  • Automatic dead-link cleaners mask disuse
• A page is "completely stale" if it is "dead"
  • Let D be the set of pages which cannot be accessed (404 and other problems)
• How stale is a page u? Start with p ← u
  • If p ∈ D declare decay value of u to be 1, else
  • With probability σ declare decay value of u = 0
  • W.p.
    1–σ choose outlink v, set p ← v, loop

Page staleness results
• Decay score is correlated with, but generally larger than the fraction of dead outlinks on a page
• Removing direct dead links automatically does not eliminate live but "rotting" pages
[Figure: decay score vs. fraction of 404s]

Graph proximity search: two paradigms
• A single node as query response
  • Find node that matches query terms…
  • …or is "near" nodes matching query terms [Goldman+ 1998]
• A connected subgraph as query response
  • Single node may not match all keywords
  • No natural "page boundary" [Bhalotia+2002] [Agrawal+2002]

Single-node response examples
• Travolta, Cage
  • Actor, Face/Off
• Travolta, Cage, Movie
  • Face/Off
• Kleiser, Movie
  • Gathering, Grease
• Kleiser, Woo, Actor
  • Travolta
[Figure: data graph with "is-a", "acted-in" and "directed" edges among Movie (Gathering, Grease, Face/Off), Actor (A3, Travolta, Cage) and Director (Kleiser, Woo)]

Basic search strategy
• Node subset A activated because they match query keyword(s)
• Look for node near nodes that are activated
• Goodness of response node depends
  • Directly on degree of activation
  • Inversely on distance from activated node(s)

Proximity query: screenshot
http://www.cse.iitb.ac.in/banks/

Ranking a single node response
• Activated node set A
• Rank node r in "response set" R based on proximity to nodes a in A
  • Nodes have relevance $\rho_R$ and $\rho_A$ in [0,1]
  • Edge costs are "specified by the system"
• d(a,r) = cost of shortest path from a to r
• Bond between a and r:
  $$b(a, r) = \frac{\rho_A(a)\, \rho_R(r)}{d(a, r)^t}$$
• Parameter t tunes relative emphasis on distance and relevance score
• Several ad-hoc choices

Scoring single response nodes
• Additive: $\mathrm{score}(r) = \sum_{a \in A} b(a, r)$
• Belief: $\mathrm{score}(r) = 1 - \prod_{a \in A} (1 - b(a, r))$
• Goal: list a limited number of find nodes with the largest scores
• Performance issues
  • Assume the graph is in
    memory?
  • Precompute all-pairs shortest path (|V|^3)?
  • Prune unpromising candidates?

Hub indexing
• Decompose APSP problem using sparse vertex cuts
  • |A|+|B| shortest paths to p
  • |A|+|B| shortest paths to q
  • d(p,q)
• To find d(a,b) compare
  • d(a→p→b) not through q
  • d(a→q→b) not through p
  • d(a→p→q→b)
  • d(a→q→p→b)
• Greatest savings when |A| ≈ |B|
• Heuristics to find cuts, e.g. large-degree nodes
[Figure: node sets A and B separated by the vertex cut {p, q}, with a in A and b in B]

ObjectRank [Balmin+2004]
• Given a data graph with nodes having text
• For each keyword precompute a keyword-sensitive Pagerank [RichardsonD2001]
• Score of a node for multiple keyword search based on fuzzy AND/OR
  • Approximation to Pagerank of node with restarts to nodes matching keywords
• Use Fagin-merge [Fagin2002] to get best nodes in data graph

Connected subgraph as response
• Single node may not match all keywords
• No natural "page boundary"
• On-the-fly joins make up a "response page"
• Two scenarios
  • Keyword search on relational data
    • Keywords spread among normalized relations
  • Keyword search on XML-like or Web data
    • Keywords spread among DOM nodes and subtrees

Keyword search on relational data
• Tuple = node
• Some columns have text
• Foreign key constraints = edges in schema graph
• Query = set of terms
• No natural notion of a document
  • Normalization
  • Join may be needed to generate results
• Cycles may exist in schema graph: 'Cites'

Author                    Paper                   Writes               Cites
AuthorID  AuthorName      PaperID  PaperName      AuthorID  PaperID    Citing  Cited
A1        Chaudhuri       P1       DBXplorer      A1        P1         P2      P1
A2        Sudarshan       P2       BANKS          A2        P2
A3        Hulgeri                                 A3        P2

[Figure: schema graph over Author (AuthorID, AuthorName), Writes (AuthorID, PaperID), Paper (PaperID, PaperName), Cites (Citing, Cited)]

DBXplorer and DISCOVER
• Enumerate subsets of relations in schema graph which, when joined, may contain rows which have all keywords in the query
  • "Join trees" derived from schema graph
• Output SQL query for each join
  tree
• Generate joins, checking rows for matches [Agrawal+2001], [Hristidis+2002]
[Figure: schema graph over tables T1 to T5 with keywords K1, K2, K3 matched in different tables, and the candidate join trees derived from it]

Discussion
• Exploits relational schema information to contain search
• Pushes final extraction of joined tuples into RDBMS
• Faster than dealing with full data graph directly
• Coarse-grained ranking based on schema tree
• Does not model proximity or (dis)similarity of individual tuples
• No recipe for data with less regular (e.g. XML) or ill-defined schema

Motivation from Web search
• "Linux modem driver for a Thinkpad A22p"
  • Hyperlink path matches query collectively
  • Conjunction query would fail
• Projects where X and P work together
  • Conjunction may retrieve wrong page
• General notion of graph proximity
[Figure: linked pages "IBM Thinkpads" (A20m, A22p), "Thinkpad Drivers" (Windows XP, Linux), "Download / Installation tips" (Modem, Ethernet)]
[Figure: linked pages "The B System" (Group members: P, S, X), "Home Page of Professor X" (Papers: VLDB…; Students: P, Q), "P's home page" (I work on the B project.)]
Data structures for search
• Answer = tree with at least one leaf containing each keyword in query
  • Group Steiner tree problem, NP-hard
• Query term t found in source nodes $S_t$
• Single-source-shortest-path (SSSP) iterator
  • Initialize with a source (near-) node
  • Consider edges backwards
  • getNext() returns next nearest node
• For each iterator, each visited node v maintains for each t a set $v.R_t$ of nodes in $S_t$ which have reached v

Generic expanding search
• Near node sets $S_t$ with $S = \bigcup_t S_t$
• For all source nodes $\sigma \in S$
  • Create a SSSP iterator with source $\sigma$
• While more results required
  • Get next iterator and its next-nearest node v
  • Let t be the term for the iterator's source s
  • crossProduct = $\{s\} \times \prod_{t' \neq t} v.R_{t'}$
  • For each tuple of nodes in crossProduct
    • Create an answer tree rooted at v with paths to each source node in the tuple
  • Add s to $v.R_t$

Search example ("Vu Kleinberg")
[Figure: data graph of author nodes (Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos) and paper nodes ("Organizing Web pages by 'Information Unit'", "Authoritative sources in a hyperlinked environment", "A metric labeling problem") joined by "writes" and "cites" edges]

First response
[Figure: the same data graph, showing the first response]

Subgraph search: screenshot
http://www.cse.iitb.ac.in/banks/

Similarity, neighborhood, influence

Why are two nodes similar?
• What is/are the best paths connecting two nodes explaining why/how they are related?
  • Graph of co-starring, citation, telephone call, …
• Graph with nodes s and t; budget of b nodes
• Find "best" b nodes capturing relationship between s and t [FaloutsosMT2004]:
  • Proposing a definition of goodness
  • How to efficiently select best connections
[Figure: example connection subgraph with nodes Negroponte, Palmisano, Esther Dyson, Gerstner]

Simple proposals that do not work
• Shortest path
  • Pizza boy p gets same attention as g
• Network flow
  • s→a→b→t is as good as s→g→t
• Voltage
  • Connect +1V at s, ground t
  • Both g and p will be at +0.5V
• Observations
  • Must reward parallel paths
  • Must reward short paths
  • Must penalize/tax pizza boys
[Figure: graph on nodes s, a, b, g, p, t]

Resistive network with universal sink
• Connect +1V to s
• Ground t
• Introduce universal sink
  • Grounded
  • Connected to every node
• Universal sink is a "tax collector"
  • Penalizes pizza boys
  • Penalizes long paths
• Goodness of a path is the electric current it carries
[Figure: the same graph on s, a, b, g, p, t, with a grounded universal sink connected to every node]

Resistive network algorithm
• Ohm's law: $I(u,v) = C(u,v)\,[V(u) - V(v)] \quad \forall u, v$
• Kirchhoff's current law: $\forall v \neq s, t:\ \sum_u I(u,v) = 0$
• Boundary conditions (without sink): $V(s) = 1,\ V(t) = 0$
• Solution: $V(u) = \sum_v \frac{V(v)\,C(u,v)}{\sum_w C(u,w)}$, for $u \neq s, t$
• Here C(u,v) is the conductance from u to v
• Add grounded universal sink z with V(z) = 0
• Set $\forall u:\ C(u,z) = \alpha \sum_{w \neq z} C(u,w)$
• Display subgraph carrying high current

Distributions influenced via graphs
• Directed or undirected graph; nodes have
  • Observable properties
  • Some unobservable (random) state
• Edges indicate that distributions over unobservable states are coupled
• Many applications
  • Hypertext classification (topics are clustered)
  • Social network of customers buying products
  • Hierarchical classification
  • Labeling categorical sequences: part-of-speech and named-entity tagging, sense disambiguation, linkage analysis

Basic techniques
• Directed (acyclic) graphs: Bayesian network
• Markov networks
• (Loopy) belief propagation
• Conditional Markov networks
  • Avoid modeling joint distribution over observable and hidden properties of nodes
• Some computationally simple special cases

Hypertext classification
• Want to assign labels to Web pages
• Text on a single page may be too little or misleading
• Page is not an isolated instance by itself
• Problem setup
  • Web graph G = (V, E)
  • Node $u \in V$ is a page having text $u^T$
  • Edges in E are hyperlinks
  • Some nodes are labeled
  • Make collective guess at missing labels
• Probabilistic model? Benefits?

Graph labeling model
• Seek a labeling f of all unlabeled nodes so as to maximize
  $\Pr(f(V) \mid E, \{u^T : u \in V\}) = \frac{\Pr(f(V))\,\Pr(E, \{u^T : u \in V\} \mid f(V))}{\sum_{\phi} \Pr(\phi)\,\Pr(E, \{u^T : u \in V\} \mid \phi)}$
  • Luckily we don't need to worry about the denominator, which is the same for every candidate labeling
• Let $V^K$ be the nodes with known labels and $f(V^K)$ their known label assignments
• Let N(v) be the neighbors of v and $N^K(v) \subseteq N(v)$ be neighbors with known labels
• Markov assumption: f(v) is conditionally independent of rest of G given f(N(v))

Markov graph labeling
• Probability of labeling specific node v, given edges, text, partial labeling (with $N^U(v) = N(v) \setminus N^K(v)$ the neighbors with unknown labels):
  $\Pr(f(v) \mid E, V^T, f(V^K))$
  $\quad = \sum_{f(N^U(v)) \in \Omega_v} \Pr(f(v), f(N^U(v)) \mid E, V^T, f(V^K))$
  $\quad = \sum_{f(N^U(v)) \in \Omega_v} \Pr(f(N^U(v)) \mid E, V^T, f(V^K))\ \Pr(f(v) \mid f(N^U(v)), E, V^T, f(V^K))$
  • The sum is over $\Omega_v$, all possible labelings of unknown neighbors of v
• Markov assumption: label of v does not depend on unknown labels outside the set $N^U(v)$
• Circularity between f(v) and $f(N^U(v))$
• Some form of iterative Gibbs sampling or MCMC

Iterative labeling
• Label estimates of v in the next iteration:
  $\Pr^{(r+1)}(f(v) \mid E, V^T, f(V^K)) = \sum_{f(N^U(v)) \in \Omega_v} \left[\prod_{w \in N^U(v)} \Pr^{(r)}(f(w) \mid E, V^T, f(V^K))\right] \Pr(f(v) \mid f(N^U(v)), E, V^T, f(V^K))$
• Joint distribution over neighbors approximated as product of marginals
• Take the expectation of the last term over $N^U(v)$ labelings
• Sum over all possible $N^U(v)$ labelings still too expensive to compute
  • In practice, prune to most likely configurations
• Let us look at the last term more carefully

A generative node model
  $\Pr(f(v) \mid f(N^U(v)), V^T, E, f(V^K)) = \Pr(f(v) \mid f(N(v)), v^T)$
• By the Markov assumption, finally we need a distribution coupling f(v) and $v^T$ (the text on v) and f(N(v))
• Can use Bayes classifier as with ordinary text: estimate a parametric model for $\Pr(f(N(v)), v^T \mid f(v))$, the class-conditional joint distribution between the text on the page v and the labels of neighbors of v
• Must make naïve Bayes assumptions to keep practical

Pictorially…
• c = class, t = text, N = neighbors
• Text-only model: Pr[t | c]
• Using neighbors' text to judge my topic: Pr[t, t(N) | c] (hurts)
• Better model: Pr[t, c(N) | c]
• Estimate histograms and update based on neighbors' histograms

Generative model: results
• 9600 patents from 12 classes marked by USPTO
• Patents have text and cite other patents
• Expand test patent to include neighborhood
• 'Forget' fraction of neighbors' classes
[Figure: %Error (0 to 40) against %Neighborhood known (0 to 100) for the Text, Link, and Text+Link models]

Detour: generative vs. discriminative
• x = feature vector, y = label in {0,1} say
• Generative method models Pr(x|y) or Pr(x,y), "the generation of data x given label y"
  • Use Bayes rule to get Pr(y|x)
  • Inaccurate; x may have many dimensions
• Discriminative: directly estimates Pr(y|x)
  • Simple linear case: want w s.t. $w \cdot x > 0$ if y = 1 and $w \cdot x \le 0$ if y = 0
  • Cannot differentiate for w; instead pick some smooth "loss function": $\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x)}$
  • Works very well in practice

Discriminative node model
• OA(X) = direct "own" attributes of node X
• LD(X) = link-derived attributes of node X
  • Mode-link: most frequent label of neighbors(X)
  • Count-link: histogram of neighbor labels
  • Binary-link: 0/1 histogram of neighbor labels
• Local model with params $w_o$: $\Pr(c \mid w_o, OA(X)) = \frac{1}{\exp(-c\, w_o^T OA(X)) + 1}$
• Neighborhood model with params $w_l$: $\Pr(c \mid w_l, LD(X)) = \frac{1}{\exp(-c\, w_l^T LD(X)) + 1}$
• $\hat{C}(X) = \arg\max_c \Pr(c \mid OA(X))\,\Pr(c \mid LD(X))$
• Iterate as in generative case

Discriminative model: results [LuG2003]
• Binary-link and count-link outperform content-only at 95% confidence
• Better to separately estimate $w_l$ and $w_o$
• In+Out+Cocitation better than any subset for LD

Undirected Markov networks
• Clique $c \in C(G)$: a set of completely connected nodes
• Clique potential $\phi_c(V_c)$: a function over all possible configurations of nodes in $V_c$
• Decompose Pr(v) as $\frac{1}{Z} \prod_{c \in C(G)} \phi_c(V_c)$
• Parametric form: $\phi_c(v_c) = \exp(w_c \cdot f_c(v_c))$, where $w_c$ are the params of the model and $f_c$ are feature functions
[Figure: Markov network in which each instance has local feature variables and a label variable; edges between label variables show label coupling]

Conditional and relational networks
• x = vector of observed features at all nodes
• y = vector of labels
  $\Pr(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C(G)} \phi_c(x_c, y_c)$, where $Z(x) = \sum_{y'} \prod_{c \in C(G)} \phi_c(x_c, y'_c)$
  $\phi_c(x_c, y_c) = \exp(w_c \cdot F_c(x_c, y_c))$
• A set of "clique templates" specifying links to use
• Other features: "in the same HTML section"

"Toy problem": Hierarchical classification
• Obvious approaches
  • Flatten to leaf topics, losing hierarchy info
  • Level-by-level, compounding error probability
• Cascaded generative model: $\Pr(c \mid d) = \Pr(r \mid d)\,\Pr(c \mid d, r)$, with r the parent of c
  • Pr(c|d,r) estimated as Pr(c|r) Pr(d|c) / Z(r)
  • Estimate of Pr(d|c) makes naïve independence assumptions if d has high dimensionality
  • Pr(c|d,r) tends to 0/1 for large dimensions, and mistakes made at shallow levels become irrevocable

Global discriminative model
• Each node has an associated bit X
• Propose a parametric form: $\Pr(X_c = 1 \mid d, x_r) = \frac{\exp(w_c \cdot F(d, x_r))}{1 + \exp(w_c \cdot F(d, x_r))}$
• Each training instance sets one path to 1, all other nodes have X = 0
[Figure: the feature vector $F(d, x_r)$, of length 2T+1, built from the T-dimensional document vector d and the parent bit $x_r$ (0 or 1), alongside the weight vector $w_c$]

  %Accuracy     SVM 1v1   Ctree
  Reuters 1%    41.9      47.3
  Reuters 5%    68        71
  News20 1%     17.2      21.8

Network value of customers
• Customer = node X in graph, neighbors N
• M is a marketing action (promotion, coupon)
• Want to predict $\Pr(X_i \mid X^K, Y, M)$
  • $X_i$: response of customer i
  • $X^K$: known response of other customers
  • Y: product attributes
  • M: aggregate marketing action
• Broader objective is to design action M
• Again, we approximate as $\sum_{C(N_i^U)} \Pr(X_i \mid N_i, Y, M)\,\Pr(N_i^U \mid X^K, Y, M)$, a sum over unknown neighbor configurations $C(N_i^U)$

Network value, continued
• Let the action be boolean
• c is the cost of marketing
• $r_0$ is the revenue without marketing, $r_1$ with
• Expected lift in profit by marketing to customer i in isolation:
  $ELP_i(X^K, Y, M) = r_1 \Pr(X_i = 1 \mid X^K, Y, f_i^1(M)) - r_0 \Pr(X_i = 1 \mid X^K, Y, f_i^0(M)) - c$
  where $f_i^1(M)$ and $f_i^0(M)$ denote M with the action for customer i set to 1 and 0 respectively
• Global effect [equation not recoverable from the source]

Special case: sequential networks
• Text modeled as sequence of tokens drawn from a large but finite vocabulary
• Each token has attributes
  • Visible: allCaps, noCaps, hasXx, allDigits, hasDigit, isAbbrev, (part-of-speech, wnSense)
  • Not visible: part-of-speech, (isPersonName, isOrgName, isLocation, isDateTime), {starts|continues|ends}-noun-phrase
• Visible (symbols) and invisible (states) attributes of nearby tokens are dependent
• Application decides what is (not) visible
• Goal: Estimate invisible attributes

Hidden Markov model
• A generative sequential model for the joint distribution of states (s) and symbols (o)
  $s = s_0, s_1, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n$
  $\Pr(s, o) = \prod_{t=1}^{|o|} \Pr(s_{t-1} \to s_t)\,\Pr(s_t \to o_t)$
[Figure: chain of states $S_{t-1}, S_t, S_{t+1}$, each emitting a symbol $O_{t-1}, O_t, O_{t+1}$]

Using redundant token features
• Each o is usually a vector of features extracted from a token
• Might have high dependence/redundancy: hasCap, hasDigit, isNoun, isPreposition
• Parametric model for $\Pr(s_t \to o_t)$ needs to make naïve assumptions to be practical
• Overall joint model Pr(s,o) can be very inaccurate
• (Same argument as in naïve Bayes vs. SVM or maximum entropy text classifiers)

Discriminative graphical model
• Assume one-stage Markov dependence
• Propose direct parametric form for conditional probability of state sequence given symbol sequence
  $\Pr(s \mid o) = \frac{1}{\Pr(o)} \prod_{t=1}^{|o|} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t)$
• Model: $\Pr(s \mid o) = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \phi_s(s_t, s_{t-1})\,\phi_o(s_t, s_{t-1}, o, t)$
• Log-linear form: $\phi_o(t) = \exp\left(\sum_k \lambda_k f_k(s_{t-1}, s_t, o, t)\right)$
  • $f_k$: feature function; might depend on whole o
  • $\lambda_k$: parameters to fit
[Figure: chain of states $S_{t-1}, S_t, S_{t+1}$ with observations $O_{t-1}, O_t, O_{t+1}$]

Feature functions and parameters
• Maximize total conditional likelihood over all instances, penalizing large params:
  $L = \sum_{(s,o) \in D} \log\left[\frac{1}{Z(o)} \prod_{t=1}^{|o|} \exp\left(\sum_k \lambda_k f_k(s_t, s_{t-1}, o, t)\right)\right] - \sum_k \frac{\lambda_k^2}{2\sigma^2}$
• Find $\partial L / \partial \lambda_k$ for each k and perform a gradient-based numerical optimization
• Efficient for linear state dependence structure

Conditional vs. joint: results
• Penn Treebank: 45 tags, 1M words training data
• Example tagged sentence:
  The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB resilient/JJ once/IN
  it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB brief/JJ exposures/NNS to/TO it/PRP causing/VBG
  symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.

  Algorithm/Features               %Error   %OOVE*
  HMM with words                   5.69     45.99
  CRF with words                   5.55     48.05
  CRF with words and orthography   4.27     23.76
  * Out-of-vocabulary error

• Orthography: Use words, plus overlapping features: isCap, startsWithDigit, hasHyphen, endsWith… -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

Summary
• Graphs provide a powerful way to model many kinds of data, at multiple levels
  • Web pages, XML, relational data, images…
  • Words, senses, phrases, parse trees…
• A few broad paradigms for analysis
  • Factors affecting graph evolution over time
  • Eigen analysis, conductance, random walks
  • Coupled distributions between node attributes and graph neighborhood
• Several new classes of model estimation and inferencing algorithms

References
• [BrinP1998] The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW.
• [GoldmanSVG1998] Proximity search in databases. VLDB, 26-37.
• [ChakrabartiDI1998] Enhanced hypertext categorization using hyperlinks. SIGMOD.
• [BikelSW1999] An Algorithm that Learns What's in a Name. Machine Learning Journal.
• [GibsonKR1999] Clustering categorical data: An approach based on dynamical systems. VLDB.
• [Kleinberg1999] Authoritative sources in a hyperlinked environment. JACM 46.
• [CohnC2000] Probabilistically Identifying Authoritative Documents. ICML.
• [LempelM2000] The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33(1-6): 387-401.
• [RichardsonD2001] The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 14 (1441-1448).
• [LaffertyMP2001] Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
• [BorkarDS2001] Automatic text segmentation for extracting structured records. SIGMOD.
• [NgZJ2001] Stable algorithms for link analysis. SIGIR.
• [Hulgeri+2001] Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3): 22-32.
• [Hristidis+2002] DISCOVER: Keyword Search in Relational Databases. VLDB.
• [Agrawal+2002] DBXplorer: A system for keyword-based search over relational databases. ICDE.
• [TaskarAK2002] Discriminative probabilistic models for relational data.
• [Fagin2002] Combining fuzzy information: an overview. SIGMOD Record 31(2): 109-118.
• [Chakrabarti2002] Mining the Web: Discovering Knowledge from Hypertext Data.
• [Tomlin2003] A New Paradigm for Ranking Pages on the World Wide Web. WWW.
• [Haveliwala2003] Topic-Sensitive Pagerank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE TKDE.
• [LuG2003] Link-based Classification. ICML.
• [FaloutsosMT2004] Connection Subgraphs in Social Networks. SIAM-DM workshop.
• [PanYFD2004] GCap: Graph-based Automatic Image Captioning. MDDE/CVPR.
• [Balmin+2004] Authority-Based Keyword Queries in Databases using ObjectRank. VLDB.
• [BarYossefBKT2004] Sic transit gloria telae: Towards an understanding of the Web's decay. WWW2004.
