Download p338a - CSE, IIT Bombay

The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (IIT Bombay)
David M. Pennock (NEC Research Institute)

Graph structure of the Web
- Over two billion nodes, two trillion links
- Power-law degree distribution
  • $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
- Looks like a "bow-tie" at large scale
[Figure: bow-tie diagram with IN, strongly connected core (SCC), and OUT regions. Caption: "This is the Web"]

The need for content-based models
- Why does a radius-1 expansion help in topic distillation?
  [Figure labels: Query, Search engine, Root set, Crawler]
- Why does topic-specific focused crawling work?
  [Figure labels: Classifier; check frontier topic, prune if irrelevant]
- Why is a global PageRank useful for specific queries?
  • $p(v) = \frac{d}{N} + (1 - d) \sum_{u \to v} \frac{p(u)}{\text{OutDegree}(u)}$
  [Figure labels: Uniform jump; Walk to out-neighbor]

The need for content-based models
- How are different topics linked to each other?
- Are topic directories representative of Web topic populations?
- Are standard collections (e.g., TREC W10G) representative of Web topics?
[Figure caption: "This is the Web with topics"]

How to characterize "topics"
- Web directories: most natural choice
- Started with http://dmoz.org
- Keep pruning until all leaf topics have enough (>300) samples
- Approx. 120k sample URLs
- Flatten to approx. 482 topics
- Train text classifier (Rainbow)
- Characterize new document d as a vector of probabilities $p_d = (\Pr(c \mid d)\ \forall c)$
[Figure: a test document is passed to the classifier, which outputs a topic/probability table, e.g. Arts 0.1, Computers 0.3, Science 0.6]

Critique and defense
- Cannot capture fine-grained or emerging topics
  • Emerging topics most often specialize existing broad topics
  • Broad topics rarely change
- Classifier may be inaccurate
  • Adequate if much better than random guessing of topic label
  • Can compensate errors using held-out validation data

Background topic distribution
- What fraction of Web pages are about Health?
- Sampling via random walk
  • PageRank walk (Henzinger et al.)
  • Undirected regular walk (Bar-Yossef et al.)
    - Make graph undirected
    - Add self-loops so that all nodes have the same degree
    - Sample with large stride
    - Collect topic histograms

Convergence
- Start from pairs of diverse topics
- Two random walks, sample from each walk
- Measure distance between topic distributions
  • L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in [0, 2]
  • Below 0.05 to 0.2 within 300 to 400 physical pages
[Figure: "Distribution difference" versus #hops, for Stride=30k and Stride=75k]
[Figure: "Background distribution" over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]

Biases in topic directories
- Use Dmoz to train a classifier
- Sample the Web
- Classify samples
- Diff Dmoz topic distribution from Web sample topic distribution
- Report maximum deviation in fractions
- NOTE: Not exactly Dmoz

Dmoz over-represents
- Games.Video_Games
- Society.People
- Arts.Celebrities
- ...Education.Colleges
- ...Travel.Reservations

Dmoz under-represents
- …WWW…Directories!
- Sports.Hockey
- Society.Philosophy
- Education…K12…
- Recreation…Camping

Topic-specific degree distribution
- Preferential attachment: connect u to v with probability proportional to the degree of v, regardless of topic
- More realistic: u has a topic, and links to v with related topics
- Unclear if power-law should be upheld
[Figure panels: Intra-topic linkage; Inter-topic linkage]

Random forward walk without jumps
[Figure: L_1 distance versus wander hops (0 to 20) for /Arts/Music and /Sports/Soccer; series: "From background" and "From hop0"]
- Sampling walk is designed to mix topics well
- How about walking forward without jumping?
  • Start from a page $u_0$ on a specific topic
  • Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  • Compare $(\Pr(c \mid u_i)\ \forall c)$ with $(\Pr(c \mid u_0)\ \forall c)$ and with the background distribution

Observations and implications
- Forward walks wander away from starting topic slowly
- But do not converge to the background distribution
- Global PageRank ok also for topic-specific queries
  • Jump parameter d = 0.1 to 0.2
  • Topic drift not too bad within path length of 5 to 10
  • Prestige conferred mostly by same-topic neighbors
- Also explains why focused crawling works
[Figure: with probability d, jump to a random node; with probability (1 - d), jump to an out-neighbor chosen uniformly at random. Labels: Jump, High-prestige node]

Citation matrix
- Given a page is about topic i, how likely is it to link to topic j?
  • Matrix C[i,j] = probability that page about topic i links to page about topic j
  • Soft counting: C[i,j] += Pr(i|u) Pr(j|v) for each link from page u to page v
- Applications
  • Classifying Web pages into topics
  • Focused crawling for topic-specific pages
  • Finding relations between topics in a directory

Citation, confusion, correction
[Figure: matrices with axes "From topic" and "To topic" over Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports; and a matrix with axes "Guessed topic" and "True topic"]
- Classifier's confusion on held-out documents can be used to correct confusion matrix

Fine-grained views of citation
- Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers
- Clear block-structure derived from coarse-grain topics
- Strong diagonals reflect tightly-knit topic communities

Concluding remarks
- A model for content-based communities
  • New characterization and measurement of topical locality on the Web
  • How to set the PageRank jump parameter?
  • Topical stability of topic distillation
  • Better crawling and classification
- A tool for Web directory maintenance
  • Fair sampling and representation of topics
  • Block-structure and off-diagonals
  • Taxonomy inversion
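
The soft-counting rule from the "Citation matrix" slide, C[i,j] += Pr(i|u) Pr(j|v), can be illustrated with a small sketch. This is not the authors' code: the function name `soft_citation_matrix`, the toy pages, and their topic probabilities are hypothetical, and the final row normalization is an assumption made so that each row reads as a probability distribution over target topics.

```python
def soft_citation_matrix(links, topic_probs, num_topics):
    """Soft-count links between topics, then normalize each row.

    links: iterable of (u, v) page pairs, meaning page u links to page v.
    topic_probs: dict mapping a page to its list of Pr(topic | page).
    """
    counts = [[0.0] * num_topics for _ in range(num_topics)]
    for u, v in links:
        pu, pv = topic_probs[u], topic_probs[v]
        for i in range(num_topics):
            for j in range(num_topics):
                # Soft counting: C[i][j] += Pr(i|u) * Pr(j|v)
                counts[i][j] += pu[i] * pv[j]

    # Assumption: divide each row by its total so row i approximates
    # Pr(target topic j | source topic i). Empty rows stay all zero.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([x / total if total > 0 else 0.0 for x in row])
    return matrix


# Hypothetical toy data: two topics, three pages.
example_probs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.5, 0.5],
}
example_links = [("a", "b"), ("a", "c"), ("c", "a")]
example_matrix = soft_citation_matrix(example_links, example_probs, 2)
```

With hard topic labels this reduces to ordinary link counting between topics; the soft version lets a page that the classifier is unsure about contribute fractionally to several rows and columns.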

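
The concluding question about the PageRank jump parameter refers to the random surfer described earlier: with probability d jump to a uniformly random node, otherwise follow a random out-link. The sketch below iterates that recurrence on a toy graph. It is illustrative only: the function name `pagerank`, the toy graph, and the iteration count are hypothetical, and spreading the mass of dangling nodes uniformly is an assumption the slides do not address.

```python
def pagerank(out_links, d=0.15, iterations=100):
    """Iterate p(v) = d/N + (1 - d) * sum over u->v of p(u) / OutDegree(u).

    out_links: dict mapping every node to a list of its out-neighbors.
    d: jump probability (the slides suggest roughly 0.1 to 0.2).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: d / n for v in nodes}  # uniform jump term
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumption: a node with no out-links spreads its
                # mass uniformly over all nodes.
                share = (1 - d) * p[u] / n
                for v in nodes:
                    nxt[v] += share
        p = nxt
    return p


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph, d=0.15)
```

Raising d makes the scores more uniform, while lowering it lets prestige flow further along links, which is where the topic-drift observation (limited drift within 5 to 10 hops) becomes relevant.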