The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (IIT Bombay)
David M. Pennock (NEC Research Institute)

Graph structure of the Web
Over two billion nodes, two trillion links
Power-law degree distribution
  • $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
Looks like a “bow-tie” at large scale
[Figure: bow-tie diagram with IN, strongly connected core (SCC), and OUT regions; caption “This is the Web”]

The need for content-based models
Why does a radius-1 expansion help in topic distillation?
[Figure: Query, Search engine, Root set]
Why does topic-specific focused crawling work?
[Figure: Crawler, Classifier; check frontier topic, prune if irrelevant]
Why is a global PageRank useful for specific queries?
  • $p(v) = \frac{d}{N} + (1 - d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$
  • First term: uniform jump; second term: walk to out-neighbor

The need for content-based models
How are different topics linked to each other?
Are topic directories representative of Web topic populations?
Are standard collections (e.g., TREC W10G) representative of Web topics?
[Figure caption: “This is the Web with topics”]

How to characterize “topics”
Web directories: most natural choice
Started with http://dmoz.org
Keep pruning until all leaf topics have enough (>300) samples
Approx 120k sample URLs
Flatten to approx 482 topics
Train text classifier (Rainbow)
Characterize new document d as a vector of probabilities $p_d = (\Pr(c \mid d) \;\forall c)$
[Figure: test doc fed to classifier, which outputs a topic/probability table: Arts 0.1, Computers 0.3, Science 0.6]

Critique and defense
Cannot capture fine-grained or emerging topics
  • Emerging topics most often specialize existing broad topics
  • Broad topics rarely change
Classifier may be inaccurate
  • Adequate if much better than random guessing of topic label
  • Can compensate errors using held-out validation data

Background topic distribution
What fraction of Web pages are about Health?
Sampling via random walk
  • PageRank walk (Henzinger et al.)
  • Undirected regular walk (Bar-Yossef et al.)
      • Make graph undirected
      • Add self-loops so that all nodes have the same degree
      • Sample with large stride
      • Collect topic histograms

Convergence
Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
  • L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in [0, 2]
  • Below 0.05 to 0.2 within 300 to 400 physical pages
[Figure: distribution difference vs. #hops for stride = 30k and stride = 75k; background distribution over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]

Biases in topic directories
Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz
Dmoz over-represents
  • Games.Video_Games
  • Society.People
  • Arts.Celebrities
  • ...Education.Colleges
  • ...Travel.Reservations
Dmoz under-represents
  • …WWW…Directories!
  • Sports.Hockey
  • Society.Philosophy
  • Education…K12…
  • Recreation…Camping

Topic-specific degree distribution
Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if power-law should be upheld
[Figure panels: Intra-topic linkage; Inter-topic linkage]

Random forward walk without jumps
[Figure: L_1 distance vs. wander hops (0 to 20) for /Arts/Music and /Sports/Soccer; curves “From background” and “From hop0”]
Sampling walk is designed to mix topics well
How about walking forward without jumping?
  • Start from a page $u_0$ on a specific topic
  • Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  • Compare $(\Pr(c \mid u_i) \;\forall c)$ with $(\Pr(c \mid u_0) \;\forall c)$ and with the background distribution

Observations and implications
Forward walks wander away from starting topic slowly
But do not converge to the background distribution
Global PageRank ok also for topic-specific queries
  • Jump parameter d = 0.1 to 0.2
  • Topic drift not too bad within path length of 5 to 10
  • Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works
[Figure: PageRank walk; w.p. d jump to a random node, w.p. (1-d) jump to an out-neighbor u.a.r.; labels “Jump” and “High-prestige node”]

Citation matrix
Given a page is about topic i, how likely is it to link to topic j?
  • Matrix C[i,j] = probability that page about topic i links to page about topic j
  • Soft counting: C[i,j] += Pr(i|u)Pr(j|v) for each link from page u to page v
Applications
  • Classifying Web pages into topics
  • Focused crawling for topic-specific pages
  • Finding relations between topics in a directory

Citation, confusion, correction
Classifier’s confusion on held-out documents can be used to correct confusion matrix
[Figure: citation matrix (from topic vs. to topic) and confusion matrix (true topic vs. guessed topic) over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]

Fine-grained views of citation
Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers
Clear block-structure derived from coarse-grain topics
Strong diagonals reflect tightly-knit topic communities

Concluding remarks
A model for content-based communities
  • New characterization and measurement of topical locality on the Web
  • How to set the PageRank jump parameter?
  • Topical stability of topic distillation
  • Better crawling and classification
A tool for Web directory maintenance
  • Fair sampling and representation of topics
  • Block-structure and off-diagonals
  • Taxonomy inversion
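
Sketch: soft counting for the citation matrix
The soft-counting rule from the citation matrix slide, C[i,j] += Pr(i|u)Pr(j|v), can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the function name `soft_citation_matrix`, the toy pages and probabilities, and the final row normalization (so that row i reads as a probability over target topics) are all assumptions made for the example.

```python
import numpy as np

def soft_citation_matrix(topic_probs, links):
    """Estimate C[i, j] by soft counting over links u -> v.

    topic_probs: dict mapping page id -> 1-D array of Pr(c | page)
    links: iterable of (u, v) page-id pairs, one per hyperlink u -> v
    """
    num_topics = len(next(iter(topic_probs.values())))
    counts = np.zeros((num_topics, num_topics))
    for u, v in links:
        # C[i, j] += Pr(i | u) * Pr(j | v) for every topic pair (i, j)
        counts += np.outer(topic_probs[u], topic_probs[v])
    # Assumption: row-normalize so that row i sums to 1 over target topics.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical toy data: three pages, two topics.
topic_probs = {
    "a": np.array([0.9, 0.1]),
    "b": np.array([0.8, 0.2]),
    "c": np.array([0.1, 0.9]),
}
links = [("a", "b"), ("a", "c"), ("c", "c")]
C = soft_citation_matrix(topic_probs, links)
print(C)
```

Using the full probability vectors rather than a single hard label per page means classifier uncertainty is spread across matrix entries instead of being forced into one cell.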