The Structure of Broad Topics on the Web
Soumen Chakrabarti
Mukul M. Joshi
Kunal Punera
(IIT Bombay)
David M. Pennock
(NEC Research Institute)
Graph structure of the Web
■ Over two billion nodes, two trillion links
■ Power-law degree distribution
• Pr(degree = k) ∝ 1/k^{2.1}
■ Looks like a “bow-tie” at large scale
[Figure: bow-tie diagram with regions IN, strongly connected core (SCC), and OUT; caption: “This is the Web”]
The need for content-based models
■ Why does a radius-1 expansion help in topic distillation?
[Figure labels: Query, Search engine, Root set]
■ Why does topic-specific focused crawling work?
[Figure labels: Crawler, Classifier; check frontier topic, prune if irrelevant]
■ Why is a global PageRank useful for specific queries?
$p(v) = \frac{d}{N} + (1 - d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$
[Figure labels: Uniform jump, Walk to out-neighbor]
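As a rough illustration of the PageRank recurrence on this slide, the following Python sketch runs power iteration on a tiny made-up graph. The function name `pagerank`, the toy graph, the default d = 0.15, the iteration count, and the treatment of dangling nodes (spreading their mass uniformly) are all assumptions for illustration and are not taken from the slides.

```python
def pagerank(out_links, d=0.15, iterations=100):
    """Power iteration for p(v) = d/N + (1 - d) * sum_{u -> v} p(u) / OutDegree(u).

    out_links maps each node to a list of its out-neighbors.
    d is the probability of a uniform jump (assumed default).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: d / n for v in nodes}  # uniform-jump term d/N
        for u in nodes:
            targets = out_links[u]
            if targets:
                # Walk to an out-neighbor chosen uniformly at random.
                share = (1.0 - d) * p[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumption: a dangling node spreads its mass uniformly.
                share = (1.0 - d) * p[u] / n
                for v in nodes:
                    nxt[v] += share
        p = nxt
    return p


# Hypothetical four-page graph, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

In this toy graph, page "d" has no in-links, so its score stays at the jump term d/N, while "c" collects prestige from three in-neighbors.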
The need for content-based models
■ How are different topics linked to each other?
■ Are topic directories representative of Web topic populations?
■ Are standard collections (e.g., TREC WT10g) representative of Web topics?
[Figure: “This is the Web with topics”]
How to characterize “topics”
■ Web directories: most natural choice
■ Started with http://dmoz.org
■ Keep pruning until all leaf topics have enough (>300) samples
■ Approx 120k sample URLs
■ Flatten to approx 482 topics
■ Train text classifier (Rainbow)
■ Characterize new document d as a vector of probabilities p_d = (Pr(c|d) ∀c)
[Figure: Test doc → Classifier → table of Topic and Prob, e.g. Arts 0.1, Computers 0.3, Science 0.6]
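To make the idea of a document as a vector of topic probabilities concrete, here is a minimal Python sketch. It uses a small Laplace-smoothed multinomial naive Bayes model as a stand-in; it is not the Rainbow classifier, and the function names, toy topics, and training tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (topic, tokens). Returns the counts the model needs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, tokens in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def topic_vector(model, tokens):
    """Return {c: Pr(c | d)} for a tokenized document d."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)  # log prior
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[topic][tok] + 1) / (total_words + len(vocab))
                )
        log_scores[topic] = score
    # Normalize in log space for numerical stability.
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Hypothetical toy training data; the slides use about 120k directory URLs.
training = [
    ("Arts", ["music", "painting", "gallery"]),
    ("Computers", ["software", "linux", "code"]),
    ("Science", ["physics", "experiment", "theory"]),
]
model = train_naive_bayes(training)
p_d = topic_vector(model, ["linux", "code", "theory"])
```

The resulting `p_d` sums to 1 over the topics, which is the property the later slides rely on when comparing documents and distributions.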
Critique and defense
■ Cannot capture fine-grained or emerging topics
• Emerging topics most often specialize existing broad topics
• Broad topics rarely change
■ Classifier may be inaccurate
• Adequate if much better than random guessing of topic label
• Can compensate for errors using held-out validation data
Background topic distribution
■ What fraction of Web pages are about Health?
■ Sampling via random walk
• PageRank walk (Henzinger et al.)
• Undirected regular walk (Bar-Yossef et al.)
■ Make graph undirected
■ Add self-loops so that all nodes have the same degree
■ Sample with large stride
■ Collect topic histograms
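The undirected regular walk can be sketched as follows. This is a simplified illustration that assumes the whole graph fits in memory, which is not the case for a real Web walk; the function name, the toy graph, and the stride value are hypothetical. Padding every node with self-loops up to the maximum degree is implemented by staying put with probability 1 - deg(v)/max_degree.

```python
import random


def regular_walk_samples(out_links, start, stride, num_samples, seed=0):
    """Walk an undirected, self-loop-padded version of the graph and
    keep every stride-th node as a sample."""
    rng = random.Random(seed)
    # Make the graph undirected.
    neighbors = {u: set() for u in out_links}
    for u, targets in out_links.items():
        for v in targets:
            if u != v:
                neighbors.setdefault(v, set()).add(u)
                neighbors[u].add(v)
    adjacency = {u: sorted(vs) for u, vs in neighbors.items()}
    max_degree = max(len(vs) for vs in adjacency.values())

    samples = []
    current = start
    for step in range(1, stride * num_samples + 1):
        nbrs = adjacency[current]
        # Self-loops pad each node to max_degree: follow a real edge with
        # probability deg/max_degree, otherwise stay at the current node.
        if rng.random() < len(nbrs) / max_degree:
            current = rng.choice(nbrs)
        if step % stride == 0:
            samples.append(current)
    return samples


# Hypothetical toy graph; a real walk would use a much larger stride.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
samples = regular_walk_samples(toy_graph, start="a", stride=20, num_samples=50)
```

Because every node has the same degree after padding, the walk's stationary distribution is uniform over nodes, so a topic histogram over the samples estimates the background topic distribution.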
Convergence
[Figure: convergence plot; labels: Distribution difference, #hops, Stride=30k, Stride=75k]
[Figure: background distribution over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
■ Start from pairs of diverse topics
■ Two random walks, sample from each walk
■ Measure distance between topic distributions
• L1 distance |p1 - p2| = Σ_c |p1(c) - p2(c)|, in [0, 2]
• Below 0.05 to 0.2 within 300 to 400 physical pages
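The L1 distance between two topic distributions is simple to compute. In this short Python sketch the distributions are dictionaries from topic name to probability; the function name and the example values are made up for illustration.

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; ranges from 0 to 2
    when both inputs are probability distributions."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)


# Made-up distributions, for illustration only.
walk_1 = {"Arts": 0.5, "Science": 0.5}
walk_2 = {"Arts": 0.25, "Science": 0.25, "Health": 0.5}
distance = l1_distance(walk_1, walk_2)
```

Identical distributions give 0, and distributions with no topic in common give the maximum value of 2.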
Biases in topic directories
■ Use Dmoz to train a classifier
■ Sample the Web
■ Classify samples
■ Diff Dmoz topic distribution from Web sample topic distribution
■ Report maximum deviation in fractions
■ NOTE: Not exactly Dmoz
Dmoz over-represents:
Games.Video_Games
Society.People
Arts.Celebrities
...Education.Colleges
...Travel.Reservations
Dmoz under-represents:
…WWW…Directories!
Sports.Hockey
Society.Philosophy
Education…K12…
Recreation…Camping
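One way to read the comparison step is as a per-topic difference of fractions between the directory and the Web sample. The Python sketch below takes that reading, which is an assumption; the function name, the placeholder topic names, and the numbers are invented and are not results from the study.

```python
def representation_bias(directory_dist, web_dist):
    """Return per-topic deviation (directory fraction minus Web fraction)
    plus the most over- and under-represented topics."""
    topics = set(directory_dist) | set(web_dist)
    deviation = {
        c: directory_dist.get(c, 0.0) - web_dist.get(c, 0.0) for c in topics
    }
    most_over = max(deviation, key=deviation.get)
    most_under = min(deviation, key=deviation.get)
    return deviation, most_over, most_under


# Invented fractions over placeholder topics, for illustration only.
directory = {"topic_a": 0.5, "topic_b": 0.25, "topic_c": 0.25}
web_sample = {"topic_a": 0.25, "topic_b": 0.5, "topic_c": 0.25}
deviation, most_over, most_under = representation_bias(directory, web_sample)
```

A positive deviation means the directory holds a larger share of that topic than the sampled Web does; a negative one means the topic is under-represented.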
Topic-specific degree distribution
■ Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
■ More realistic: u has a topic, and links to v with related topics
■ Unclear if power-law should be upheld
[Figure: intra-topic linkage and inter-topic linkage]
Random forward walk without jumps
[Figure: two panels, /Arts/Music and /Sports/Soccer, plotting L1 distance against wander hops (0 to 20), with curves “From background” and “From hop0”]
■ Sampling walk is designed to mix topics well
■ How about walking forward without jumping?
• Start from a page u0 on a specific topic
• Forward random walk (u0, u1, …, ui, …)
• Compare (Pr(c|ui) ∀c) with (Pr(c|u0) ∀c) and with the background distribution
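The forward-walk experiment can be sketched in a few lines of Python. The sketch assumes topic vectors have already been computed for every page; the function names, the three-page graph, and all probabilities are hypothetical and chosen only to make the trace easy to follow.

```python
import random


def l1_distance(p1, p2):
    """Sum over topics of |p1(c) - p2(c)|."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)


def forward_walk_drift(out_links, topic_vectors, background, start, hops, seed=0):
    """Follow out-links at random (no jumps) and record, at each hop,
    the L1 distance from the start page and from the background."""
    rng = random.Random(seed)
    origin = topic_vectors[start]
    current = start
    trace = []
    for hop in range(hops + 1):
        vec = topic_vectors[current]
        trace.append((hop, l1_distance(vec, origin), l1_distance(vec, background)))
        targets = out_links.get(current, [])
        if not targets:
            break  # dead end: the walk cannot continue without a jump
        current = rng.choice(targets)
    return trace


# Hypothetical pages and topic vectors, for illustration only.
toy_links = {"u0": ["u1"], "u1": ["u2"], "u2": ["u0"]}
toy_vectors = {
    "u0": {"Music": 1.0, "Other": 0.0},
    "u1": {"Music": 0.75, "Other": 0.25},
    "u2": {"Music": 0.5, "Other": 0.5},
}
toy_background = {"Music": 0.25, "Other": 0.75}
trace = forward_walk_drift(toy_links, toy_vectors, toy_background, "u0", hops=2)
```

Each entry of `trace` is (hop, distance from hop 0, distance from background), which corresponds to the two curves named in the figure.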
Observations and implications
■ Forward walks wander away from starting topic slowly
■ But do not converge to the background distribution
■ Global PageRank ok also for topic-specific queries
• Jump parameter d = 0.1 to 0.2
• Topic drift not too bad within path length of 5 to 10
• Prestige conferred mostly by same-topic neighbors
■ Also explains why focused crawling works
[Figure: PageRank walk; w.p. d jump to a random node, w.p. (1-d) jump to an out-neighbor u.a.r.; labels: Jump, High-prestige node]
Citation matrix
■ Given a page is about topic i, how likely is it to link to topic j?
• Matrix C[i,j] = probability that page about topic i links to page about topic j
• Soft counting: for each link u → v, C[i,j] += Pr(i|u) Pr(j|v)
■ Applications
• Classifying Web pages into topics
• Focused crawling for topic-specific pages
• Finding relations between topics in a directory
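Soft counting can be illustrated with a short Python sketch. For every link from u to v, each cell C[i,j] receives Pr(i|u) times Pr(j|v). Normalizing each row afterwards, so that a row reads as a distribution over target topics, is an assumption added here; the function name, pages, and topic vectors are hypothetical.

```python
def citation_matrix(links, topic_vectors, topics):
    """Soft-count topic-to-topic citations over a list of (u, v) links."""
    counts = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in links:
        pu, pv = topic_vectors[u], topic_vectors[v]
        for i in topics:
            for j in topics:
                counts[i][j] += pu.get(i, 0.0) * pv.get(j, 0.0)
    # Assumption: row-normalize so C[i][j] estimates
    # Pr(target is about j | source is about i).
    matrix = {}
    for i in topics:
        row_total = sum(counts[i].values())
        matrix[i] = {
            j: (counts[i][j] / row_total if row_total else 0.0) for j in topics
        }
    return matrix


# Hypothetical pages with made-up topic vectors.
toy_vectors = {"a": {"X": 1.0}, "b": {"X": 1.0}, "c": {"Y": 1.0}}
toy_links = [("a", "b"), ("a", "c")]
C = citation_matrix(toy_links, toy_vectors, ["X", "Y"])
```

With hard topic labels this reduces to ordinary link counting; the soft version spreads each link across topics in proportion to the classifier's confidence.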
Citation, confusion, correction
[Figure: citation matrix over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports; axes: From topic, To topic]
Classifier’s confusion on held-out documents can be used to correct the citation matrix
[Figure: matrix panels; axes: From topic, To topic; Guessed topic, True topic]
Fine-grained views of citation
Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers
Clear block-structure derived from coarse-grain topics
Strong diagonals reflect tightly-knit topic communities
Concluding remarks
■ A model for content-based communities
• New characterization and measurement of topical locality on the Web
• How to set the PageRank jump parameter?
• Topical stability of topic distillation
• Better crawling and classification
■ A tool for Web directory maintenance
• Fair sampling and representation of topics
• Block-structure and off-diagonals
• Taxonomy inversion