BINGO!: Bookmark-Induced
Gathering of Information
Sergej Sizov, Martin Theobald,
Stefan Siersdorfer, Gerhard Weikum
University of the Saarland
Germany
Part I
System Overview
Motivation
Web search engines
  The vector space model
  Link analysis & authority ranking
Information demands
  Mass queries (“madonna tour”)
  Needle-in-a-haystack queries (“solidarity eisler”)
Overview (II)
[Figure: topic hierarchy. Labels: WWW, ROOT, Semistructured Data, Web Retrieval, DB Core Technology, Data Mining, Networking, Workflow and E-Services, XML]
Focused Crawling
[Figure: focused-crawling loop. Labels: Crawler, Queue, Classifier, Results]
Focused Crawling (2)
Key aspects:
  the mathematical model and algorithm that are used for the classifier (e.g., Naive Bayes vs. SVM)
  the feature set upon which the classifier makes its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms)
  the quality of the training data
Focused Crawling (3)
[Figure: focused-crawling loop extended with link analysis and re-training. Labels: Crawler, SVM Classifier, Queue, HITS, Hubs, Authorities, Re-Training, SVM Archetypes]
System Overview
[Figure: system architecture. Labels: WWW, Crawler, URL Queue, Document Analyzer, Docs, Classifier, Feature Selection, Feature Vectors, Bookmarks, Ontology Index, Adaptive Re-Training, Link Analyzer, Training Docs, Hubs & Authorities]
Part II
System Components
Focus Manager
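Focusing strategies
  Depth-first (df):
  $$P_{df}(j) = \left( -\mathrm{depth}(j) + \frac{\mathrm{pos}(j)}{\mathrm{links}(j)} \right) \times (\mathrm{confidence}(j) + 1)^2$$
  Breadth-first (bf):
  $$P_{bf}(j) = \frac{\mathrm{depth}(j) + \mathrm{pos}(j) / \mathrm{links}(j)}{(\mathrm{confidence}(j) + 1)^2}$$
Strong focus (learning phase)
Soft focus (harvesting phase)
Tunneling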
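As a rough illustration of how such priorities could drive the URL queue, the sketch below reads the depth-first priority as (-depth + pos/links) × (confidence + 1)^2 and the breadth-first priority as (depth + pos/links) / (confidence + 1)^2. Both that reading of the slide and the convention that the smallest value is dequeued first are assumptions, as are the function names and the sample links.

```python
import heapq

def priority_bf(depth, pos, links, confidence):
    # Breadth-first reading (assumed): shallow links and confident parents get small values.
    # Assumes confidence > -1 so the denominator is never zero.
    return (depth + pos / links) / (confidence + 1) ** 2

def priority_df(depth, pos, links, confidence):
    # Depth-first reading (assumed): deep links and confident parents get small (negative) values.
    return (-depth + pos / links) * (confidence + 1) ** 2

def crawl_order(candidates, priority):
    # Assumption: the queue pops the smallest priority value first.
    heap = [(priority(d, p, l, c), url) for url, d, p, l, c in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Hypothetical links: (url, depth, position on parent page, links on parent page, parent confidence)
candidates = [
    ("u1", 1, 1, 2, 0.4),
    ("u2", 1, 2, 2, 0.85),
    ("u3", 2, 1, 1, 0.6),
]

bf_order = crawl_order(candidates, priority_bf)
df_order = crawl_order(candidates, priority_df)
```

Under these assumptions the breadth-first queue visits both depth-1 links before the depth-2 link, while the depth-first queue starts with the deepest link.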
Focus Manager (2)
Sample URL Prioritization
[Figure: sample link graph with nodes 1 to 10. Recoverable annotations: node 1 (confidence = 0.4, topic = A), node 2 (confidence = 0.85, topic = A), node 3 (confidence = 0.6, topic = B), node 4 (confidence = 0.3, topic = A)]
DF strong order: 1–2–5–3–6–4–9–10 ..
BF strong order: 1–2–5–3–4–6–9–10 ..
DF soft order:   1–2–5–6–3–7–8–4–9–10 ..
BF soft order:   1–2–5–3–6–4–7–8–9–10 ..
Feature Selection
Mutual Information (MI) criterion:
$$MI(X_i, V_j) = P[X_i \wedge V_j] \, \log \frac{P[X_i \wedge V_j]}{P[X_i] \, P[V_j]}$$
$$MI(X_i, V_j) = \frac{A}{N} \, \log \frac{A \cdot N}{(A + B)(A + C)}$$
A is the number of documents in $V_j$ containing $X_i$,
B is the number of documents with $X_i$ in "competitive" topics,
C is the number of documents in $V_j$ without $X_i$,
N is the overall number of documents in $V_j$ and its competitive topics.
Time complexity: O(n) + O(mk) for n documents, m terms, and k competitive topics.
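The count-based form of the MI criterion can be sketched in a few lines. The natural logarithm, the zero score for A = 0, the function names, and the sample counts are assumptions made for illustration; the slide does not specify them.

```python
import math

def mutual_information(a, b, c, n):
    # MI(X_i, V_j) = (A/N) * log(A*N / ((A+B)*(A+C))), natural log assumed.
    if a == 0:
        return 0.0  # assumed convention: a term that never occurs in the topic scores 0
    return (a / n) * math.log((a * n) / ((a + b) * (a + c)))

def top_features(counts, n, k):
    # counts maps each term to its (A, B, C) document counts for one topic.
    scored = {t: mutual_information(a, b, c, n) for t, (a, b, c) in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical counts for a topic with 40 documents and 60 documents in competing topics.
counts = {
    "deadlock": (30, 2, 10),   # frequent in the topic, rare elsewhere
    "size":     (20, 40, 20),  # more frequent in the competing topics
    "the":      (40, 60, 0),   # occurs everywhere, carries no information
}
best = top_features(counts, n=100, k=2)
```

A term that is spread evenly over all topics gets a weight of zero, while a term concentrated in the topic gets a clearly positive weight.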
Feature Selection (2)
Top features for the topic "DB Core Technology" with regard to tf*idf (left) and MI (right)

Feature      tf*idf score    Feature      MI weight
below        1.4927          storag       0.1428
et           1.2778          modifi       0.1258
graph        1.2446          sql          0.1209
involv       1.0406          disk         0.1179
accomplish   0.9491          pointer      0.1150
backup       0.8613          deadlock     0.1001
command      0.8567          redo         0.1001
exactli      0.8112          implement    0.0963
feder        0.7764          correctli    0.0911
histor       0.6822          size         0.0911
Classifier
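[Figure: training vectors in the (x1, x2) plane, with regions V and ¬V separated by the hyperplane $\vec{w} \cdot \vec{x} + b = 0$. Further labels: δ, σ, and an unclassified point marked "?"]
Input: n training vectors with components (x1, ..., xm, C) and C = +1 or C = -1
Training: Compute $\vec{w} \cdot \vec{x} + b = 0$
Classification: Check $\vec{w} \cdot \vec{y} + b > 0$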
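The classification step amounts to one dot product per document. The sketch below assumes a model (w, b) that has already been trained; the function names and the numeric values are made up for illustration.

```python
def decision_value(w, y, b):
    # Value of w . y + b for a feature vector y with m components.
    return sum(wi * yi for wi, yi in zip(w, y)) + b

def classify(w, y, b):
    # +1 if y lies on the positive side of the hyperplane, otherwise -1.
    return +1 if decision_value(w, y, b) > 0 else -1

# Hypothetical trained model with m = 2 features.
w, b = [2.0, -1.0], -0.5
label_pos = classify(w, [1.0, 0.5], b)
label_neg = classify(w, [0.0, 1.0], b)
```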
Hierarchical Classification
Recursive classification by the taxonomy tree.
Decisions based on topic-specific feature spaces
[Figure: taxonomy tree with nodes ROOT, Semistructured Data, Web Retrieval, DB Core Technology, Networking, Workflow and E-Services, Data Mining, XML. Classifier confidences shown in the tree, in extraction order: 0.8, 0.1, 0.2, 0.2, -0.5, -0.7, 0.4]
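One way to picture the recursive descent is the sketch below. It assumes that each topic owns a linear classifier over its own feature space and that a document moves to the child with the highest positive confidence, stopping when no child is positive. That decision rule, the tree shape, the weights, and all names are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    name: str
    weights: dict = field(default_factory=dict)  # keys form this topic's own feature space
    bias: float = 0.0
    children: list = field(default_factory=list)

    def confidence(self, doc):
        # Linear decision value computed only over this topic's selected features.
        return sum(w * doc.get(f, 0.0) for f, w in self.weights.items()) + self.bias

def classify(node, doc, path=None):
    path = (path or []) + [node.name]
    scored = [(child.confidence(doc), child) for child in node.children]
    positive = [sc for sc in scored if sc[0] > 0]
    if not positive:
        return path  # no child accepts the document: stop at the current node
    _, best = max(positive, key=lambda sc: sc[0])
    return classify(best, doc, path)

# Hypothetical tree and weights.
xml = TopicNode("XML", {"xml": 1.2, "dtd": 0.7}, -1.0)
semi = TopicNode("Semistructured Data", {"xml": 1.0, "schema": 0.5}, -0.5, [xml])
mining = TopicNode("Data Mining", {"mine": 1.0, "cluster": 0.8}, -0.5)
root = TopicNode("ROOT", children=[semi, mining])

path_xml = classify(root, {"xml": 1.0, "dtd": 0.5})
path_none = classify(root, {"football": 1.0})
```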
Link Analysis
Web graph G = (S, E)
The HITS Algorithm
Authority score: $x_q = \sum_{(p,q) \in E} y_p$
Hub score: $y_p = \sum_{(p,q) \in E} x_q$
Iterative approximation of the dominant eigenvectors of $A^T A$ and $A A^T$:
$$\vec{x} = A^T \vec{y}, \qquad \vec{y} = A \vec{x}$$
$$\vec{x} := A^T \vec{y} := A^T A \vec{x}$$
$$\vec{y} := A \vec{x} := A A^T \vec{y}$$
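The iteration can be sketched directly on an edge list, without building the matrix A. The L2 normalization after every round, the fixed iteration count, and the toy graph are assumptions added here; the slide only states the update rules.

```python
import math

def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}

def hits(edges, iterations=50):
    # edges: list of (p, q) pairs meaning "page p links to page q".
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # x = A^T y: authority of q is the sum of hub scores of pages linking to q.
        new_auth = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_auth[q] += hub[p]
        # y = A x: hub score of p is the sum of authority scores of pages p links to.
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub

# Hypothetical graph: three hub-like pages pointing to two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
```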
Retraining based on Archetypes
Two sources of potential archetypes:
  Link analysis → Nauth good authorities
  SVM classifier → Nconf best-rated docs
To avoid the "topic drift" phenomenon, the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.
Retraining (2)
if (at least one topic has more than Nmax positive documents
    or all topics have more than Nmin positive documents) {
  for each topic Vi {
    link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents;
    authorities(Vi) = top Nauth documents;
    sort docs of Vi in descending order of confidence;
    archetypes(Vi) = top Nconf from confidence ranking ∪ authorities(Vi);
    remove from archetypes(Vi) all docs with
      confidence < mean confidence of the previous iteration;
    archetypes(Vi) = archetypes(Vi) ∪ bookmarks(Vi) };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi };
  re-initialize URL queue using hubs(Vi) }
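The archetype step for a single topic can be sketched as plain set operations. The function name, its parameters, and the sample documents are hypothetical; the link analysis and the SVM confidences are assumed to have been computed beforehand.

```python
def select_archetypes(authorities, confidence, bookmarks, n_auth, n_conf, prev_mean_conf):
    # authorities: documents of the topic ranked by authority score, best first.
    # confidence: document -> SVM confidence for this topic.
    by_conf = sorted(confidence, key=confidence.get, reverse=True)
    candidates = set(by_conf[:n_conf]) | set(authorities[:n_auth])
    # Drop candidates rated below the mean confidence of the previous training set.
    kept = {d for d in candidates if confidence.get(d, float("-inf")) >= prev_mean_conf}
    # The original bookmarks always stay in the training set.
    return kept | set(bookmarks)

# Hypothetical data for one topic.
confidence = {"d1": 1.35, "d2": 0.9, "d3": 0.4, "d4": 0.2}
authorities = ["d3", "d1", "d4"]
archetypes = select_archetypes(authorities, confidence, ["b1"],
                               n_auth=2, n_conf=2, prev_mean_conf=0.5)
```

Here "d3" is a good authority but is rejected because its confidence falls below the previous mean, which is the guard against topic drift.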
Part III
Evaluation
Testbed
Bookmarks: homepages of researchers in the various areas
  Leaf nodes were filled with 9 to 15 bookmarks
  The total training data comprised 81 documents
Focused crawl:
  Crawling time: 6 h
  Visited: 11000 pages (1800 hosts), link distances 1 to 7
  4230 positively classified (675 different hosts)
  Entire crawl: 7 iterations with re-training
Parameters:
  Nmin = 50, Nmax = 200, Nhub = 50, Nauth = 20, Nconf = 20
  Feature selection: MI criterion, best 300 for each topic
  Authority ranking: HITS algorithm
Crawling Precision
Iteration   Data Mining   XML    Entire ontology
1           0.98          0.94   0.98
2           0.98          0.93   0.98
3           0.99          0.97   0.96
4           0.87          0.99   0.97
5           0.90          0.95   0.96
6           0.98          0.98   0.95
7           0.94          0.97   0.96
Crawling Precision (2)
Iteration   BINGO!   with focusing, no MI   no focusing, no MI
1           0.98     0.89                   0.84
2           0.98     0.86                   0.86
3           0.96     0.75                   0.79
4           0.97     0.78                   0.73
5           0.96     0.55                   0.63
6           0.95     0.54                   0.52
7           0.96     0.63                   0.50
Crawling Recall
Iteration   Data Mining   XML    Entire ontology
1           307           117    807
2           552           343    1615
3           1092          396    2436
4           1553          442    3245
5           2071          562    4072
6           2678          627    4898
7           3027          701    5715
Archetype Selection
Topic "Data Mining":

URL                                                      SVM confidence
http://www.it.iitb.ernet.in/~sunita/it642/               1.35
http://www.research.microsoft.com/research/datamine/     1.31
http://www.acm.org/sigs/sigkdd/explorations/             1.28
http://robotics.stanford.edu/users/ronnyk/               1.24
http://www.kdnuggets.com/index.html                      1.18
http://www.wizsoft.com/                                  1.16
http://www.almaden.ibm.com/cs/people/ragrawal/           1.14
http://www.cs.sfu.ca/~han/DM_Book.html                   1.14
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html    1.14
http://www.cs.cornell.edu/johannes/publications.html     0.78
Archetype Selection (2)
Iteration   Data Mining   XML      Entire ontology
1           10 (1)        5 (0)    24 (4)
2           10 (2)        11 (0)   27 (5)
3           9 (1)         17 (1)   32 (4)
4           8 (0)         7 (0)    29 (3)
5           22 (2)        26 (2)   62 (8)
6           43 (4)        12 (2)   77 (10)
7           38 (0)        13 (1)   75 (8)
Feature Selection
Topic "Data Mining":

Feature     MI weight
mine        0.178
knowledg    0.122
olap        0.106
frame       0.086
pattern     0.066
genet       0.061
discov      0.053
miner       0.053
cluster     0.049
dataset     0.044
Future Work
Large-scale experiments (portal generator)
Annotation and semantic classification of HTML sources (e.g., transformation of HTML to XML for improved data management, detection of “information units”)
Advanced feature construction and feature selection algorithms
Fault tolerance on document collections with wrong samples, adaptive re-training
... ?
Crawler
Key features:
asynchronous DNS lookups with caching
multiple download attempts
advanced duplicate recognition
following multiple redirects
advanced topic-balanced URL-queue
document filters for common datatypes
focusing strategies
Classifier (II)
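Training:
Find hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that separates the samples with maximum margin (quadratic optimization task):
$$\text{minimize: } V(\vec{w}, b, \vec{\xi}) = \frac{1}{2} \, \vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to: } \forall_{i=1}^{n}: \; y_i \, [\vec{w} \cdot \vec{x}_i + b] \ge 1 - \xi_i \quad \text{and} \quad \forall_{i=1}^{n}: \; \xi_i \ge 0$$
Classification:
Test unlabeled vector $\vec{y}$ for $\vec{w} \cdot \vec{y} + b > 0$
Very efficient runtime in O(m)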
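To make the objective concrete, the sketch below evaluates it for a fixed candidate hyperplane. It relies on the fact that, for given w and b, the smallest slack satisfying both constraints is max(0, 1 - y_i (w · x_i + b)). It does not solve the quadratic program; the function name and the sample data are made up for illustration.

```python
def soft_margin_objective(w, b, samples, c):
    # samples: list of (x_i, y_i) with y_i in {+1, -1}; c is the cost constant C.
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    # Smallest feasible slack per sample for this (w, b).
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in samples]
    value = 0.5 * dot(w, w) + c * sum(slacks)
    return value, slacks

# Hypothetical candidate hyperplane and three labeled samples.
samples = [([2.0, 0.0], +1), ([-2.0, 1.0], -1), ([0.5, 0.0], +1)]
value, slacks = soft_margin_objective([1.0, 0.0], 0.0, samples, c=1.0)
```

Only the third sample lies inside the margin, so it is the only one that contributes a non-zero slack to the objective.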
Related Work
General-purpose crawling
Focused crawling
Authority ranking
Classification of Web documents
Web ontologies