Massive Data Sets:
Theory & Practice
Ziv Bar-Yossef
IBM Almaden Research Center
What are Massive Data Sets?
Examples
• Technology: the World-Wide Web, IP packet flows, phone call logs
• Science: genomic data, astronomical sky surveys, weather data
• Business: credit card transactions, billing records, supermarket sales
Characteristics
• Huge (gigabytes, terabytes, petabytes)
• Distributed
• Dynamic
• Heterogeneous
• Noisy
• Unstructured / semi-structured
Nontraditional Challenges
Traditionally: cope with the complexity of the problem.
Massive data sets: cope with the complexity of the data.
New challenges
• How to efficiently compute on massive data sets?
  – Restricted access to the data
  – Not enough time to read the whole data
  – Only a tiny fraction of the data can be held in main memory
• How to find desired information in the data?
• How to summarize the data?
• How to clean the data?
Computational Models for Massive Data Sets
• Sampling: query a small number of data elements.
• Data streams: stream through the data; limited main memory storage.
• Sketching: compress data chunks into small "sketches"; compute over the sketches.
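The slides name the three models without spelling any of them out. As one concrete, hedged illustration of the data-stream model (not an algorithm from this talk), classic reservoir sampling maintains a uniform random sample of a stream while storing only the sample itself:

```python
import random

def reservoir_sample(stream, s, seed=0):
    """Keep a uniform random sample of s elements from a stream of
    unknown length, using memory proportional to s (classic reservoir
    sampling; the seed is fixed here only for reproducibility)."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < s:
            reservoir.append(x)       # fill the reservoir first
        else:
            # element i replaces a random slot with probability s/(i+1)
            j = rng.randrange(i + 1)
            if j < s:
                reservoir[j] = x
    return reservoir
```

The point of the example is the memory bound: the stream is read once, and only `s` elements are ever held in main memory.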
Outline of the Talk
• Web statistics ("practice")
• Sampling lower bounds ("theory")
• Hamming distance sketching ("theory")
• Template detection ("practice")
Web Statistics
(with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of French language pages?
• How large is the web?
[Figure: the "BowTie" structure of the web (Broder et al., 2000); the crawlable web]
Our Approach
• Straightforward solution:
  – Crawl the crawlable web
  – Generate statistics based on the crawl
• Drawbacks:
  – Expensive
  – Complicated implementation
  – Slow
  – Inaccurate
• Our approach: uniform sampling by random walks
  – Random walk on an undirected & regular version of the crawlable web
• Advantages:
  – Provably uniform samples from the crawlable web
  – Runs on a desktop PC in a couple of days
Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step.
Use weighted self-loops to even out page degrees:
w(v) = degmax − deg(v)
[Figure: an example graph with node degrees and self-loop weights]
Fact:
A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
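A minimal sketch of this walk, assuming adjacency lists of out-links and in-links are available (the function and parameter names are illustrative, not from the talk). The w(v) = degmax − deg(v) self-loops are implemented implicitly: among degmax virtual half-edges at v, the ones beyond v's real neighbors keep the walk in place.

```python
import random

def regular_random_walk(out_links, in_links, start, steps, seed=0):
    """Random walk on the undirected, regularized version of a directed
    graph: each node gets degmax - deg(v) weighted self-loops."""
    rng = random.Random(seed)
    deg = {v: len(out_links[v]) + len(in_links[v]) for v in out_links}
    degmax = max(deg.values())
    v = start
    samples = []
    for _ in range(steps):
        neighbors = out_links[v] + in_links[v]
        # choose one of degmax half-edges; indices past the real
        # neighbors are self-loops, so the walk stays at v
        i = rng.randrange(degmax)
        if i < len(neighbors):
            v = neighbors[i]
        samples.append(v)
    return samples
```

By the fact above, on a connected non-bipartite graph the visited-node distribution converges to uniform.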
Convergence Rate ("Mixing Time")
Theorem
Mixing time ≈ log(N)/δ
(N = graph size, δ = the transition matrix's spectral gap)
Experiment (based on a crawl)
For the web, δ ≈ 10⁻⁵
Mixing time: 3.3 million steps
• Self-loop steps are free
• 29,999 out of 30,000 steps are self-loop steps
⇒ The actual mixing time is only 110 steps
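The "110 steps" figure is just the slide's own numbers combined; a quick sanity check:

```python
# Values taken from the slide: 3.3 million total steps, of which only
# 1 in 30,000 follows a real link (the rest are free self-loop steps).
mixing_steps = 3_300_000
link_steps = mixing_steps / 30_000   # non-self-loop steps actually taken
print(int(link_steps))  # -> 110
```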
Realization of the Random Walk
Problems
• The in-links of pages are not readily available
• The degrees of pages are not available
Available sources of in-links:
• Previously visited nodes
• Reverse-link services of search engines
Experiments indicate the samples are still nearly uniform.
Top 20 Internet Domains
(summer 2003)
[Bar chart: share of sampled pages per top-level domain. .com leads with 51.15%, followed in order by .org, .net, .edu, .de, .uk, .au, .us, .es, .jp, .ca, .nl, .it, .ch, .pl, .il, .nz, .gov, .info, and .mx; other labeled values include 10.36%, 9.19%, 5.57%, 4.15%, 3.01%, and 0.61%]
Search Engine Coverage
(summer 2000)
[Bar chart: Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%]
Subsequent Extensions
• Focused sampling (with T. Kanungo and R. Krauthgamer, 2003)
  – "Focused statistics" about web communities:
    • Statistics about the .uk domain
    • Statistics about pages on bicycling
    • Statistics about Arabic language pages
  – Based on a sophisticated extension of the above random walk.
• Study of the web's decay (with A. Broder, R. Kumar, and A. Tomkins, 2003)
  – A measure for how well-maintained web pages are.
  – Based on a random walk idea.
Sampling Lower Bounds
(STOC 2003)
Q1. How many samples are needed to estimate:
– The fraction of pages covered by Google?
– The number of distinct websites?
– The distribution of languages on the web?
Q2. Can we save samples by sampling non-uniformly?
A2. For "symmetric" functions, uniform sampling is the best possible.
("symmetric" = invariant under permutations of the data elements)
A1. A "recipe" for obtaining sampling lower bounds for symmetric functions.
Optimality of Uniform Sampling
(with R. Kumar and D. Sivakumar, STOC 2001)
Theorem
When estimating symmetric functions, uniform sampling is the best possible.
Proof idea
[Figure: the original algorithm queries, say, positions X2, X7, X5 of x = X1 X2 … X8; a simulation runs the same algorithm on a randomly permuted copy of x, so its queries become uniform samples]
Preliminaries
f: Aⁿ → B: a symmetric function
ε: approximation parameter
Pairwise "disjoint" inputs a, b, c, with values f(a), f(b), f(c) in B.
Input "sample distribution" π: the distribution of a uniformly chosen element of x.
Example: for x = (1, 1, 1, 2, 2, 3):
π(1) = 1/2, π(2) = 1/3, π(3) = 1/6
The Lower Bound Recipe
x1,…,xm: "pairwise disjoint" inputs
π1,…,πm: "sample distributions" on x1,…,xm
Theorem:
Any algorithm approximating f requires q = Ω(log m / JS(π1,…,πm)) samples
(JS = the generalized Jensen–Shannon divergence; 0 ≤ JS(π1,…,πm) ≤ log m)
Proof steps:
• Reduction from statistical classification
• Lower bound for statistical classification
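The quantity in the recipe is the generalized Jensen–Shannon divergence of the m sample distributions. A small helper (my own illustration, with equal weights over the distributions) that also exhibits the 0 ≤ JS ≤ log m range:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence of m distributions with
    equal weights: H(average) - average(H). Lies in [0, log2(m)]."""
    m = len(dists)
    avg = [sum(d[i] for d in dists) / m for i in range(len(dists[0]))]
    return entropy(avg) - sum(entropy(d) for d in dists) / m
```

Identical distributions give JS = 0; m distributions with disjoint supports give the maximum, log₂ m.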
Reduction from Statistical Classification
f: Aⁿ → B: a symmetric function
a, b, c: pairwise "disjoint" inputs, with values f(a), f(b), f(c) in B
Statistical classification:
Given uniform samples from x ∈ {a, b, c}, decide whether x = a, x = b, or x = c.
Statistical classification can be solved by any sampling algorithm approximating f.
The "Election Problem"
• Input: a sequence x of n votes for k parties
• Goal: an estimate ŷ s.t. ||ŷ − x|| < ε (viewing x as a distribution over the parties).
Theorem
A poll of size Ω(k/ε²) is required for solving the election problem.
[Figure: an example vote distribution x (n = 18, k = 6): 7/18, 4/18, 3/18, 2/18, 1/18, 1/18]
Combinatorial Designs
[Figure: a universe U with overlapping subsets B1, B2, B3]
A family of subsets B1,…,Bm of a universe U s.t.:
1. Each of them constitutes half of U.
2. The intersection of any two of them is relatively small.
Fact: There exist designs of size exponential in |U|.
Proof of the Lower Bound for the Election Problem
Step 1: Identify a set S of pairwise disjoint inputs:
B1,…,Bm ⊆ {1,…,k}: a design of size m = 2^Ω(k).
S = { x1,…,xm }, where in xi:
• ½ + ε of the votes are split among the parties in Bi.
• ½ − ε of the votes are split among the parties in Bi's complement.
Step 2: JS(π1,…,πm) = O(ε²).
By our theorem, the number of queries is at least Ω(k/ε²).
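Assuming the recipe's bound has the form q = Ω(log m / JS(π1,…,πm)), the two steps combine by plain arithmetic, since log m = log 2^Ω(k) = Ω(k):

```latex
q \;=\; \Omega\!\left(\frac{\log m}{\mathrm{JS}(\pi_1,\dots,\pi_m)}\right)
  \;=\; \Omega\!\left(\frac{\Omega(k)}{O(\varepsilon^2)}\right)
  \;=\; \Omega\!\left(\frac{k}{\varepsilon^2}\right).
```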
Hamming Distance Sketching
(with T.S. Jayram and R. Kumar, 2003)
[Figure: Alice holds x and sends a sketch of x; Bob holds y and sends a sketch of y; the referee must decide whether Ham(x,y) > k or Ham(x,y) ≤ k]
Hamming Distance Sketching
Applications
• Maintenance of large crawls
• Comparison of large files over the network
Previous schemes:
• Sketches of size O(k²) [Kushilevitz, Ostrovsky, Rabani 98], [Yao 03]
• Lower bound: Ω(k)
Our scheme:
• Sketches of size O(k log k)
Preliminaries
• Using the KOR scheme, we can assume Ham(x,y) ≤ 2k.
Balls and bins:
• When throwing n balls into n/log n bins, with high probability the fullest bin has O(log n) balls.
• When throwing n balls into n² bins, with high probability no two balls fall into the same bin.
First Level Hashing
• Hash the positions of x and of y into k/log k bins, yielding substrings x1, x2, x3, … and y1, y2, y3, …
• Ham(x,y) = Σi Ham(xi,yi)
• By the balls-and-bins facts, w.h.p. ∀i, Ham(xi,yi) ≤ O(log k)
[Figure: the bits of x and y distributed into k/log k bins]
Second Level Hashing
• Hash the positions of each pair of substrings (e.g., x3 and y3) into log² k sub-bins, and record one parity bit per sub-bin: σ3,j = the parity of the bits of x3 in the j-th sub-bin, and similarly τ3,j for y3.
• σ3,j = τ3,j iff the number of "pink positions" (positions where x3 and y3 differ) in the j-th sub-bin is even.
• If there are no collisions, Ham(σ3,τ3) = Ham(x3,y3)
• If there are collisions, Ham(σ3,τ3) ≤ Ham(x3,y3)
[Figure: the bits of x3 and y3 hashed into log² k sub-bins, with parity bits σ3,1,…,σ3,6 and τ3,1,…,τ3,6]
The Sketch
• Probability of a collision: a small constant
• For each i = 1,…,k/log k, repeat the second-level hashing t = O(log k) times, obtaining (σi1,τi1),…,(σit,τit).
• With probability at least 1 − 1/k,
  Ham(xi,yi) = maxj Ham(σij,τij)
• Sketch of x: { σij | i = 1,…,k/log k, j = 1,…,t }
• Sketch of y: { τij | i = 1,…,k/log k, j = 1,…,t }
• The referee decides Ham(x,y) ≤ k if and only if
  Σi maxj Ham(σij,τij) ≤ k
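A toy end-to-end rendering of the two-level scheme. This is my own simplification: the hash functions are plain pseudo-random maps driven by a shared seed (standing in for public randomness), and the bin counts and repetition count are illustrative choices, not the exact constants of the scheme.

```python
import math
import random

def hamming_sketch(bits, k, seed=0):
    """Two-level parity sketch of a 0/1 list (toy version of the scheme)."""
    rng = random.Random(seed)           # shared by Alice and Bob
    n, logk = len(bits), max(1, int(math.log2(k)))
    n_bins, n_sub, t = max(1, k // logk), logk ** 2, logk
    first = [rng.randrange(n_bins) for _ in range(n)]   # first-level hash
    sketch = []
    for _ in range(t):                                  # t repetitions
        second = [rng.randrange(n_sub) for _ in range(n)]
        parity = [[0] * n_sub for _ in range(n_bins)]   # one bit per sub-bin
        for pos, b in enumerate(bits):
            parity[first[pos]][second[pos]] ^= b
        sketch.append(parity)
    return sketch

def referee(sk_x, sk_y, k):
    """Accept iff the sketches suggest Ham(x, y) <= k."""
    t, n_bins = len(sk_x), len(sk_x[0])
    total = 0
    for i in range(n_bins):
        # parity mismatches never overcount the true distance in a bin,
        # so take the max over the t repetitions
        total += max(sum(a != b for a, b in zip(sk_x[j][i], sk_y[j][i]))
                     for j in range(t))
    return total <= k
```

Since each parity mismatch requires at least one differing position, the referee's count never exceeds the true Hamming distance; the repetitions fight the undercounting caused by sub-bin collisions.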
Other Sketching Results
• A sketching scheme for the edit distance
  – Leads to the first almost-linear time approximation algorithm for the edit distance.
• Sketch lower bounds for (compressed) pattern matching.
Template Detection
(with S. Rajagopalan, WWW 2002)
Template: a master HTML shell page used for composing new pages.
Our contributions:
• An efficient algorithm for template detection
• An application to improving search engine precision
Templates are Bad for Web IR
• A significant source of "noise" in web pages
  – Their content is unrelated to the topics of the pages in which they reside
  – They create spurious links to unimportant pages
• Extremely common
  – Have become standard in website design
Pagelets
[Chakrabarti 01]
A pagelet is a region in a page that:
• has a single theme
• is not nested within a bigger region with the same theme
[Figure: a page partitioned into pagelets: news headlines, navigational bar, search, and directory]
Template Definition
Template = a collection of pagelets that:
1. Belong to the same website.
2. Are nearly identical.
Template Detection
Template Detection Problem:
Given a set of pages S, find all the templates in S.
Template Detection Algorithm
• Group the pages in S by website.
• For each website w:
  – For each page p ∈ w:
    • Partition p into pagelets p1,…,pk
    • Compute a "shingle" sketch for each pagelet [Broder et al. 1997]
  – Group the resulting pagelets by their sketches.
  – Output all pagelet groups of size > 1.
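The algorithm above can be sketched compactly. The shingle sketch here is a min-hash-style stand-in, not the exact Broder et al. construction, and pagelets are represented simply as text strings (partitioning a page into pagelets is assumed done):

```python
import hashlib
from collections import defaultdict

def shingle_sketch(text, w=4, keep=8):
    """Min-hash style sketch: the `keep` smallest hashes of the
    w-token shingles of the text (illustrative parameters)."""
    tokens = text.split()
    shingles = {' '.join(tokens[i:i + w])
                for i in range(max(1, len(tokens) - w + 1))}
    hashes = sorted(int(hashlib.sha1(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return tuple(hashes[:keep])

def detect_templates(pages):
    """pages: list of (site, [pagelet_text, ...]).
    Returns groups of same-sketch pagelets of size > 1, per site."""
    by_site = defaultdict(list)
    for site, pagelets in pages:
        by_site[site].extend(pagelets)
    templates = []
    for site, pagelets in by_site.items():
        groups = defaultdict(list)
        for p in pagelets:
            groups[shingle_sketch(p)].append(p)   # group by sketch
        templates.extend(g for g in groups.values() if len(g) > 1)
    return templates
```

A navigation bar repeated verbatim across pages of the same site hashes to the same sketch and is reported as a template group.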
HITS & Clever
[Kleinberg 1997, Chakrabarti et al. 1998]
[Figure: hubs pointing to authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)
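The two update rules lend themselves to a short power-iteration sketch (normalization is added to keep the scores bounded; this is a generic illustration, not the authors' implementation):

```python
def hits(graph, iters=50):
    """Iterate h(p) = sum of a(q) over out-links q and
    a(p) = sum of h(q) over in-links q. `graph` maps page -> out-links."""
    nodes = set(graph) | {q for qs in graph.values() for q in qs}
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    in_links = {p: [] for p in nodes}
    for p, qs in graph.items():
        for q in qs:
            in_links[q].append(p)
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in in_links[p]) for p in nodes}
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in nodes}
        # L2-normalize so the scores do not blow up across iterations
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

On a tiny graph where one page is cited by two hubs and another by one, the doubly-cited page gets the higher authority score.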
"Template" Clever
[Figure legend: pages, pagelets, and templatized pagelets, arranged as hubs and authorities]
• Hubs: all the non-templatized constituent pagelets of pages in the base set.
• Authorities: all pages in the base set.
Classical Clever vs. Template Clever
[Chart: average precision at ranks 10, 20, 30, 40, 50 for broad queries (scale 0–120), comparing Classical Clever and Template Clever]
Template Proliferation
[Chart: template frequency (y-axis 0–0.7) for the ARC set queries: recycling_cans, gardening, mutual_funds, java, Zener, San_Francisco, field_hockey, Penelope_Fitzgerald, HIV, bicycling, affirmative_action, amusement_parks, Thailand_tourism, cruises, volcano, stamp_collecting, architecture, Shakespeare, Gulf_war, zen_buddhism, lyme_disease, Death_Valley, citrus_groves, cheese, table_tennis, blues, classical_guitar, telecommuting, parallel_architecture]
Summary
• Web data mining via random walks on the web graph:
– Web statistics
– Focused statistics
– Web decay
• Sampling lower bounds
– Optimality of uniform sampling for symmetric functions
– A “recipe” for lower bounds
• Sketching of string distance measures
– Hamming distance
– Edit distance
• Template detection
Some of My Other Work
• Database
  – Semi-structured data and XML
• Computational Complexity
  – Communication complexity
  – Pseudo-randomness and de-randomization
  – Space-bounded computations
  – Parallel computation complexity
• Algorithm Design
  – Data stream algorithms
  – Internet auctions
Web Statistics
(with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of porn pages?
• How large is the web?
[Figure: the "BowTie" structure of the web (Broder et al., 2000), with IN, SCC, and OUT components; the crawlable web]
Straightforward Random Walk
Follow a random out-link at each step.
[Figure: a directed walk over pages such as yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv]
• Gets stuck in sinks and in dense web communities
• Biased towards popular pages
• Converges slowly, if at all
Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step.
Use weighted self-loops to even out page degrees:
w(v) = degmax − deg(v)
[Figure: the example graph, now over pages such as yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv, with node degrees and self-loop weights]
Fact:
A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
Evaluation: Bias towards High-Degree Nodes
[Chart: percent of nodes visited by the walk, per decile of nodes ordered by degree, from high degree to low degree]
Evaluation: Bias towards the Search Engines
[Chart: estimates of search engine size (e.g., at 30% and 50%) vs. actual search engine size]
Link-Based Web IR Applications
• Search and ranking
  – HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998]
  – PageRank [Brin and Page 1998]
  – SALSA [Lempel and Moran 2000]
• Similarity search
  – Co-citation [Dean and Henzinger 1999]
• Categorization
  – Hyperclass [Chakrabarti, Dom, Indyk 1998]
• Focused crawling
  – FOCUS [Chakrabarti, van den Berg, Dom 1999]
• …
Hypertext IR Principles
Underlying principles of link analysis:
• Relevant Linkage Principle [Kleinberg 1997]
  – p links to q ⇒ q is relevant to p
• Topical Unity Principle [Kessler 1963, Small 1973]
  – q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
• Lexical Affinity Principle [Maarek et al. 1991]
  – The closer the links to q1 and q2 are, the stronger the relation between them.
[Figures: small link diagrams (p → q; p co-citing q1 and q2; p linking to q1, q2, q3) illustrating each principle]
Example: HITS & Clever
[Kleinberg 1997, Chakrabarti et al. 1998]
[Figure: hubs and authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)
• Relevant Linkage Principle
  – All links propagate score from hubs to authorities and vice versa.
• Topical Unity Principle
  – Co-cited authorities propagate score to each other.
• Lexical Affinity Principle (Clever)
  – Text around the links is used to weight their relevance.