Massive Data Sets:
Theory & Practice
Ziv Bar-Yossef
IBM Almaden Research Center
What are Massive Data Sets?
Examples
• Technology: the World-Wide Web, IP packet flows, phone call logs
• Science: genomic data, astronomical sky surveys, weather data
• Business: credit card transactions, billing records, supermarket sales
Characteristics
• Huge (gigabytes, terabytes, petabytes)
• Distributed
• Dynamic
• Heterogeneous
• Noisy
• Unstructured / semi-structured
Nontraditional Challenges
Traditionally: cope with the complexity of the problem.
Massive data sets: cope with the complexity of the data.
New challenges
• How to efficiently compute on massive data sets?
  – Restricted access to the data
  – Not enough time to read the whole data
  – Only a tiny fraction of the data can be held in main memory
• How to find desired information in the data?
• How to summarize the data?
• How to clean the data?
Computational Models for Massive Data Sets
• Sampling: query a small number of data elements.
• Data streams: stream through the data; limited main memory storage.
• Sketching: compress data chunks into small "sketches"; compute over the sketches.
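The slides name the three models without spelling any of them out. As one concrete, hedged illustration of the data-stream model (not an algorithm from this talk), classic reservoir sampling maintains a uniform random sample of a stream while storing only the sample itself:

```python
import random

def reservoir_sample(stream, s, seed=0):
    """Keep a uniform random sample of s elements from a stream of
    unknown length, using memory proportional to s (classic reservoir
    sampling; the seed is fixed here only for reproducibility)."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < s:
            reservoir.append(x)       # fill the reservoir first
        else:
            # element i replaces a random slot with probability s/(i+1)
            j = rng.randrange(i + 1)
            if j < s:
                reservoir[j] = x
    return reservoir
```

The point of the example is the memory bound: the stream is read once, and only `s` elements are ever held in main memory.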
Outline of the Talk
• Web statistics ("practice")
• Sampling lower bounds ("theory")
• Hamming distance sketching ("theory")
• Template detection ("practice")
Web Statistics
(with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of French language pages?
• How large is the web?
[Figure: the "BowTie" structure of the web (Broder et al., 2000); the crawlable web]
Our Approach
• Straightforward solution:
  – Crawl the crawlable web
  – Generate statistics based on the crawl
• Drawbacks:
  – Expensive
  – Complicated implementation
  – Slow
  – Inaccurate
• Our approach: uniform sampling by random walks
  – Random walk on an undirected & regular version of the crawlable web
• Advantages:
  – Provably uniform samples from the crawlable web
  – Runs on a desktop PC in a couple of days
Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step.
Use weighted self-loops to even out page degrees:
w(v) = degmax − deg(v)
[Figure: an example graph with node degrees and self-loop weights]
Fact:
A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
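A minimal sketch of this walk, assuming adjacency lists of out-links and in-links are available (the function and parameter names are illustrative, not from the talk). The w(v) = degmax − deg(v) self-loops are implemented implicitly: among degmax virtual half-edges at v, the ones beyond v's real neighbors keep the walk in place.

```python
import random

def regular_random_walk(out_links, in_links, start, steps, seed=0):
    """Random walk on the undirected, regularized version of a directed
    graph: each node gets degmax - deg(v) weighted self-loops."""
    rng = random.Random(seed)
    deg = {v: len(out_links[v]) + len(in_links[v]) for v in out_links}
    degmax = max(deg.values())
    v = start
    samples = []
    for _ in range(steps):
        neighbors = out_links[v] + in_links[v]
        # choose one of degmax half-edges; indices past the real
        # neighbors are self-loops, so the walk stays at v
        i = rng.randrange(degmax)
        if i < len(neighbors):
            v = neighbors[i]
        samples.append(v)
    return samples
```

By the fact above, on a connected non-bipartite graph the visited-node distribution converges to uniform.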
Convergence Rate ("Mixing Time")
Theorem
Mixing time ≈ log(N)/δ
(N = graph size, δ = the transition matrix's spectral gap)
Experiment (based on a crawl)
For the web, δ ≈ 10⁻⁵
Mixing time: 3.3 million steps
• Self-loop steps are free
• 29,999 out of 30,000 steps are self-loop steps
⇒ The actual mixing time is only 110 steps
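The "110 steps" figure is just the slide's own numbers combined; a quick sanity check:

```python
# Values taken from the slide: 3.3 million total steps, of which only
# 1 in 30,000 follows a real link (the rest are free self-loop steps).
mixing_steps = 3_300_000
link_steps = mixing_steps / 30_000   # non-self-loop steps actually taken
print(int(link_steps))  # -> 110
```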
Realization of the Random Walk
Problems
• The in-links of pages are not readily available
• The degrees of pages are not available
Available sources of in-links:
• Previously visited nodes
• Reverse-link services of search engines
Experiments indicate the samples are still nearly uniform.
Top 20 Internet Domains
(summer 2003)
[Bar chart: share of sampled pages per top-level domain. .com leads with 51.15%, followed in order by .org, .net, .edu, .de, .uk, .au, .us, .es, .jp, .ca, .nl, .it, .ch, .pl, .il, .nz, .gov, .info, and .mx; other labeled values include 10.36%, 9.19%, 5.57%, 4.15%, 3.01%, and 0.61%]
Search Engine Coverage
(summer 2000)
[Bar chart: Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%]
Subsequent Extensions
• Focused sampling (with T. Kanungo and R. Krauthgamer, 2003)
  – "Focused statistics" about web communities:
    • Statistics about the .uk domain
    • Statistics about pages on bicycling
    • Statistics about Arabic language pages
  – Based on a sophisticated extension of the above random walk.
• Study of the web's decay (with A. Broder, R. Kumar, and A. Tomkins, 2003)
  – A measure for how well-maintained web pages are.
  – Based on a random walk idea.
Sampling Lower Bounds
(STOC 2003)
Q1. How many samples are needed to estimate:
– The fraction of pages covered by Google?
– The number of distinct websites?
– The distribution of languages on the web?
Q2. Can we save samples by sampling non-uniformly?
A2. For "symmetric" functions, uniform sampling is the best possible.
("symmetric" = invariant under permutations of the data elements)
A1. A "recipe" for obtaining sampling lower bounds for symmetric functions.
Optimality of Uniform Sampling
(with R. Kumar and D. Sivakumar, STOC 2001)
Theorem
When estimating symmetric functions, uniform sampling is the best possible.
Proof idea
[Figure: the original algorithm queries, say, positions X2, X7, X5 of x = X1 X2 … X8; a simulation runs the same algorithm on a randomly permuted copy of x, so its queries become uniform samples]
Preliminaries
f: Aⁿ → B: a symmetric function
ε: approximation parameter
Pairwise "disjoint" inputs a, b, c, with values f(a), f(b), f(c) in B.
Input "sample distribution" π: the distribution of a uniformly chosen element of x.
Example: for x = (1, 1, 1, 2, 2, 3):
π(1) = 1/2, π(2) = 1/3, π(3) = 1/6
The Lower Bound Recipe
x1,…,xm: "pairwise disjoint" inputs
π1,…,πm: "sample distributions" on x1,…,xm
Theorem:
Any algorithm approximating f requires q = Ω(log m / JS(π1,…,πm)) samples
(JS = the generalized Jensen–Shannon divergence; 0 ≤ JS(π1,…,πm) ≤ log m)
Proof steps:
• Reduction from statistical classification
• Lower bound for statistical classification
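The quantity in the recipe is the generalized Jensen–Shannon divergence of the m sample distributions. A small helper (my own illustration, with equal weights over the distributions) that also exhibits the 0 ≤ JS ≤ log m range:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence of m distributions with
    equal weights: H(average) - average(H). Lies in [0, log2(m)]."""
    m = len(dists)
    avg = [sum(d[i] for d in dists) / m for i in range(len(dists[0]))]
    return entropy(avg) - sum(entropy(d) for d in dists) / m
```

Identical distributions give JS = 0; m distributions with disjoint supports give the maximum, log₂ m.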
Reduction from Statistical Classification
f: Aⁿ → B: a symmetric function
a, b, c: pairwise "disjoint" inputs, with values f(a), f(b), f(c) in B
Statistical classification:
Given uniform samples from x ∈ {a, b, c}, decide whether x = a, x = b, or x = c.
Statistical classification can be solved by any sampling algorithm approximating f.
The "Election Problem"
• Input: a sequence x of n votes for k parties
• Goal: an estimate ŷ s.t. ||ŷ − x|| < ε (viewing x as a distribution over the parties).
Theorem
A poll of size Ω(k/ε²) is required for solving the election problem.
[Figure: an example vote distribution x (n = 18, k = 6): 7/18, 4/18, 3/18, 2/18, 1/18, 1/18]
Combinatorial Designs
[Figure: a universe U with overlapping subsets B1, B2, B3]
A family of subsets B1,…,Bm of a universe U s.t.:
1. Each of them constitutes half of U.
2. The intersection of any two of them is relatively small.
Fact: There exist designs of size exponential in |U|.
Proof of the Lower Bound for the Election Problem
Step 1: Identify a set S of pairwise disjoint inputs:
B1,…,Bm ⊆ {1,…,k}: a design of size m = 2^Ω(k).
S = { x1,…,xm }, where in xi:
• ½ + ε of the votes are split among the parties in Bi.
• ½ − ε of the votes are split among the parties in Bi's complement.
Step 2: JS(π1,…,πm) = O(ε²).
By our theorem, the number of queries is at least Ω(k/ε²).
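Assuming the recipe's bound has the form q = Ω(log m / JS(π1,…,πm)), the two steps combine by plain arithmetic, since log m = log 2^Ω(k) = Ω(k):

```latex
q \;=\; \Omega\!\left(\frac{\log m}{\mathrm{JS}(\pi_1,\dots,\pi_m)}\right)
  \;=\; \Omega\!\left(\frac{\Omega(k)}{O(\varepsilon^2)}\right)
  \;=\; \Omega\!\left(\frac{k}{\varepsilon^2}\right).
```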
Hamming Distance Sketching
(with T.S. Jayram and R. Kumar, 2003)
[Figure: Alice holds x and sends a sketch of x; Bob holds y and sends a sketch of y; the referee must decide whether Ham(x,y) > k or Ham(x,y) ≤ k]
Hamming Distance Sketching
Applications
• Maintenance of large crawls
• Comparison of large files over the network
Previous schemes:
• Sketches of size O(k²) [Kushilevitz, Ostrovsky, Rabani 98], [Yao 03]
• Lower bound: Ω(k)
Our scheme:
• Sketches of size O(k log k)
Preliminaries
• Using the KOR scheme, we can assume Ham(x,y) ≤ 2k.
Balls and bins:
• When throwing n balls into n/log n bins, with high probability the fullest bin has O(log n) balls.
• When throwing n balls into n² bins, with high probability no two balls fall into the same bin.
First Level Hashing
• Hash the positions of x and of y into k/log k bins, yielding substrings x1, x2, x3, … and y1, y2, y3, …
• Ham(x,y) = Σi Ham(xi,yi)
• By the balls-and-bins facts, w.h.p. ∀i, Ham(xi,yi) ≤ O(log k)
[Figure: the bits of x and y distributed into k/log k bins]
Second Level Hashing
• Hash the positions of each pair of substrings (e.g., x3 and y3) into log² k sub-bins, and record one parity bit per sub-bin: σ3,j = the parity of the bits of x3 in the j-th sub-bin, and similarly τ3,j for y3.
• σ3,j = τ3,j iff the number of "pink positions" (positions where x3 and y3 differ) in the j-th sub-bin is even.
• If there are no collisions, Ham(σ3,τ3) = Ham(x3,y3)
• If there are collisions, Ham(σ3,τ3) ≤ Ham(x3,y3)
[Figure: the bits of x3 and y3 hashed into log² k sub-bins, with parity bits σ3,1,…,σ3,6 and τ3,1,…,τ3,6]
The Sketch
• Probability of a collision: a small constant
• For each i = 1,…,k/log k, repeat the second-level hashing t = O(log k) times, obtaining (σi1,τi1),…,(σit,τit).
• With probability at least 1 − 1/k,
  Ham(xi,yi) = maxj Ham(σij,τij)
• Sketch of x: { σij | i = 1,…,k/log k, j = 1,…,t }
• Sketch of y: { τij | i = 1,…,k/log k, j = 1,…,t }
• The referee decides Ham(x,y) ≤ k if and only if
  Σi maxj Ham(σij,τij) ≤ k
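A toy end-to-end rendering of the two-level scheme. This is my own simplification: the hash functions are plain pseudo-random maps driven by a shared seed (standing in for public randomness), and the bin counts and repetition count are illustrative choices, not the exact constants of the scheme.

```python
import math
import random

def hamming_sketch(bits, k, seed=0):
    """Two-level parity sketch of a 0/1 list (toy version of the scheme)."""
    rng = random.Random(seed)           # shared by Alice and Bob
    n, logk = len(bits), max(1, int(math.log2(k)))
    n_bins, n_sub, t = max(1, k // logk), logk ** 2, logk
    first = [rng.randrange(n_bins) for _ in range(n)]   # first-level hash
    sketch = []
    for _ in range(t):                                  # t repetitions
        second = [rng.randrange(n_sub) for _ in range(n)]
        parity = [[0] * n_sub for _ in range(n_bins)]   # one bit per sub-bin
        for pos, b in enumerate(bits):
            parity[first[pos]][second[pos]] ^= b
        sketch.append(parity)
    return sketch

def referee(sk_x, sk_y, k):
    """Accept iff the sketches suggest Ham(x, y) <= k."""
    t, n_bins = len(sk_x), len(sk_x[0])
    total = 0
    for i in range(n_bins):
        # parity mismatches never overcount the true distance in a bin,
        # so take the max over the t repetitions
        total += max(sum(a != b for a, b in zip(sk_x[j][i], sk_y[j][i]))
                     for j in range(t))
    return total <= k
```

Since each parity mismatch requires at least one differing position, the referee's count never exceeds the true Hamming distance; the repetitions fight the undercounting caused by sub-bin collisions.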
Other Sketching Results
• A sketching scheme for the edit distance
  – Leads to the first almost-linear time approximation algorithm for the edit distance.
• Sketch lower bounds for (compressed) pattern matching.
Template Detection
(with S. Rajagopalan, WWW 2002)
Template: a master HTML shell page used for composing new pages.
Our contributions:
• An efficient algorithm for template detection
• An application to improving search engine precision
Templates are Bad for Web IR
• A significant source of "noise" in web pages
  – Their content is unrelated to the topics of the pages in which they reside
  – They create spurious links to unimportant pages
• Extremely common
  – Have become standard in website design
Pagelets
[Chakrabarti 01]
A pagelet is a region in a page that:
• has a single theme
• is not nested within a bigger region with the same theme
[Figure: a page partitioned into pagelets: news headlines, navigational bar, search, and directory]
Template Definition
Template = a collection of pagelets that:
1. Belong to the same website.
2. Are nearly identical.
Template Detection
Template Detection Problem:
Given a set of pages S, find all the templates in S.
Template Detection Algorithm
• Group the pages in S by website.
• For each website w:
  – For each page p ∈ w:
    • Partition p into pagelets p1,…,pk
    • Compute a "shingle" sketch for each pagelet [Broder et al. 1997]
  – Group the resulting pagelets by their sketches.
  – Output all pagelet groups of size > 1.
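The algorithm above can be sketched compactly. The shingle sketch here is a min-hash-style stand-in, not the exact Broder et al. construction, and pagelets are represented simply as text strings (partitioning a page into pagelets is assumed done):

```python
import hashlib
from collections import defaultdict

def shingle_sketch(text, w=4, keep=8):
    """Min-hash style sketch: the `keep` smallest hashes of the
    w-token shingles of the text (illustrative parameters)."""
    tokens = text.split()
    shingles = {' '.join(tokens[i:i + w])
                for i in range(max(1, len(tokens) - w + 1))}
    hashes = sorted(int(hashlib.sha1(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return tuple(hashes[:keep])

def detect_templates(pages):
    """pages: list of (site, [pagelet_text, ...]).
    Returns groups of same-sketch pagelets of size > 1, per site."""
    by_site = defaultdict(list)
    for site, pagelets in pages:
        by_site[site].extend(pagelets)
    templates = []
    for site, pagelets in by_site.items():
        groups = defaultdict(list)
        for p in pagelets:
            groups[shingle_sketch(p)].append(p)   # group by sketch
        templates.extend(g for g in groups.values() if len(g) > 1)
    return templates
```

A navigation bar repeated verbatim across pages of the same site hashes to the same sketch and is reported as a template group.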
HITS & Clever
[Kleinberg 1997, Chakrabarti et al. 1998]
[Figure: hubs pointing to authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)
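The two update rules lend themselves to a short power-iteration sketch (normalization is added to keep the scores bounded; this is a generic illustration, not the authors' implementation):

```python
def hits(graph, iters=50):
    """Iterate h(p) = sum of a(q) over out-links q and
    a(p) = sum of h(q) over in-links q. `graph` maps page -> out-links."""
    nodes = set(graph) | {q for qs in graph.values() for q in qs}
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    in_links = {p: [] for p in nodes}
    for p, qs in graph.items():
        for q in qs:
            in_links[q].append(p)
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in in_links[p]) for p in nodes}
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in nodes}
        # L2-normalize so the scores do not blow up across iterations
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

On a tiny graph where one page is cited by two hubs and another by one, the doubly-cited page gets the higher authority score.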
"Template" Clever
[Figure legend: pages, pagelets, and templatized pagelets, arranged as hubs and authorities]
• Hubs: all the non-templatized constituent pagelets of pages in the base set.
• Authorities: all pages in the base set.
Classical Clever vs. Template Clever
[Chart: average precision at ranks 10, 20, 30, 40, 50 for broad queries (scale 0–120), comparing Classical Clever and Template Clever]
Template Proliferation
[Chart: template frequency (y-axis 0–0.7) for the ARC set queries: recycling_cans, gardening, mutual_funds, java, Zener, San_Francisco, field_hockey, Penelope_Fitzgerald, HIV, bicycling, affirmative_action, amusement_parks, Thailand_tourism, cruises, volcano, stamp_collecting, architecture, Shakespeare, Gulf_war, zen_buddhism, lyme_disease, Death_Valley, citrus_groves, cheese, table_tennis, blues, classical_guitar, telecommuting, parallel_architecture]
Summary
• Web data mining via random walks on the web graph:
– Web statistics
– Focused statistics
– Web decay
• Sampling lower bounds
– Optimality of uniform sampling for symmetric functions
– A “recipe” for lower bounds
• Sketching of string distance measures
– Hamming distance
– Edit distance
• Template detection
Some of My Other Work
• Database
  – Semi-structured data and XML
• Computational Complexity
  – Communication complexity
  – Pseudo-randomness and de-randomization
  – Space-bounded computations
  – Parallel computation complexity
• Algorithm Design
  – Data stream algorithms
  – Internet auctions
Web Statistics
(with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
• What fraction of the web is covered by Google?
• Which is the largest country domain on the web?
• What is the percentage of porn pages?
• How large is the web?
[Figure: the "BowTie" structure of the web (Broder et al., 2000), with IN, SCC, and OUT components; the crawlable web]
Straightforward Random Walk
Follow a random out-link at each step.
[Figure: a directed walk over pages such as yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv]
• Gets stuck in sinks and in dense web communities
• Biased towards popular pages
• Converges slowly, if at all
Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step.
Use weighted self-loops to even out page degrees:
w(v) = degmax − deg(v)
[Figure: the example graph, now over pages such as yahoo.com, amazon.com, and www.almaden.ibm.com/cs/people/ziv, with node degrees and self-loop weights]
Fact:
A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
Evaluation: Bias towards High-Degree Nodes
[Chart: percent of nodes visited by the walk, per decile of nodes ordered by degree, from high degree to low degree]
Evaluation: Bias towards the Search Engines
[Chart: estimates of search engine size (e.g., at 30% and 50%) vs. actual search engine size]
Link-Based Web IR Applications
• Search and ranking
  – HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998]
  – PageRank [Brin and Page 1998]
  – SALSA [Lempel and Moran 2000]
• Similarity search
  – Co-citation [Dean and Henzinger 1999]
• Categorization
  – Hyperclass [Chakrabarti, Dom, Indyk 1998]
• Focused crawling
  – FOCUS [Chakrabarti, van den Berg, Dom 1999]
• …
Hypertext IR Principles
Underlying principles of link analysis:
• Relevant Linkage Principle [Kleinberg 1997]
  – p links to q ⇒ q is relevant to p
• Topical Unity Principle [Kessler 1963, Small 1973]
  – q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
• Lexical Affinity Principle [Maarek et al. 1991]
  – The closer the links to q1 and q2 are, the stronger the relation between them.
[Figures: small link diagrams (p → q; p co-citing q1 and q2; p linking to q1, q2, q3) illustrating each principle]
Example: HITS & Clever
[Kleinberg 1997, Chakrabarti et al. 1998]
[Figure: hubs and authorities]
h(p) = Σ_{q ∈ out(p)} a(q)
a(p) = Σ_{q ∈ in(p)} h(q)
• Relevant Linkage Principle
  – All links propagate score from hubs to authorities and vice versa.
• Topical Unity Principle
  – Co-cited authorities propagate score to each other.
• Lexical Affinity Principle (Clever)
  – Text around the links is used to weight their relevance.