Download Document

Pig Latin & RDF Data Analysis The MD-join: An Operator for Complex OLAP, ICDE 2001 MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004 Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008 RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web, ISWC 2009 Towards Scalable RDF Graph Analytics on MapReduce, MDAC 2010 Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea 2010-08-20 presented by Jaeseok Myung MapReduce Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea MapReduce  MapReduce: Simplified Data Processing on Large Cluster, 2004  Map: (k1, v1) -> list (k2, v2)  Reduce: (k2, list (v2)) -> (k3, v3) JTeam, http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/ Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 3/38 Programming with MapReduce map (String key, String value): // key : document name // value : document contents for each word w in value: EmitIntermediate (w, “1”); reduce (String key, Iterator values): // key : a word // values : a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(key, AsString(result)); Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 4/38 Why Is MapReduce Useful?  Scalability  Capable of handling large amount of data  Extensibility  Commodity machines  Shared-nothing architecture  Fault-tolerance  Writing intermediate results into the physical storage  Flexibility  Enabling users to write custom codes Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 5/38 Pig Latin Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea Motivation  SQL wrench programmers away from their preferred method of analyzing data, namely writing imperative scripts or code, toward writing declarative queries in SQL, which they often find unnatural, and overly restrictive  MapReduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse  We have developed a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and low-level, procedural style of MapReduce Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 7/38 Pig Latin - Example Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 8/38 Pig Latin  Data Model  Atom, Tuple, Bag, Map  Language  Expressions  Operators -> MapReduce – LOAD, STORE, SPLIT – FOREACH, FILTER – COGROUP, GROUP, JOIN – Aggregate Functions – UDF (User Defined Function) Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 9/38 Data Model  Atom   Tuple   A tuple is a sequence of fields, each of which can be any of the data types, e.g., (‘alice’, ‘lakers’) Bag   An atom contains a simple atomic value such as a string or a number, e.g., ‘alice’ A bag is a collection of tuples with possible duplicates Map  A map is a collection of data items, where each item has an associated key through which it can be looked up Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 10/38 Expressions in Pig Latin Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 11/38 RDF Data Analysis Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea RDF Data Model Graph representation Statements (triples) Sub Prop Obj R1 type Ranking R1 pageRank 11 R1 pageURL Url1 R1 avgDuration 97 UV1 type UserVisits UV1 srcIP 158.112.27.3 UV1 destURL url1 UV1 adRevenue 339.08142 UV1 visitDate 1979/12/12 UV1 userAgent SCOPE UV1 cCode VNM UV1 iCode VNM-KH UV1 sKeyword comets UV1 avgTime 3 Center for E-Business Technology Rankings Groups = Stars UserVisits Copyright  2010 by CEBT IDS Lab. Seminar – 14/38 Basic Steps for Analytical Query Processing Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30  Rankings map1 reduce1 UserVisits FILTER FILTER JOIN JOIN JOIN map2 reduce2 map3 reduce3 GROUP BY map4 FOREACH reduce4 Step 321 : Aggregation Pattern Matching Grouping Copyright  2010 by CEBT IDS Lab. Seminar – 15/38 Positioning  RAPID+  RDF Graph Pattern Matching  RAPID  Grouping & Aggregation Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 16/38 RAPID+ Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea Pattern Matching in Pig: Approach 1 Rankings type R1 RankingsStarPattern = Ranking JOIN triples1 ON Sub, pageRank 11 pageURL triples2 ON Sub, url1 Triple store triples1 triples3 ON Sub; triples3 triples2 Sub Prop Obj Sub Prop Obj R1 R1 R1 UV1 UV1 type pageRank pageURL type srcIP Ranking 11 Url1 UserVisits 158.112.27. 3 R1 R1 R1 UV1 UV1 type pageRank pageURL type srcIP Ranking 11 Url1 UserVisits 158.112.27. 3 Sub Prop R1 R1 R1 UV1 UV1 type pageRank pageURL type srcIP Obj Ranking 11 Url1 UserVisits 158.112.27. 3 Rankings star pattern = 3-way self-join UserVisits star pattern = 5-way self-join Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 18/38 Pattern Matching in Pig: Approach 2 LOAD all the RDF triples SPLIT typeRanking destURL Sub Prop Obj UV1 destURL url1 UV2 destURL url1 Sub Prop Obj R1 type Ranking R2 type Ranking pageRank pageURL Sub Prop Obj R1 pageURL url1 R2 pageURL url2 typeUV visitDate srcIP Sub Prop Obj R1 pageRank 11 R2 pageRank 27 Sub Prop Obj UV1 visitDate 1979/12/12 UV2 visitDate 1980/02/02 Sub Prop Obj UV1 type userVisits UV2 type userVisits Sub Prop Obj UV1 scrIP 158.112.27.3 UV2 scrIP 159.222.21.9 adRev Sub Prop Obj UV1 adRev 339.08142 UV2 adRev 330.51248 Filte r visitDate Sub Prop Obj UV1 visitDate 1979/12/12 UV4 visitDate 1979/12/02 UserVisits = JOIN (compute Star Pattern) Ranking = JOIN (compute Star Pattern) JOIN between Ranking, UserVisits Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 19/38 RAPID+  Propose UDFs for RDF pattern matching  Goal: Minimize I/O costs  Strategy:  Concurrent computation of star patterns using grouping-based algorithm  Can improve efficiency using Operator-coalescing and Look-ahead processing Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 20/38 Concurrent Star Pattern Matching Sub R1 R1 Ranking R1 R1 UV1 UV1 UV1 UV1 UV1 UserVisits UV1 UV1 UV1 UV1 UV1 Prop type pageRank pageURL avgDuration type srcIP destURL adRevenue visitDate userAgent cCode iCode sKeyword avgTime Obj Ranking 11 Url1 97 UserVisits 158.112.0 url1 339.08142 1979/12/12 SCOPE VNM VNM-KH comets 3 Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30 Single intermediate result Filter irrelevant properties Copyright  2010 by CEBT Sub Prop Obj R1 type Ranking R1 pageRank 11 R1 pageURL Url1 UV1 type UserVisits UV1 srcIP 158.112.0 UV1 destURL url1 UV1 adRevenue 339.08142 UV1 visitDate 1979/12/12 IDS Lab. Seminar – 21/38 Operator Coalescing Filter irrelevant triples by coalescing LOAD and FILTER operators Our Approach Using Pig Latin LOAD map1 FILTER Operator Coalescing map1 LOAD loadFilter input = LOAD ‘\data’ using loadFilter ( pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits ) Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 22/38 Grouping-based Pattern Matching starSubgraphs = GROUP input BY $0; Sub Prop Obj R1 type Ranking R1 pageRank 11 R1 pageURL Url1 UV1 type UserVisits UV1 srcIP UV1 destURL 158.112.27. 3 url1 UV1 adRevenue 339.08142 UV1 visitDate 1979/12/12 Center for E-Business Technology GROUP BY Subject Copyright  2010 by CEBT IDS Lab. Seminar – 23/38 Filtering the Groups  Structure-based filtering eliminate sub graphs with missing properties visitDate between 1979/12/01 Missing srcIP and 1979/12/30  Value-based filtering validate each sub graph against filter condition Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 24/38 Look-ahead Processing Star Pattern Matching  Joining the Stars Look-Ahead - Annotate bag based on join key Join between the star sub graphs Eliminate properties irrelevant for future processing (join and filter prop)  Minimize size of intermediate results Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 25/38 Comparison: Pig vs RAPID+ Pig Approach RAPID+ Multiple map-reduce cycles - N star sub graphs  N cycles Single cycle - N star sub graphs  1 cycle Potential for increased I/O (i)Disk spills (SPLIT operator) (ii)Materialization of several intermediate results due to sequential computation of star patterns Minimized I/O (i)Filtering in triple storage model + load-filter coalescing (ii)Concurrent computation of star patterns (single intermediate result) Would require advanced optimization techniques - Introduce project operator to eliminate unneeded columns Smaller intermediate result sizes - Eliminate tuples and columns not necessary in future steps of processing Not applicable Minimize repeated tuple handling by look-ahead processing Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 26/38 Experiment Cost Analysis for Task A (PM) 5-node cluster Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 27/38 RAPID Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea RAPID  Propose a language, RAPID, which extends Yahoo’s Pig Latin language with query primitives for dealing with the graph structured nature of RDF  Propose an approach for achieving the scalable processing of the non-trivial analytical tasks on RDF datasets that is based on an efficient multidimensional query operator called the MD-Join Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 29/38 Architecture of RAPID Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 30/38 Data Expression  Class Expression  type:Class  Property Expression  Path Expression Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 31/38 MD-Join  The MD-Join: An Operator for Complex OLAP, ICDE 2001   Complex OLAP query is often difficult to express and difficult to optimize using standard relational operators – ex. compute the total sales broken down by all possible combinations of attributes prod, month, state – Tight coupling of the GROUPBY and aggregation clause is one reason of the problem MD-Join decouples the grouping and the aggregation clauses in query expressions, in order to achieve great flexibility and simplicity Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 32/38 MD-Join  The MD-Join operator MD (B, R, l, ) defines a relation with  B: base table (contains all combinations of key values representing each group on which aggregations have to be computed)  R: fact table (contains all the data that needs to be aggregated)  l: a list of aggregation functions that need to be computed over attributes   : is a set of conditions involving the attributes of B and R Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 33/38 MD-Join  MD (B, R, sum(price), Prod=R.Prod & Month = R.Month) Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 34/38 Extending Pig Latin for Analytical Processing RDF  Generating Fact Table (GFD)  fact_dataset = LOAD ‘input.rdf’ USING GFD (…);  Generating Base Table (GBD)  base_dataset = LOAD ‘input.rdf’ USING GBD(…);  Results of GFD & GBD are stored in the same MDJ.rdf file  MD-Join  output_dataset = LOAD ‘MDJ.rdf’ USING MDJ(…); Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 35/38 Experiment  DBLP  BSBM Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 36/38 Experimental Result Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 37/38 Summary  데이터 분석 도메인에서 MR 기반의 접근이 활용되고 있음  MR이 too low-level 이라 Pig 같은 도구들이 많이 사용됨  대용량 RDF 분석에도 MR 및 Pig의 활용이 시도되고 있음  RDF 모델 특성 때문에 기존 도구들에 약간의 수정이 필요  Pig 기반의 RDF 처리는 RAPID, RAPID+ 가 유이한 솔루션  RDF에 대해 Pig 보다는 좋은 성능을 보이고 있음 Center for E-Business Technology Copyright  2010 by CEBT IDS Lab. Seminar – 38/38

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Document