BigBench: Big Data Benchmark Proposal
Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen

BigBench
• End-to-end benchmark
• Based on a product retailer
• Focus on
  - Parallel DBMSs
  - MR engines
• Initial work presented at the 1st WBDB, San Jose
• Collaboration with industry & academia
  - Teradata, University of Toronto, InfoSizing, Oracle
• Full spec at the 3rd WBDB, Xi'an, China

Outline
• Data Model
  - Variety, Volume, Velocity
• Data Generator
  - PDGF for structured data
  - Enhancements: semi-structured & text generation
• Workload specification
  - Main driver: retail big data analytics
  - Covers: data sources, declarative & procedural processing, and machine learning algorithms
• Metrics
• Evaluation
  - Done on Teradata Aster

Data Model
• Big Data is not about size only
• 3 V's
  - Variety: different types of data
  - Volume: huge size
  - Velocity: continuous updates

Data Model (Variety)

Data Model (continued)
• Volume
  - Based on a scale factor
  - Similar to TPC-DS scaling
  - Weblogs & product reviews also scaled
• Velocity
  - Periodic refreshes for all data
  - Different velocities for different areas
    • Vstructured < Vunstructured < Vsemistructured
  - Queries run with refreshes

Data Generator
• "Parallel Data Generation Framework" (PDGF)
  - For the structured part of the model
  - Scale factor similar to TPC-DS
• Extensions to PDGF
  - Web logs
  - Product reviews
• Web logs: retail customers/guests visiting the site
  - Similar to Apache web server logs
  - Coupled with the structured data
• Product reviews: customers and guest users
  - Algorithm based on Markov chains
  - Real data set as sample input
  - Coupled with the structured data

Data Generator TextGen
• Input: real review data
  - Category information
  - Rating
  - Review text
• Initial setup
  - Parse text
    • Produce tokens
    • Correlation between tokens
  - Build a repository by category
• API for data generation
  - Integrated with PDGF
  - Based on category ID
  - (a minimal sketch of the idea is shown below)
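For illustration only, here is a minimal Python sketch of the TextGen idea described above: token-to-token transition statistics are collected per category from real reviews, and synthetic review text is then generated by walking those transitions for a given category ID. All names (ReviewTextGen, train, generate) are hypothetical and do not reflect the actual PDGF/TextGen implementation or its API.

import random
from collections import defaultdict

class ReviewTextGen:
    """Toy Markov-chain review generator: one token-transition table per category.
    Hypothetical sketch; not the actual PDGF/TextGen code."""

    def __init__(self):
        # category_id -> {token -> list of observed follower tokens}
        self.chains = defaultdict(lambda: defaultdict(list))
        # category_id -> tokens seen at the start of a review
        self.starts = defaultdict(list)

    def train(self, category_id, review_text):
        """Parse a real review into tokens and record token-to-token transitions."""
        tokens = review_text.split()
        if not tokens:
            return
        self.starts[category_id].append(tokens[0])
        for prev, nxt in zip(tokens, tokens[1:]):
            self.chains[category_id][prev].append(nxt)

    def generate(self, category_id, max_tokens=50, seed=None):
        """Emit synthetic review text for a category by walking the chain."""
        rng = random.Random(seed)  # seeding keeps output reproducible, as a data generator needs
        chain = self.chains[category_id]
        token = rng.choice(self.starts[category_id])
        out = [token]
        for _ in range(max_tokens - 1):
            followers = chain.get(token)
            if not followers:
                break
            token = rng.choice(followers)
            out.append(token)
        return " ".join(out)

# Example: train on sample reviews of one category, then generate by category ID.
gen = ReviewTextGen()
gen.train(42, "great battery life and fast shipping")
gen.train(42, "battery life could be better but fast delivery")
print(gen.generate(42, seed=7))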
Workload
• Workload queries
  - 30 queries
  - Specified in English
  - No required syntax
• Business functions (adapted from a McKinsey report)
  - Marketing
    • Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences
  - Merchandising
    • Assortment optimization, pricing optimization
  - Operations
    • Performance transparency, product return analysis
  - Supply chain
    • Inventory management
  - Reporting (customers and products)

Workload (continued): Technical Functions
• Data source dimension
  - Structured
  - Semi-structured
  - Unstructured
• Processing type dimension
  - Declarative (SQL, HQL)
  - Procedural
  - Mix of both
• Analytic technique dimension
  - Statistical analysis: correlation analysis, time series, regression
  - Data mining: classification, clustering, association mining, pattern analysis, and text analysis
  - Simple reporting: ad hoc queries not covered above

Workload (continued): Business Categories Query Breakdown

  Business category       Number of queries   Percentage
  Marketing                      18              60%
  Merchandising                   5              17%
  Operations                      4              13%
  Supply chain                    2               7%
  New business models             1               3%

Workload (continued): Technical Dimensions Breakdown

  Query processing type        Number of queries   Percentage
  Declarative                         10              33%
  Procedural                           7              23%
  Declarative + Procedural            13              43%

Technical Dimensions Breakdown (continued)

  Data source          Number of queries   Percentage
  Structured                  18              60%
  Semi-structured              7              23%
  Unstructured                 5              17%

  Analytic technique       Number of queries   Percentage
  Statistical analysis             6              20%
  Data mining                     17              57%
  Reporting                        8              27%

Metrics
• Future work
• Initial thoughts
  - Focus on loading and type of processing
    • MR engines are good at loading
    • DBMSs are good at SQL
  - Metric = (T_D × T_P × T_B × T_L)^(1/4), the fourth root (geometric mean) of the four times
    • T_D: execution time for declarative queries
    • T_P: execution time for procedural queries
    • T_B: execution time for queries mixing both
    • T_L: execution time for loading
  - Applicable to single-stream and multi-stream runs

Evaluation
• BigBench proof of concept
• DBMS
  - Typically, data is loaded into tables
  - Possibly parsing weblogs to get a schema
  - Reviews captured as VARCHAR or BLOB fields
  - Queries run using SQL + UDFs
• MR engine
  - Data can be loaded onto HDFS
  - MR, HQL, Pig Latin can be used
• DBMS & MR engine
  - DBMS with Hadoop connectors
  - Data can be placed and split among both
  - Processing can also be split between the two

Evaluation (continued)
• Teradata Aster
  - MPP database
  - Discovery platform
  - Supports SQL-MR
• System
  - Aster 5.0
  - 8-node cluster
  - 200 GB of data

Evaluation (continued)
• Data generation
  - DSDGen produced the structured part
  - PDGF+ produced the semi-structured and unstructured parts
• Data loaded into tables
  - Weblogs parsed into a Weblogs table
  - Product reviews table
• Queries
  - SQL-MR syntax

Evaluation (continued)
• Example query: product category affinity analysis
  - Computes the probability of browsing products from one category after customers viewed items from another category
  - One form of market basket analysis
• Business case
  - Marketing: cross-selling
• Type of source
  - Structured (web sales)
• Processing type
  - Mix of declarative and procedural
• Analytics type
  - Data mining

Evaluation (continued)

SELECT category_cd1 AS category1_cd,
       category_cd2 AS category2_cd,
       COUNT(*)     AS cnt
FROM basket_generator(
       ON (SELECT i.i_category_id       AS category_cd,
                  s.ws_bill_customer_sk AS customer_id
           FROM web_sales s
           INNER JOIN item i ON s.ws_item_sk = i.i_item_sk)
       PARTITION BY customer_id
       BASKET_ITEM('category_cd')
       ITEM_SET_MAX(500))
GROUP BY 1, 2
ORDER BY 1, 3, 2;

Evaluation (Sample Queries)

Evaluation (Processing Type)

Conclusion
• End-to-end big data benchmark
• Data Model
  - Adds semi-structured and unstructured data
  - Different velocities for structured, semi-structured, and unstructured data
  - Volume based on a scale factor
• Data Generation
  - Unstructured data generated using a Markov chain model
  - PDGF enhanced for semi-structured data and integrated with the unstructured generator
• Workload
  - 30 queries based on the McKinsey report
  - Covers processing types, data sources, and data mining algorithms
• Evaluation
  - Aster SQL-MR

Future Work
• Add late binding to web logs (see the sketch below)
  - No schema/keys upfront
• Finalize
  - Data generator
  - Metric specification
  - Downloadable kit
• More POCs
  - Hadoop ecosystem, e.g. Hive
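As a closing illustration of the "late binding to web logs" item above, the sketch below binds fields to an Apache-style log line only when it is read, instead of requiring a schema and keys at load time. It is a hypothetical Python sketch: the regular expression, the common-log-format layout, and field names such as client_ip are assumptions, not part of the BigBench specification.

import re

# Hypothetical Apache common-log-format pattern; the real BigBench web logs
# may carry different or additional fields.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_weblog_line(line):
    """Bind fields to a raw log line at read time; no schema is imposed at load."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None  # unparseable lines stay opaque

# Example: fields are extracted only when a query touches the line.
sample = '10.0.0.1 - guest [01/Jan/2013:10:00:00 +0000] "GET /item?id=42 HTTP/1.1" 200 512'
record = parse_weblog_line(sample)
if record:
    print(record["url"], record["status"])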