MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG
(c) Tilmann Rabl - msrg.org 10/10/2013

[Figure labels: Structured Data, Semi-Structured Data, Unstructured Data. Entities: Item, Marketprice, Sales, Web Page, Reviews, Customer, Web Log. Legend: Adapted TPC-DS, BigBench Specific.]

[Figure labels: Offline Preprocessing, Online Data Generation. Elements: Real Reviews, Categorization, Tokenization, Generalization, Markov Chain Input, Parameter Generation (PDGF), Product Customization, Text Generation, Generated Reviews.]

Data Sources           Number of Queries   Percentage
Structured             18                  60%
Semi-structured        7                   23%
Un-structured          5                   17%

Analytic techniques    Number of Queries   Percentage
Statistics analysis    6                   20%
Data mining            17                  57%
Reporting              8                   27%

Query Types            Number of Queries   Percentage
Pure Hive              14                  47%
Mahout                 5                   17%
OpenNLP                4                   13%
Custom MR              7                   23%

    ADD FILE q1_mapper.py;
    ADD FILE q1_reducer.py;

    --Find the most frequent ones
    SELECT pid1, pid2, COUNT (*) AS cnt
    FROM (
      --Make items basket
      FROM (
        -- Joining two tables
        FROM (
          SELECT s.ss_ticket_number AS oid , s.ss_item_sk AS pid
          FROM store_sales s INNER JOIN item i ON s.ss_item_sk = i.i_item_sk
          WHERE i.i_category_id in (1 ,4 ,6) and s.ss_store_sk in (10 , 20, 33, 40, 50)
        ) temp_join
        MAP temp_join.oid, temp_join.pid
        USING 'python q1_mapper.py'
        AS oid, pid
        CLUSTER BY oid
      ) map_output
      REDUCE map_output.oid, map_output.pid
      USING 'python q1_reducer.py'
      AS (pid1 BIGINT, pid2 BIGINT)
    ) temp_basket
    GROUP BY pid1, pid2
    HAVING COUNT (pid1) > 49
    ORDER BY pid1 ,cnt ,pid2;

Mapper

    import sys

    if __name__ == "__main__":
        for line in sys.stdin:
            key, val = line.strip().split("\t")
            print "%s\t%s" % (key, val)

Reducer

    import sys

    def print_permutations(vals):
        l = len(vals)
        if l <= 1 or l*(l-1)/2 > 500:
            return
        vals.sort()
        for i in range(l-1):
            for j in range(i+1,l):
                print "%s\t%s" % (vals[i], vals[j])

    if __name__ == "__main__":
        current_key = ''
        vals = []
        for line in sys.stdin:
            key, val = line.strip().split("\t")
            if current_key == '' :
                current_key = key
                vals.append(val)
            elif current_key == key:
                vals.append(val)
            elif current_key != key:
                print_permutations(vals)
                vals = []
                current_key = key
                vals.append(val)
        print_permutations(vals)
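
The reducer on this slide is written for Python 2 and tracks the current key by hand. As an illustration only, the same pairing step might be sketched in Python 3 as below. The names `basket_pairs`, `reduce_lines` and `MAX_PAIRS` are hypothetical and do not come from the slides. The sketch assumes, as the query's CLUSTER BY oid implies, that lines for one oid arrive next to each other.

```python
from itertools import combinations, groupby

# Same cap as the slide's reducer: baskets yielding more than 500 pairs are skipped.
MAX_PAIRS = 500


def basket_pairs(items, max_pairs=MAX_PAIRS):
    """Return every sorted 2-item combination of one basket.

    Returns an empty list for baskets with fewer than two items or with
    more than max_pairs combinations, mirroring the slide's early return.
    """
    n = len(items)
    if n <= 1 or n * (n - 1) // 2 > max_pairs:
        return []
    # Items stay strings, so the sort is lexicographic, as in the original.
    return list(combinations(sorted(items), 2))


def reduce_lines(lines):
    """Yield 'pid1<TAB>pid2' strings from 'oid<TAB>pid' lines grouped by oid."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges adjacent rows, which matches clustered reducer input.
    for _oid, group in groupby(rows, key=lambda row: row[0]):
        for first, second in basket_pairs([pid for _, pid in group]):
            yield f"{first}\t{second}"


# Small in-memory example: ticket 1 has three items, ticket 2 one, ticket 3 two.
sample = ["1\t20", "1\t10", "1\t30", "2\t40", "3\t50", "3\t40"]
pairs = list(reduce_lines(sample))
print(pairs)
```

To use such a function as a streaming script, one would pass `sys.stdin` to `reduce_lines` and print each yielded line. The surrounding query then counts identical pairs with GROUP BY and keeps those with a count above 49.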