Download WBDB2013BigBench-A_BigBench_Implementation_i

Transcript
MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG
(c) Tilmann Rabl - msrg.org
10/10/2013
[Figure: BigBench data model. Entities: Item, Marketprice, Sales, Web Page, Customer, Web Log, Reviews. Regions: Structured Data, Semi-Structured Data, Unstructured Data. Legend: Adapted TPC-DS; BigBench Specific.]
7/19/2013
[Figure: review text generation. Phases: Offline Preprocessing, Online Data Generation. Boxes: Real Reviews, Categorization, Tokenization, Generalization, Markov Chain Input, Parameter Generation (PDGF), Product Customization, Text Generation, Generated Reviews.]
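The labels in this figure point to a Markov-chain text generator that is prepared offline from real reviews and sampled online. A minimal sketch of that idea follows, assuming a first-order, word-level chain. The function names `build_chain` and `generate`, the sample reviews, and the seed are hypothetical illustrations and are not taken from BigBench or PDGF.

```python
import random
from collections import defaultdict


def build_chain(reviews):
    """Offline step: tokenize each review and record which word follows which."""
    chain = defaultdict(list)
    for review in reviews:
        tokens = review.split()
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words, seed):
    """Online step: walk the chain from `start`; the seed makes runs repeatable."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


# Hypothetical input, standing in for a corpus of real reviews.
sample_reviews = [
    "this camera takes great pictures",
    "this phone takes great videos",
]
chain = build_chain(sample_reviews)
text = generate(chain, "this", max_words=5, seed=42)
```

Seeding the generator per review is one way to keep generated text reproducible across runs, which matters for a benchmark; whether the actual generator does this is not stated on the slide.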
Data Sources          Number of Queries   Percentage
Structured            18                  60%
Semi-structured       7                   23%
Un-structured         5                   17%

Analytic techniques   Number of Queries   Percentage
Statistics analysis   6                   20%
Data mining           17                  57%
Reporting             8                   27%
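The percentages in both tables can be reproduced with a short calculation. This sketch assumes every percentage is taken over 30 queries, which is the sum of the data-source counts; the helper name `percentages` is hypothetical.

```python
TOTAL_QUERIES = 30  # assumption: sum of the data-source counts (18 + 7 + 5)

data_sources = {"Structured": 18, "Semi-structured": 7, "Un-structured": 5}
techniques = {"Statistics analysis": 6, "Data mining": 17, "Reporting": 8}


def percentages(counts, total):
    """Return each count as a whole-number percentage of `total`."""
    return {name: round(100 * n / total) for name, n in counts.items()}


source_pct = percentages(data_sources, TOTAL_QUERIES)
technique_pct = percentages(techniques, TOTAL_QUERIES)
technique_total = sum(techniques.values())
```

Under that assumption the computed values match the tables. The analytic-technique counts add up to 31 rather than 30, which would be consistent with one query being counted under two techniques, although the slide does not say so.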
Query Types   Number of Queries   Percentage
Pure Hive     14                  47%
Mahout        5                   17%
OpenNLP       4                   13%
Custom MR     7                   23%
ADD FILE q1_mapper.py;
ADD FILE q1_reducer.py;

-- Find the most frequent ones
SELECT pid1, pid2, COUNT (*) AS cnt
FROM (
  -- Make items basket
  FROM (
    -- Joining two tables
    FROM (
      SELECT s.ss_ticket_number AS oid, s.ss_item_sk AS pid
      FROM store_sales s
      INNER JOIN item i ON s.ss_item_sk = i.i_item_sk
      WHERE i.i_category_id in (1, 4, 6) and s.ss_store_sk in (10, 20, 33, 40, 50)
    ) temp_join
    MAP temp_join.oid, temp_join.pid
    USING 'python q1_mapper.py'
    AS oid, pid
    CLUSTER BY oid
  ) map_output
  REDUCE map_output.oid, map_output.pid
  USING 'python q1_reducer.py'
  AS (pid1 BIGINT, pid2 BIGINT)
) temp_basket
GROUP BY pid1, pid2
HAVING COUNT (pid1) > 49
ORDER BY pid1, cnt, pid2;

Mapper (q1_mapper.py)

import sys

if __name__ == "__main__":
    # Identity mapper: pass each (oid, pid) row through unchanged.
    for line in sys.stdin:
        key, val = line.strip().split("\t")
        print("%s\t%s" % (key, val))

Reducer (q1_reducer.py)

import sys

def print_permutations(vals):
    # Emit every pair of items in one basket; skip baskets with at most
    # one item or with more than 500 pairs.
    l = len(vals)
    if l <= 1 or l * (l - 1) // 2 > 500:
        return
    vals.sort()
    for i in range(l - 1):
        for j in range(i + 1, l):
            print("%s\t%s" % (vals[i], vals[j]))

if __name__ == "__main__":
    # Rows arrive clustered by oid, so a change of key closes a basket.
    current_key = ''
    vals = []
    for line in sys.stdin:
        key, val = line.strip().split("\t")
        if current_key == '':
            current_key = key
            vals.append(val)
        elif current_key == key:
            vals.append(val)
        else:
            print_permutations(vals)
            vals = []
            current_key = key
            vals.append(val)
    # Flush the last basket.
    print_permutations(vals)
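The combined effect of the query, mapper, and reducer above can be sketched in a few lines of in-memory Python: group items by ticket, emit each pair per basket, then count, filter, and sort. This is an illustration for small inputs only, not part of the benchmark. The function name `frequent_pairs` and its parameters are hypothetical, and it compares item ids as integers, whereas the reducer sorts them as strings.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(rows, threshold, max_pairs_per_basket=500):
    """Count item pairs that share a basket and keep those seen more than
    `threshold` times (the query uses 49)."""
    # Group item ids by order id, like CLUSTER BY oid followed by the reducer loop.
    baskets = {}
    for oid, pid in rows:
        baskets.setdefault(oid, []).append(pid)

    counts = Counter()
    for items in baskets.values():
        n = len(items)
        # Same guard as the reducer: skip tiny and very large baskets.
        if n <= 1 or n * (n - 1) // 2 > max_pairs_per_basket:
            continue
        counts.update(combinations(sorted(items), 2))

    kept = [(p1, p2, c) for (p1, p2), c in counts.items() if c > threshold]
    # Mirror ORDER BY pid1, cnt, pid2.
    return sorted(kept, key=lambda r: (r[0], r[2], r[1]))


# Hypothetical (oid, pid) rows standing in for the joined store_sales data.
rows = [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 20)]
result = frequent_pairs(rows, threshold=1)
```

With these sample rows only the pair (10, 20) occurs in more than one basket, so it is the only row that survives a threshold of 1.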