Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore
Background
benchmarks help research and development
--- the dominant database benchmark is TPC
SIGMOD Conference 2011
research track: 87 papers, 17 use TPC (20%)
industry track: 14 papers, 6 use TPC (43%)
Problem :
a few TPC benchmarks
but many, many applications
TPC becoming irrelevant?
Vision
a paradigm shift in database benchmark development
from:
• top-down
• committee consensus
• domain-specific
• package (data generator + queries)
to:
• bottom-up
• community collaboration
• application-specific
• tools (dataset scaling)
synthetically scale up/down application data
application already has queries
Challenge
Dataset Scaling Problem :
Given a set of relational tables D and a scale factor s,
generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
s>1
why: scalability testing
difficulty: copying doesn’t work (e.g. social network data)
s<1
why: application testing
difficulty: sampling not straightforward (similar to web crawling)
s=1
why: privacy/proprietary reasons
difficulty: encryption is risky
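Why copying fails for s>1 can be seen on a toy friendship graph. The sketch below is only an illustration under stated assumptions: the edge list, the integer user ids, and the helper names `scale_by_copying` and `count_components` are hypothetical and not taken from UpSizeR. Copying the data s times does give s times as many rows, but the result is s disconnected networks rather than one larger network.

```python
def scale_by_copying(edges, s):
    """Naively scale an undirected edge list by making s disjoint copies.

    Assumes user ids are the integers 0..n-1 (a hypothetical toy encoding).
    """
    n = 1 + max(max(u, v) for u, v in edges)
    return [(u + k * n, v + k * n) for k in range(s) for u, v in edges]


def count_components(edges):
    """Count connected components with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(x) for x in list(parent)})


friends = [(0, 1), (1, 2), (2, 0), (2, 3)]  # one connected toy network
scaled = scale_by_copying(friends, 3)

# Three times the edges, but three separate networks instead of one.
print(len(scaled), count_components(friends), count_components(scaled))
```

Any query that follows friendships (e.g. friends of friends) sees the same neighbourhoods in the copied data as in the original, so the copy does not behave like a network that has actually grown.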
Challenge
Dataset Scaling Problem: “similar to D” means similar by query results
difficulty: data correlation
E.g. database = {photos, owners, comments, tags}
kinds of correlation: inter-column, inter-row, inter-column + inter-row
• foreign keys
• photo dimensions (same camera)
• 2 users comment on each other’s photos (social network)
• age and gender
• user likely to comment on own photos
• gardener likely to tag photos of flowers
• tags used by gardener (“rose”, “bee”, “beetle”)
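One of these correlations, "user likely to comment on own photos", can be expressed as a query whose result should be about the same on D and on a scaled D’. The sketch below is a minimal illustration, assuming a hypothetical two-table schema (`photos`, `comments`) and made-up rows; the function name `own_photo_comment_rate` is also an assumption.

```python
import sqlite3


def own_photo_comment_rate(conn):
    """Fraction of comments whose author also owns the commented photo."""
    row = conn.execute(
        """
        SELECT AVG(CASE WHEN c.user_id = p.owner_id THEN 1.0 ELSE 0.0 END)
        FROM comments AS c
        JOIN photos AS p ON c.photo_id = p.photo_id
        """
    ).fetchone()
    return row[0]


# Hypothetical toy dataset D.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER);
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        photo_id INTEGER REFERENCES photos(photo_id),
        user_id INTEGER
    );
    INSERT INTO photos VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO comments VALUES (1, 1, 10), (2, 1, 20), (3, 2, 10), (4, 3, 10);
    """
)

rate = own_photo_comment_rate(conn)
print(rate)  # 2 of the 4 comments are on the commenter's own photo
```

Running the same function on D and on D’ and comparing the two rates is one possible check of similarity by query results; a generator that fills `comments.user_id` independently of `photos.owner_id` would lose this correlation.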
Challenge
scaling a social network:
1. extract: empirical dataset D → empirical social graph G (use join query)
2. scale by s: G → synthetic social graph G̃ (use graph theory: #edges? #triangles? path lengths?)
3. inject: G̃ → synthetic dataset D̃ (any database theory?)
E.g. how to inject into D̃
* correlation from G̃ indicating X and Y comment on each other’s photos
* correlation between Alice’s birthday and wall posts by her classmates
* correlation among tags used by bird watchers
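The extract step and the graph statistics it feeds can be sketched as follows. This is an illustration only: the schema, the sample rows, the rule that two users are linked when each has commented on the other's photos, and the names `extract_graph` and `count_triangles` are all assumptions, not the method used by UpSizeR.

```python
import sqlite3
from itertools import combinations


def extract_graph(conn):
    """Extract an undirected graph G from D with a join query.

    Assumed rule: X and Y are linked if each commented on the other's photos.
    """
    rows = conn.execute(
        """
        SELECT DISTINCT c.user_id, p.owner_id
        FROM comments AS c
        JOIN photos AS p ON c.photo_id = p.photo_id
        WHERE c.user_id <> p.owner_id
        """
    ).fetchall()
    directed = set(rows)
    return {frozenset((a, b)) for a, b in directed if (b, a) in directed}


def count_triangles(edges):
    """Count triangles by checking every triple of nodes (fine for toy sizes)."""
    adj = {}
    for edge in edges:
        a, b = tuple(edge)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


# Hypothetical toy dataset D: four users, one photo each.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner_id INTEGER);
    CREATE TABLE comments (photo_id INTEGER, user_id INTEGER);
    INSERT INTO photos VALUES (1, 1), (2, 2), (3, 3), (4, 4);
    INSERT INTO comments VALUES
        (2, 1), (1, 2),   -- users 1 and 2 comment on each other's photos
        (3, 2), (2, 3),   -- users 2 and 3
        (3, 1), (1, 3),   -- users 1 and 3
        (1, 4);           -- user 4 comments on user 1's photo, not reciprocated
    """
)

G = extract_graph(conn)
print(len(G), count_triangles(G))
```

Statistics such as the number of edges and triangles are what the scale step would try to grow in a plausible way; the open question on the slide is how to turn the scaled graph back into rows of `photos` and `comments`.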
Challenge
* online social networks are here to stay
* their datasets can be huge
* their datasets have commercial value
where is the database theory?
Attribute Value Correlation Problem for Social Networks :
Suppose a dataset D records data from a social network.
How do the social interactions affect the correlation
among attribute values in D ?
Vision (for the next 25 years):
a paradigm shift from a top-down design of domain-specific
benchmarks by committee consensus to a bottom-up collaborative
development of tools for application-specific dataset scaling
Challenges:
• Dataset Scaling Problem
• Attribute Value Correlation Problem for Social Networks
Payoff:
• commercial value in dataset scaling tools
• new database research areas (social network data, schema design,
vertical/horizontal partition, query optimization, business intelligence, …)
Start:
UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
• single-server version
• Hadoop version