Institute of Software, Chinese Academy of Sciences
Renmin University of China
Cloud-based Data Management: Challenges & Opportunities
Jiaheng Lu
Renmin University of China
2009-08-25
1
Research experience and interests
National University of Singapore PhD
• XML query processing and XML keyword search
University of California, Irvine Postdoc
• Approximate string processing
• Data integration and data cleaning
Renmin University of China
• Cloud data management
• XML data management
2
Outline
Motivation: cloud data management
Database Future and Challenges:
• Large-scale Data management & transaction
processing
• Cloud-based data indexing and query optimization
Recent research work:
• An efficient multi-dimensional index for cloud
data management
• CIKM Workshop CloudDB 2009
3
Motivation: Internet Chatter
4
BLOG Wisdom
“If you want vast, on-demand scalability, you
need a non-relational database.” Since
scalability requirements:
• Can change very quickly and,
• Can grow very rapidly.
Difficult to manage with a single in-house
RDBMS server.
RDBMSs scale well only up to a point:
• They scale well when limited to a single node.
• Scaling across multiple server nodes is overwhelmingly
complex.
5
Current State
Most enterprise solutions are based on
RDBMS technology.
Significant Operational Challenges:
• Provisioning for Peak Demand
• Resource under-utilization
• Capacity planning: too many variables
• Storage management: a massive challenge
• System upgrades: extremely time-consuming
6
Internet Search Data Analytics: A Case Study
Data analytics:
• Parsed web logs ingested into an RDBMS store.
• Hourly and Daily summarization for custom reporting.
Operational nightmare:
• Maintaining live reporting system ON at all costs and at all
times.
• Timely completion of hourly summarization.
• Constant tension between Ad-hoc workload versus
reporting workload.
• Data-driven feedback to live products.
• Temporal depth of detailed data
7
Internet Search Data Analytics: A Case Study
Various solutions explored:
• Data Warehousing appliance for fast summarization.
• Parallel RDBMS technology for fast ad-hoc queries.
• Business Intelligence Products (Data Cubes) for fast and
intuitive reporting and analysis.
None of the solutions was completely satisfactory:
• Plans to migrate low-level data to file-based system to
overcome Database scalability bottlenecks
8
Paradigm Shift in Computing
9
WEB is replacing the Desktop
10
What is Cloud Computing?
Old idea: Software as a service (SaaS)
• Def: delivering applications over the internet
Recently: “[Hardware, infrastructure,
Platform] as a service”
• Poorly defined so we avoid all “X as a service”
Utility Computing: pay-as-you-go computing
• Illusion of infinite resources
• No up-front cost
• Fine-grained billing (e.g. hourly)
11
Why Now?
Experience with very large datacenters
• Unprecedented economies of scale
Other factors
• Pervasive broadband internet
• Pay-as-you-go billing model
12
Cloud Computing Spectrum
Instruction Set VM (Amazon EC2, 3Tera)
Framework VM
• Google AppEngine, Force.com
13
Cloud Killer Apps
Mobile and web applications
Extensions of desktop software
• Matlab, Mathematica
Batch processing/MapReduce
14
Economics of Cloud Users
Pay by use instead of provisioning for peak
15
Economics of Cloud Users
Risk of over-provisioning: underutilization
16
Economics of Cloud Users
Heavy penalty for under-provisioning
17
Economics of Cloud Providers
5-7X economies of scale [Hamilton 2008]
Extra benefits
• Amazon: utilize off-peak capacity
• Microsoft: sell .NET tools
• Google: reuse existing infrastructure
18
Engineering Definition
Providing services on virtual
machines allocated on top of a large
physical machine pool.
19
Business Definition
A method to address scalability and
availability concerns for large scale
applications.
20
Data Management in the Cloud?
21
Cloud Computing Implications on DBMSs
Where do Databases fit in this paradigm?
Generational reality:
• Animoto.com
• Started with 50 servers on Amazon EC2
• Growth of 25,000 users/hour
• Need to scale to 3,500 servers in 2 days.
• Many similar stories:
• RightScale
• Joyent
• …
22
Clouded Data?
Reality Number Ⅰ:
• Unlimited processing assumption
• Interactive page views:
• Served by issuing a large number of SQL queries against MySQL
• Still expected to deliver sub-millisecond object retrieval
Reality Number Ⅱ:
• Why can’t the database tier be replicated in the same
way as the Web Server and App Server can?
→These are the major challenges for Data
Management in the cloud.
23
The Vision
R&D Challenges at the macro level:
• Where and how does the DBMS fit into this model.
R&D Challenges at micro level:
• Specific technology components that must be
developed to enable the migration of enterprise
data into the clouds.
24
Data and Networks: Attempt Ⅰ
Distributed Database (1980s):
• Idealized view: unified access to distributed data
• Prohibitively expensive: global synchronization
Remained a laboratory prototype:
• Associated technology widely in-use: 2PC
25
Data and Networks: Attempt Ⅱ
26
Data and Networks: Pragmatics
27
Database on S3: SIGMOD’08
Amazon’s Simple Storage Service (S3):
• Updates may not preserve initiation order
• No “force” writes
• Eventual guarantee
Proposed solution:
• Pending Update Queue
• Checkpoint protocol to ensure consistent ordering
• ACID: only Atomicity + Durability
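The pending-update-queue idea above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the protocol from the SIGMOD’08 paper: the names `Update`, `PendingUpdateQueue`, and `checkpoint` are hypothetical, the "store" is a plain dictionary standing in for S3, and ordering relies on an assumed client-assigned sequence number.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Update:
    seq: int                          # assumed client-assigned sequence number
    key: str = field(compare=False)
    value: str = field(compare=False)


class PendingUpdateQueue:
    """Buffers updates that may arrive out of their initiation order."""

    def __init__(self):
        self._pending = []

    def append(self, update):
        self._pending.append(update)

    def drain(self):
        # Hand back the buffered updates sorted by sequence number and
        # leave the queue empty for the next checkpoint round.
        batch, self._pending = sorted(self._pending), []
        return batch


def checkpoint(queue, store):
    """Apply all pending updates to the store in sequence order."""
    for update in queue.drain():
        store[update.key] = update.value
    return store


queue = PendingUpdateQueue()
queue.append(Update(2, "x", "second write"))   # arrives first
queue.append(Update(1, "x", "first write"))    # arrives late
store = checkpoint(queue, {})
```

Because the checkpoint sorts before applying, the later write wins even though it reached the queue first.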
28
Unbundling Txns in the Cloud
Research results:
• CIDR’09 proposal to unbundle Transaction
Management for Cloud Infrastructures
• Attempts to refit the DBMS engine to cloud
storage and computing
29
Analytical Processing
30
Architectural and System Impacts
Current state:
• MapReduce Paradigm for data analysis
What is missing:
• Auxiliary structures and indexes for associative access to
data (i.e., attribute-based access)
• Caveat: inherent inconsistency and approximation
Future projection:
• Eventual merger of databases (ODSs) and data
warehouses as we have learned to use and implement
them.
31
Underlying Principles: CIDR’2009
Business data may not always reflect the state
of the world or the business:
• Inherent lack of perfect information
Secondary data need not be updated with
primary data:
• Inherent latency
Transactions/Events may temporarily violate
integrity constraints:
• Referential integrity may need to be compromised
32
Data Security & Privacy
Data privacy remains a show-stopper in the
context of database outsourcing.
Encryption-based solutions are too expensive
and are projected to be so in the foreseeable
future:
• Private Information Retrieval (Sion’2008)
Other approaches:
• Information-theoretic approaches that use data partitioning for security (Emekci’2007)
• Hardware-based solution for information security
33
Self management and self tuning
in cloud-based data management
Self management and self tuning
Query optimization on thousands of nodes
34
Remarks
Data Management for Cloud Computing
poses a fundamental challenge to database
researchers:
• Scalability
• Reliability
• Data Consistency
Radically different approaches and solutions
are warranted to overcome this challenge:
• Need to understand the nature of new applications
35
References
Life Beyond Distributed Transactions: An Apostate’s Opinion. P. Helland, CIDR’07
Building a Database on S3. M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, SIGMOD’08
Unbundling Transaction Services in the Cloud. D. Lomet, A. Fekete, G. Weikum, M. Zwilling, CIDR’09
Principles for Inconsistency. S. Finkelstein, R. Brendle, D. Jacobs, CIDR’09
VLDB Database School (China) 2009
http://www.sei.ecnu.edu.cn/~vldbschool2009/VLDBSchool2009English.htm
36
An Efficient Multi-Dimensional Index
for Cloud Data Management
CIKM workshop CloudDB09
37
Outline
INTRODUCTION
MULTI-DIMENSIONAL INDEX WITH
KDTREE AND RTREE
Extended Nodes partition
• Node partition
• Cost Estimation Strategy
EVALUATION
38
Cloud Computing
Google File System
Yahoo PNUTS
39
Distributed Cloud base?
• BigTable
How to query on other attributes besides primary key?
• HBase
40
Distributed Index: Single Dimension?
S. Wu and K.-L. Wu, “An indexing framework for efficient
retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32,
pp.75–82, 2009.
H.-C. Yang and D. S. Parker, “Traverse: Simplified
indexing on large map-reduce-merge clusters,” in
Proceedings of DASFAA 2009, Brisbane, Australia, April
2009, pp. 308–322.
M. K. Aguilera, W. Golab, and M. A. Shah, “A practical
scalable distributed b-tree,” in Proceedings of VLDB’08,
Auckland, New Zealand, August 2008, pp. 598–609.
41
Framework of Request Processing in
Cloud
43
R-Tree
An R-tree is a tree data structure that is
similar to a B-tree, but is used for spatial
access methods.
44
KD-Tree
A kd-tree (short for k-dimensional tree) is a
space-partitioning data structure for
organizing points in a k-dimensional space.
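As an illustration of the definition above, here is a minimal kd-tree sketch in Python with a median split and an axis-aligned range search. It is a textbook-style example, not the index used in the talk; the names `KDNode`, `build_kdtree`, and `range_search` are chosen for this sketch.

```python
from typing import NamedTuple, Optional


class KDNode(NamedTuple):
    point: tuple                  # the point stored at this node
    axis: int                     # dimension used to split at this level
    left: Optional["KDNode"]      # points with a smaller-or-equal coordinate
    right: Optional["KDNode"]     # points with a larger-or-equal coordinate


def build_kdtree(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        ordered[mid],
        axis,
        build_kdtree(ordered[:mid], depth + 1),
        build_kdtree(ordered[mid + 1:], depth + 1),
    )


def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return []
    found = []
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    a = node.axis
    if lo[a] <= node.point[a]:        # the query box reaches into the left half
        found += range_search(node.left, lo, hi)
    if node.point[a] <= hi[a]:        # the query box reaches into the right half
        found += range_search(node.right, lo, hi)
    return found


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
hits = sorted(range_search(tree, lo=(3, 1), hi=(8, 5)))
```

The search skips a subtree whenever the query box lies entirely on the other side of the splitting coordinate, which is what makes the structure useful for multi-attribute range queries.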
45
R-Tree & KD-Tree: RKDTree
[Figure: a master node and slave nodes, annotated with two-attribute ranges. Ranges shown, in order: 0~2000, 500~1200; 800~3500, 300~1300; 6300~7000, 599~1400; 2000~40000, 3400~8900; 6800~9000, 3400~8900]
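One way to read the master/slave ranges above is that the master keeps a small index of range cubes and forwards a query only to the slaves whose cube overlaps it. The sketch below illustrates that reading and is not the authors' implementation. The slave identifiers and the pairing of each range with a particular slave are assumptions; the range values are copied from the slide as they appear.

```python
def overlaps(cube, query):
    """True if two boxes, given as per-attribute (lo, hi) pairs, intersect."""
    return all(
        c_lo <= q_hi and q_lo <= c_hi
        for (c_lo, c_hi), (q_lo, q_hi) in zip(cube, query)
    )


def route(master_index, query):
    """Return the ids of the slaves whose cube overlaps the query box."""
    return [slave for slave, cube in master_index if overlaps(cube, query)]


# Hypothetical slave ids paired with the ranges listed on the slide.
master_index = [
    ("slave-1", ((0, 2000), (500, 1200))),
    ("slave-2", ((800, 3500), (300, 1300))),
    ("slave-3", ((6300, 7000), (599, 1400))),
    ("slave-4", ((2000, 40000), (3400, 8900))),
    ("slave-5", ((6800, 9000), (3400, 8900))),
]

targets = route(master_index, ((1000, 1500), (600, 1000)))
```

A query that overlaps no cube is forwarded to no slave, which is where the saving over a full table scan comes from.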
46
Nodes partition for data summary
Random cutting: Pick several random values on the attribute and
cut at those points. The random method may give very good
performance, but it may also give poor performance.
Equal cutting: Cut the attribute into several equal intervals. This
method is relatively stable since no extreme case can occur.
Clustering-based cutting: Cluster the values of the attribute and
cut between clusters. This method can be expected to perform
better, but its time cost is also noticeably higher: the time
complexity of a clustering algorithm is typically O(n log n) or
higher.
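The three cutting strategies above can be sketched for a single attribute as follows. This is an illustrative sketch, not the code behind the experiments: the function names are hypothetical, and the clustering variant uses a simple largest-gap split as a stand-in for a real clustering algorithm such as k-means.

```python
import random


def random_cuts(values, k, seed=None):
    """Pick k random cut points between the smallest and largest value."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    return sorted(rng.uniform(lo, hi) for _ in range(k))


def equal_cuts(values, k):
    """Place k cut points that split the value range into k + 1 equal intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k + 1)
    return [lo + step * i for i in range(1, k + 1)]


def cluster_cuts(values, k):
    """Cut in the middle of the k widest gaps between sorted distinct values.

    Stand-in for clustering-based cutting; sorting makes it O(n log n).
    """
    xs = sorted(set(values))
    widest = sorted(
        range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i], reverse=True
    )[:k]
    return sorted((xs[i] + xs[i + 1]) / 2 for i in widest)


data = [1, 2, 3, 50, 51, 52, 100, 101]
equal = equal_cuts([0, 100], 3)
clustered = cluster_cuts(data, 2)
rand = random_cuts(data, 3, seed=7)
```

On the sample data the gap-based cuts fall between the three visible groups of values, while equal cutting ignores the distribution entirely.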
48
Nodes partition
Random cutting
Equal cutting
Clustering-based cutting
49
50
Dynamic maintenance of Indexes
Update of node cube:
• Why? The data distribution in the node cube may
have changed “greatly”, leaving the cube
sparse or highly uneven
• How? Reorganize the node partition
• When? A two-phase approach
• After each update, compute the minimal ΔT before the next
update
• When ΔT expires, check whether an update is needed
51
Dynamic maintenance of Indexes
Basic idea: benefit > cost
The volume of a node cube is defined as the
number of distinct record combinations that can
fall inside the cube. It can be calculated as the
product of the lengths of all the cube's
intervals. We denote the volume of a cube by v.
For the cube {[1, 11], [2, 5]}, the volume is
(11-1)*(5-2) = 30.
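The volume definition above translates directly into a few lines of Python. The function name `cube_volume` and the representation of a cube as a list of (lo, hi) pairs are assumptions made for this sketch.

```python
from math import prod


def cube_volume(cube):
    """Volume of a node cube: the product of its interval lengths."""
    return prod(hi - lo for lo, hi in cube)


v = cube_volume([(1, 11), (2, 5)])   # the slide's example cube
```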
52
Dynamic maintenance of Indexes
Assumption:
• The amount of queries forwarded to each slave
node is proportional to the total volume of all the
node cubes of the slave node.
53
Dynamic maintenance of Indexes
benefit = (Δv/v) * nq * ΔT
• Δv: decrement of volume after update
• nq: number of queries this node must process
before update.
cost = mt/qt
• mt: time cost of last update
• qt: time needed for processing one query
benefit > cost => ΔT > (mt * v)/(qt * Δv * nq)
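The break-even condition above can be checked numerically. The sketch below follows the slide's formulas; the function names and the sample numbers are hypothetical, and nq is treated here as a number of queries per unit of time so that benefit and cost are both counted in queries (an assumption, since the slide does not state units).

```python
import math


def benefit(delta_v, v, nq, delta_t):
    """Queries saved over delta_t if the cube volume shrinks by delta_v."""
    return (delta_v / v) * nq * delta_t


def cost(mt, qt):
    """Update cost expressed as the number of queries it displaces."""
    return mt / qt


def min_delta_t(mt, qt, v, delta_v, nq):
    """Smallest interval beyond which benefit exceeds cost."""
    if delta_v <= 0 or nq <= 0:
        return math.inf           # no volume reduction or no load: never pays off
    return (mt * v) / (qt * delta_v * nq)


# Hypothetical numbers: last update took 10 s, one query takes 0.5 s,
# the cube volume is 30 and would shrink by 6, with 4 queries per second.
threshold = min_delta_t(mt=10.0, qt=0.5, v=30, delta_v=6, nq=4)
pays_off_at_30 = benefit(6, 30, 4, 30) > cost(10.0, 0.5)
pays_off_at_20 = benefit(6, 30, 4, 20) > cost(10.0, 0.5)
```

With these numbers the threshold is 25 time units, so an update scheduled after 30 units pays off while one after 20 does not.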
54
Dynamic maintenance of Indexes
After ΔT expires, check whether an update is
needed. This check involves the following:
• Record update frequency
• Expected benefit ratio
• Performance requirement
We leave this as future work.
55
Experimental Setup
6 machines
• 1 master
• 5 slaves : 100~1000 nodes
Each machine had a 2.33 GHz Intel Core2
Quad CPU, 4 GB of main memory, and a
320 GB disk.
Machines ran Ubuntu 9.04 Server OS.
56
Point Query Experiment Results
[Figure: time cost (ms, axis 0 to 9000) versus number of nodes (100 to 1000) for Scan Table, RKDTree, NBRKDTree(Random), NBRKDTree(Equal), and NBRKDTree(KMeans); plotted values are not recoverable from the transcript]
57
Range Query Experiment Results
Result Cover Rate: one ten thousandth
[Figure: time cost (ms, axis 0 to 10000) versus number of nodes (100 to 1000) for Scan Table, RKDTree, NBRKDTree(Random), NBRKDTree(Equal), and NBRKDTree(KMeans); plotted values are not recoverable from the transcript]
58
Conclusions
In this paper we presented a series of approaches for
building an efficient multi-dimensional index on a cloud
platform.
We used a combination of R-tree and KD-tree to support
the index structure.
We developed the node partition technique to reduce
query processing cost on the cloud platform.
To maintain the efficiency of the index, we proposed a
cost-estimation-based approach for index updates.
59
Future work
Better node partition algorithms
Improve the estimation-based approach
Consider multiple replicas of data
60
Thank you! Questions and discussion are welcome.
61
Backup(1)
Result Cover Rate: one thousandth
1‰~2‰
[Figure: chart with six series labeled Series1 to Series6; vertical axis 0 to 300000, horizontal axis 1 to 11; axis names are not recoverable from the transcript]
62
Backup(2)
Result Cover Rate: one thousandth
4‰~5‰
[Figure: chart with six series labeled Series1 to Series6; vertical axis 0 to 250000, horizontal axis 1 to 10; axis names are not recoverable from the transcript]
63