Using Machine Learning Technique to Parallelize Databases:
Where each query answered by a single node
Jozsef Patvarczki, Elke A. Rundensteiner, and Neil T. Heffernan
Abstract
Web-based applications often struggle to scale to higher loads because the database becomes a bottleneck. We propose rule-based data replication middleware that uses multiple database servers for web applications. Knowing each query template in advance allows us to propose better solutions for balancing load across multiple servers in the web-application scenario, above and beyond what is possible for traditional applications. Prior knowledge of all the incoming query templates and the workload gives us the ability to select a table placement in which each query template can be answered by a single database server. Our goal is to minimize the effective response time of the database by determining how to distribute the data effectively across multiple nodes.
Proposed Solution
Instead of relying on theory alone to design the database layout, we need a system that collects empirical data on when the horizontal partitioning (HP), vertical partitioning (VP), de-normalization (DN), and full replication (FR) operators are effective. We have implemented a brute-force search technique to try the different operators, and we then used this empirically measured data to check whether any speed-up occurred. After creating a large data set in which these four operators have been applied to produce different databases, we can employ machine learning to induce rules that help govern the physical design of the database across an arbitrary number of computer nodes. This, in turn, allows the database placement algorithm to converge quickly over time as it trains over a larger set of examples.
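The brute-force search over operator assignments can be sketched as follows. This is a minimal illustration, not the system's implementation: the operator names come from the text, but the table names, sizes, and the toy cost model are hypothetical stand-ins for actually running and timing the query workload against each candidate layout.

```python
from itertools import product

# The four layout operators from the text: full replication (FR), horizontal
# partitioning (HP), vertical partitioning (VP), and de-normalization (DN).
OPERATORS = ["FR", "HP", "VP", "DN"]

def brute_force_layout(tables, measure_cost):
    """Try every per-table operator assignment and keep the cheapest layout.

    `measure_cost` stands in for executing the workload against a candidate
    layout and measuring it empirically; here it is any callable mapping a
    {table: operator} dict to a numeric cost.
    """
    best_layout, best_cost = None, float("inf")
    for assignment in product(OPERATORS, repeat=len(tables)):
        layout = dict(zip(tables, assignment))
        cost = measure_cost(layout)
        if cost < best_cost:
            best_layout, best_cost = layout, cost
    return best_layout, best_cost

# Toy cost model (hypothetical): HP pays off for large tables,
# while small tables are served best by any non-HP operator.
sizes = {"students": 10_000, "answers": 5_000_000}

def toy_cost(layout):
    return sum(sizes[t] * (0.2 if (op == "HP") == (sizes[t] > 100_000) else 1.0)
               for t, op in layout.items())

best, cost = brute_force_layout(list(sizes), toy_cost)
```

The search is exponential in the number of tables, which is exactly why the poster proposes learning rules from these measured runs instead of re-searching from scratch.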
• We characterize the problem as an AI search over layouts;
• Our hypothesis is that we can learn rules that capture human-like expertise and use these rules to better partition a given database;
• With the help of the learned rules, we can fit layout characteristics, and layout generation becomes progressively faster;
• We perform the layout and empirically measure its cost, since we want to know what is effective and under what conditions;
• We explore multiple ways to represent this knowledge (e.g., decision trees);
• We apply cross-validation to prevent overfitting our rules to the training data.
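The rule-learning and cross-validation bullets can be sketched with a one-level decision tree (a decision stump) evaluated by k-fold cross-validation. Everything below is a synthetic stand-in: the table sizes, the labels, and the hidden "HP above 100k rows" rule are hypothetical, standing in for the empirically measured layout experiments described above.

```python
import random

# Synthetic "experiments": (table_row_count, best_operator) pairs, where the
# hidden rule to recover is "HP for tables over 100k rows, FR otherwise".
random.seed(0)
data = [(n, "HP" if n > 100_000 else "FR")
        for n in (random.randrange(1, 1_000_000) for _ in range(200))]

def fit_stump(train):
    """Pick the size threshold that best separates HP from FR on `train`."""
    best_t, best_acc = None, -1.0
    for t, _ in train:
        acc = sum(("HP" if n > t else "FR") == label
                  for n, label in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def cross_validate(data, k=5):
    """k-fold cross-validation: train on k-1 folds, score on the held-out fold."""
    accs = []
    for i in range(k):
        test = data[i::k]
        train = [d for j, d in enumerate(data) if j % k != i]
        t = fit_stump(train)
        accs.append(sum(("HP" if n > t else "FR") == label
                        for n, label in test) / len(test))
    return sum(accs) / k

mean_acc = cross_validate(data)
```

Scoring only on held-out folds is what keeps the induced rule honest: a threshold that merely memorized the training runs would fail on the unseen fold.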
Problem Statement
• A characteristic of web applications such as our ASSISTment Intelligent Tutoring System (www.ASSISTment.org) is that we know all the incoming query templates beforehand, as users typically interact with the system through a web interface [1];
• Prior knowledge of all the incoming query templates and the query workload gives us the ability to select an appropriate table placement;
• We are given a query workload that describes all the query templates for a web-based application and the percentage of queries of each template that the application typically processes;
• Given this workload and the optimization goal, we determine the best possible placement using the four operators (FR, HP, VP, and DN) and an arbitrary number of database servers, answering each query with a single node;
• Our optimization goal is to maximize the total system throughput [2].
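The single-node requirement above can be made concrete with a small feasibility check: a placement is acceptable only if, for every template, some one node holds every table the template touches. The templates, tables, and node IDs below are illustrative, and for a partitioned table "holds the table" would really mean holding the partition the query needs.

```python
# Workload: query template -> fraction of traffic (illustrative values).
workload = {
    "SELECT * FROM students WHERE id = ?": 0.6,
    "SELECT * FROM answers JOIN students ON ...": 0.4,
}
# Tables each template touches (illustrative).
tables_used = {
    "SELECT * FROM students WHERE id = ?": {"students"},
    "SELECT * FROM answers JOIN students ON ...": {"students", "answers"},
}
# placement[table] = set of nodes holding a copy (or needed partition) of it.
placement = {"students": {0, 1}, "answers": {1}}

def single_node_feasible(template):
    """True if at least one node holds every table the template touches."""
    node_sets = [placement[t] for t in tables_used[template]]
    return bool(set.intersection(*node_sets))

# Fraction of the workload answerable by a single node under this placement.
feasible_fraction = sum(share for tpl, share in workload.items()
                        if single_node_feasible(tpl))
```

Here the join template is feasible only because node 1 holds both `students` and `answers`; replicating `students` to every node is what buys that freedom, which is the intuition behind the FR operator.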
• Core parts of the system:
(a) A data placement algorithm that converges quickly over time as it trains over a set of examples and machine-learned rules;
(b) Parameterized, machine-learned rules that help govern the physical design of the database across an arbitrary number of computer nodes;
(c) Shared-nothing data replication middleware for web-based applications that can be built easily from low-cost existing resources to realize database scaling without expensive storage area networks.
References
1. Tobias Groothuyse, Swaminathan Sivasubramanian, and Guillaume Pierre, "GlobeTP: Template-Based Database Replication for Scalable Web Applications", WWW '07, Banff, Canada, pp. 301-310.
2. Jozsef Patvarczki, Murali Mani, and Neil Heffernan, "Performance Driven Database Design for Scalable Web Applications", in J. Grundspenkis, T. Morzy & G. Vossen (Eds.), Advances in Databases and Information Systems, Springer-Verlag: Berlin, ISBN 978-3-642-03972-0, pp. 43-58.
Contact: Jozsef Patvarczki, [email protected]