Demo Script
Scaling Out SQL Azure with Database Sharding
Lab version:
2.0.0
Last updated:
5/3/2017
CONTENTS
OVERVIEW
Key Messages
Key Technologies
Time Estimates
SETUP AND CONFIGURATION
Task 1 – Running the Dependency Checker
Task 2 – Configuring the Database Connection String
Task 3 – Creating and Populating Shard Databases with Sample Data
DEMO FLOW
OPENING STATEMENT
STEP-BY-STEP WALKTHROUGH
Load and Partition Data
Review the Sample Application
Query and Insert Partitioned Data
SUMMARY
Overview
This document provides setup documentation, step-by-step instructions, and a written script for showing a demo of SQL Azure. This document
can also serve as a tutorial or walkthrough of the technology. In this demo you will see the basics of inserting and querying data from a sharded
SQL Azure database. For additional demos of the Windows Azure platform, please visit http://www.azure.com.
Note: In order to run through this demo, you must have a SQL Azure developer account. For more information on how to purchase an
account, visit the SQL Azure Portal at http://www.microsoft.com/windowsazure/sqlazure/.
Key Messages
In this demo you will see three key things:
1. Database sharding adds complexity to your application but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance
Key Technologies
This demo uses the following technologies:
1. SQL Azure
2. .NET Framework 4.0
3. Microsoft Visual Studio 2010
4. Windows Azure Tools for Microsoft Visual Studio
Time Estimates

Estimated time for setting up and configuring the demo: 10+ mins (depending on how much sample data you generate)

Estimated time to complete the demo: 20 min
Setup and Configuration
The setup and configuration for this demo involves the following tasks:
1. Run the Dependency Checker
2. Configure the Database Connection String
3. Create and Populate Shard Databases with Sample Data
Task 1 – Running the Dependency Checker
This demo does not have any advanced configuration requirements. You simply need to have the prerequisites installed and have an account
for SQL Azure. For more information on how to purchase an account, visit the SQL Azure Portal at
http://www.microsoft.com/windowsazure/sqlazure/.
The following steps describe how to run the Dependency Checker utility included with the demo to verify that you have the prerequisite
components. You can skip this exercise if you are confident that the prerequisites are properly installed and configured.
1. Open a Windows Explorer window and browse to the demo’s Source\Script folder.
2. Double-click the Dependencies.dep file in this folder to launch the Dependency Checker tool and install any missing prerequisites.
3. If the User Account Control dialog is shown, confirm the action to proceed.
Note: This process may require elevation. The .dep extension is associated with the Dependency Checker tool during its installation. For
additional information about the setup procedure and how to install the Dependency Checker tool, refer to the Setup.docx document in the
Assets folder of the training kit.
Note: As shipped, the connection string template in the configuration file points to a local (default) instance of SQLExpress. To load data into
SQL Azure, you will need to update the connection string in the configuration file to use your SQL Azure account settings.
Task 2 – Configuring the Database Connection String
To use SQL Azure during the demo, you will need to update the connection string used by the application.
1. Open the ServiceConfiguration.cscfg file in the \code\sharding.cloud\ folder and update the connection string in the ShardsDB setting
with your SQL Azure account information.
2. Open the Web.config file in the \code\sharding.web\ folder and update the connection string in the ShardsDB setting with your SQL
Azure account information.
3. Open the App.config file in the \assets\ShardDbLoader\ folder and update the connection string in the ShardsDB setting with your SQL
Azure account information.
Task 3 – Creating and Populating Shard Databases with Sample Data
The sample application in this demo requires its shard data to be stored in SQL Azure. Populating the necessary databases takes time so you
need to ensure that this procedure has completed before you start the demo. Prepare this step well in advance.
The demo assets include an application that will generate random data and populate the shard databases in SQL Azure. You can control the
amount of data produced by specifying how many contacts and orders the program needs to generate. In addition, the shard loader supports
two separate partitioning strategies, specifically by date and by country. Running the program with one partitioning scheme does not overwrite
the data generated with the other, which means that you could have two separate sets of databases, one for each partitioning strategy. You
specify program options on the command line. To view the available options, type SHARDDB /help.
Note: Currently, a user can create only five databases on SQL Azure. For this reason, you will not be able to create both partitioning strategies
(date and country) on SQL Azure at the same time using the same user.
Because this demo script shows how sharding can be performed on SQL Azure using a country partitioning strategy, you should
run the ShardDB command line application with the country partitioning strategy when you run it against SQL Azure.
Figure 1
Shard loader application
To run the application, follow these steps:
1. Open a command prompt and change the current directory to the assets\ShardDbLoader folder of the demo. If you executed the
Configuration Wizard, it should contain a SHARDDB.EXE executable that resulted from the successful compilation of the solution;
otherwise, build the solution in the code folder of the demo installation directory.
2. To create a new set of sample data, run the loader application specifying the /create option and the number of orders and contacts, as
well as the partitioning strategy to use. For example, to generate 5000 orders and 300 contacts using the country partitioning strategy,
run the shard loader with the following command line.
SHARDDB.EXE /create orders:5000 contacts:300 partition:country
3. To delete data generated by this application, you need to run the loader with the /delete option and specify which partitioning strategy
you wish to delete. For example, to delete the data generated by the country partitioning scheme, specify the following command line.
SHARDDB.EXE /delete partition:country
Demo Flow
The following diagram illustrates the high-level flow for this demo and the steps involved:
Figure 2
Demo Flow
Opening Statement
During this demo, we will examine a sample application that was built to highlight some of the fundamentals of building apps that connect to
sharded databases. It’s critical to understand when it’s appropriate to shard your database and that the decision to do so should not be taken
lightly as it adds an additional level of complexity to your application. For certain scenarios the benefits of database scale out can be huge. This
is particularly true for applications that require massive throughput.
There are many different approaches for scaling out your databases. It’s very important to determine the right strategy as it will directly impact
the complexity and performance of your solution. This demo focuses on a scenario where you have a workload for a single application spread
across multiple databases with the need for your application to determine dynamically which database to connect to for a given query and the
ability to combine the result sets. For some scenarios, especially for Independent Software Vendors (ISVs), a much simpler strategy may be to
assign each customer their own database, removing the need to allow for queries across multiple databases.
When using a scale out database strategy you get the added resources of each of the machines processing your workload. Having one 10 GB
database is not the same as having ten 1 GB databases. When you have ten 1GB databases you have distributed your workload over many more
machines which is particularly important for high throughput.
In future versions of SQL Azure we can expect to see more features to help manage the partitioning of your data among multiple
databases and fan-out queries.
In this demo you will specifically see three key things:
1. Database sharding adds complexity to your application but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance
Step-by-Step Walkthrough
This demo consists of the following segments:
1. Load and Partition Data
2. Review the Sample Application
3. Query and Insert Partitioned Data
Load and Partition Data
In this segment, you load sample data into a local SQLExpress instance to show how the partitioning strategy affects the distribution of data in
the shards.
Action
Script
1. Open a command prompt and change
the current directory to the
assets\ShardDbLoader folder of the
demo installation directory.

To start this demo, we first need to
populate the store used by the
application with sample data. To do
this, we will use a console
application that loads random data
into the databases. Each database
contains a copy of the tables used by
the sample application, namely
Contacts, Products,
SalesOrderHeader, and
SalesOrderDetail.

As you will see later in the demo, the
sample application uses SQL Azure
for storage, but to illustrate how the
partitioning strategy affects the
distribution of data in each partition,
we will initially use a local SQL
Server instance to create two
separate sets of shards, each
partitioned with a different strategy.
Developing locally and deploying to
SQL Azure is a pattern we expect to
see for building applications.
2. Execute SHARDDB.EXE specifying a
country partitioning strategy. You can
use the default values for the inserted
contact and order count parameters.
SHARDDB /create partition:country
Note: If you previously executed this
demo script, you will already have
created the partitioned databases. To
delete them, use the /delete
parameter in the loader application
and specify the partitioning strategy
for which to delete data. For example,
SHARDDB /delete partition:country
Note: As shipped, the connection
string template in the configuration file
points to a local (default) instance of
SQLExpress. If you previously
configured the connection string to
load data into SQL Azure, you will
need to restore its original value to
use local storage.

Note that some of the tables contain
reference data and its content is not
partitioned; instead, the tables are
replicated to each shard. This is the
case for both the Contacts and
Products tables. On the other hand,
the data in the SalesOrderHeader
table is partitioned using a
configurable strategy. In addition,
SalesOrderDetail is considered a
child of the SalesOrderHeader
table. Hence, each row in this table
is stored in the shard of its
associated row in the parent table. In
other words, all detail rows for an
order are stored in the same shard
as the order header.
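
The placement rules just described can be sketched in a few lines. This is an illustration only, written in Python for brevity rather than taken from the .NET loader; the shard names, the country-to-shard assignment, and the SalesOrderID column used to link detail rows to their header are assumptions.

```python
from typing import Dict, List

# Hypothetical shard identifiers and country assignment, for illustration only.
ALL_SHARDS = ["COUNTRY_00", "COUNTRY_01"]
SHARD_BY_COUNTRY = {"Argentina": "COUNTRY_00", "Brazil": "COUNTRY_01"}


def place_rows(contacts: List[dict], headers: List[dict],
               details: List[dict]) -> Dict[str, Dict[str, List[dict]]]:
    """Decide which rows each shard stores."""
    placement = {
        shard: {"Contacts": [], "SalesOrderHeader": [], "SalesOrderDetail": []}
        for shard in ALL_SHARDS
    }
    # Reference data: every row is replicated to every shard.
    for shard in ALL_SHARDS:
        placement[shard]["Contacts"].extend(contacts)
    # Partitioned parent rows: exactly one shard each, chosen by the partition key.
    shard_of_order = {}
    for header in headers:
        shard = SHARD_BY_COUNTRY[header["Country"]]
        shard_of_order[header["SalesOrderID"]] = shard
        placement[shard]["SalesOrderHeader"].append(header)
    # Child rows: stored in the shard of their parent order header.
    for detail in details:
        shard = shard_of_order[detail["SalesOrderID"]]
        placement[shard]["SalesOrderDetail"].append(detail)
    return placement
```

Keeping detail rows next to their header means an order can be read or written with a single connection, without joining across shards.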

We will now create our data set.
First, we will partition the rows by
country. The shard loader application
allows us to specify the partitioning
strategy in its command line. We can
also specify the number of contacts
and orders inserted into each shard
but we will use the default values for
now. This will allow us to insert a
limited number of rows into the local
SQL Server instance, enough to
show the distribution of data based
on a partition strategy.
3. Open SQL Server Management
Studio and connect to the local SQL
Server instance.
4. In the Object Explorer, locate the set
of databases that correspond to the
country partitioning scheme. They
should be named COUNTRY_00 to
COUNTRY_0N.

Now that we have created our
sample data, we will use SQL Server
Management Studio to examine the
data inserted by the loader and see
how it has distributed the information
in the shards.

First, we will view the contents for
the SalesOrderHeader table in one
of the databases in the set. You can
see that the Country column shows
that the table only contains rows for
a single country.

Looking at the contents of the same
table in one of the other databases in
this set, we can see that it contains
data for a different country.

Next, we will repeat the loading
process but this time specifying that
the data be partitioned by date or, to
be precise, by quarter.
5. Pick one of the databases in this set,
expand the Tables node, right-click
the SalesOrderHeader table, and
choose Select Top 1000 Rows.
6. Repeat the process to show the
SalesOrderHeader in one of the
other databases in this set.
7. Execute the SHARDDB application
once again, this time specifying a date
partitioning strategy.
SHARDDB /create partition:date
8. Switch back to SQL Server
Management Studio.
9. In the Object Explorer, locate the set
of databases that correspond to the
date partitioning scheme. They should
be named QUARTER_01 to
QUARTER_4.
10. Show the contents of the
SalesOrderHeader table for one of
the databases in this new set.

Once again, we can look at the data
inserted by the loader in SQL Server
Management Studio. As you can
see, it has created a new set of
databases, one for each quarter.

Examining the contents of the
SalesOrderHeader table, we can
see that in this new partitioning
scheme, the table now contains
orders for a single quarter of the
year. Note that in this demo, for
simplicity, the date partitioning
scheme only takes the quarter into
account, so orders from the same
quarter of different years are
allocated to the same shard.
Review the Sample Application
In this segment, you review the sample application and provide a brief description of its implementation.
Action
Script
Screenshot
1. Start Visual Studio 2010 as an
administrator (required to run in the
development fabric).

I will now give you a quick tour of a
sample application used in this demo.
We’ve built a simple Windows Azure
web application designed to highlight
some features of database sharding.
Let’s start Visual Studio 2010 and
open the solution.

The solution contains three projects.
The first one is a sample Web site
that allows us to query and insert
orders into a SQL Azure data store.
Next, we have a cloud service project
that we use to host the site as a Web
role in Windows Azure. Finally, the
third project in the solution contains
the console application that we used
earlier to load and partition our
sample data.

As we discussed previously, data is
partitioned based on a given strategy.
For this application, the data access
class is designed to accept pluggable
partitioning strategies. A partitioning
strategy is simply a class that
implements a special interface named
IShardPartitionStrategy. Let me
briefly describe how it works.
2. Open the
Microsoft.Samples.Sharding.sln
solution in the code folder of this
demo.
3. Open the
IShardPartitionStrategy.cs file in
the Partition folder of the
Sharding.Web project and show its
methods.

The GetShardFor method is the core
method in this interface. The purpose
of this method is to map any row in a
table to a specific shard based on the
value of its columns.

To implement the GetShardFor
method, it requires metadata that
specifies whether a table needs to be
partitioned and if so, the name of the
field to use as the partition key. The
interface exposes this metadata
through its PartitionMetadata
property.

The Shards property returns a list of
identifiers for every shard available in
a partitioning scheme. To illustrate
with an example, if we use a strategy
that partitions by country, it would
return a list that contains a shard
identifier for each country. An
application can map each identifier in
the list to a connection string that
points to the corresponding target
database.

Finally, we have a Name property
that returns a description of the
partitioning strategy which could be
used to identify how the data is
partitioned.
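
To make the shape of this interface concrete, here is a rough Python stand-in for the four members described above. The real IShardPartitionStrategy is a .NET interface whose exact signatures are not reproduced in this script, so the parameter list of get_shard_for, the metadata representation, and the SingleShardStrategy class are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class ShardPartitionStrategy(ABC):
    """Illustrative stand-in for IShardPartitionStrategy (signatures are assumed)."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Description of how the data is partitioned."""

    @property
    @abstractmethod
    def shards(self) -> List[str]:
        """Identifier of every shard available in the partitioning scheme."""

    @property
    @abstractmethod
    def partition_metadata(self) -> Dict[str, Optional[str]]:
        """Table name -> partition key field, or None for replicated tables."""

    @abstractmethod
    def get_shard_for(self, table: str, field: str, value: Any) -> List[str]:
        """Shard identifiers that may hold rows of `table` where `field == value`."""


class SingleShardStrategy(ShardPartitionStrategy):
    """Hypothetical trivial strategy that keeps everything in one shard."""

    name = "Single shard"
    shards = ["SHARD_00"]
    partition_metadata = {"SalesOrderHeader": None}

    def get_shard_for(self, table: str, field: str, value: Any) -> List[str]:
        return list(self.shards)
```

Because callers only see the abstract members, a data access class written against this shape can switch between partitioning strategies without changing its queries.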
4. Open the
ByDatePartitionStrategy.cs file in
the Partition folder of the
Sharding.Web project.

To create a partitioning strategy, you
need a class that implements the
IShardPartitionStrategy interface. In
the sample, we provide two different
strategies, the first partitions data
based on the country where the order
is shipped, while the second one
partitions based on the quarter of the
order. The shard loader application
that we used at the start of the demo
shares the same partitioning
strategies to load its data. Let’s
quickly look at the implementation for
each of these classes.

First, we will review the
ByDatePartitionStrategy. Here we
have the metadata that determines
how each table is partitioned. Notice
that it contains an entry for each of
the tables. Each table has metadata
that specifies whether the table needs
to be partitioned (or Sharded), and
which partition field to use. Tables
that are not partitioned are identified
as Global.
5. Highlight the EntityPartitionMap field
at the top of the class.
6. Briefly describe the implementation of
the GetShardFor method.

Next, we will look at the
implementation of the GetShardFor
method. The method starts by
retrieving the metadata for the chosen
table to determine whether it needs to
be partitioned.

For “Global” tables, it returns a list of
every available shard. This is for the
benefit of the loader application,
which can replicate the data to each
shard. For queries, any of the shards
in this list can be used to retrieve data
since they all contain the same data.

For “Sharded” tables, the method
retrieves the value of the partition key
field, ensures that it is of DateTime
type, and then extracts its quarter.

Finally, it calls the GetShardName
method using the month value as a
parameter. This last method simply
returns an identifier of the form
QUARTER_{0}, where the
placeholder is replaced by the quarter
value based on the month. The shard
identifier is then returned to the
application.
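
The logic described for the date strategy might look roughly like the following Python sketch. It is not the sample's code: the OrderDate field name, the function signature, and the unpadded QUARTER_n naming are assumptions (the script only gives the form QUARTER_{0}).

```python
from datetime import date
from typing import Dict, List, Optional

# Table name -> partition key field; None marks a replicated ("Global") table.
# The key field name "OrderDate" is an assumption.
ENTITY_PARTITION_MAP: Dict[str, Optional[str]] = {
    "Contacts": None,
    "Products": None,
    "SalesOrderHeader": "OrderDate",
}
ALL_SHARDS = [f"QUARTER_{quarter}" for quarter in range(1, 5)]


def get_shard_name(month: int) -> str:
    """Map a month (1-12) to the identifier of its quarter's shard."""
    quarter = (month - 1) // 3 + 1
    return f"QUARTER_{quarter}"


def get_shard_for(table: str, field: str, value: object) -> List[str]:
    """Return the shards that can hold rows of `table` where `field == value`."""
    key_field = ENTITY_PARTITION_MAP[table]
    # Global tables, and filters on a field other than the partition key,
    # cannot be narrowed down, so every shard is a candidate.
    if key_field is None or field != key_field:
        return list(ALL_SHARDS)
    if not isinstance(value, date):  # datetime is a subclass of date
        raise TypeError("the partition key must be a date value")
    return [get_shard_name(value.month)]
```

Since only the month is inspected, an order dated February 2009 and one dated February 2010 resolve to the same shard.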

The ByCountryPartitionStrategy is
very similar. Examining the
GetShardFor method in this class,
we see that it too retrieves the value
of the partition key field and searches
an array of countries for this value. It
then uses the index of the matching
entry to call the GetShardName
method, which returns a shard
identifier of the form COUNTRY_{0},
where the placeholder is replaced by
the index of the country.
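
A comparable Python sketch of the country strategy follows. The list of countries is hypothetical, chosen so that “Venezuela” falls at index 4 to agree with the COUNTRY_04 shard mentioned later in this script; the two-digit padding is inferred from the database names COUNTRY_00 to COUNTRY_0N, and the function signature is an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical country list; only the Venezuela -> COUNTRY_04 pairing
# comes from the demo script.
COUNTRIES = ["Argentina", "Brazil", "Chile", "Mexico", "Venezuela"]

# Table name -> partition key field; None marks a replicated ("Global") table.
ENTITY_PARTITION_MAP: Dict[str, Optional[str]] = {
    "Contacts": None,
    "Products": None,
    "SalesOrderHeader": "Country",
}


def get_shard_name(index: int) -> str:
    """Build a shard identifier of the form COUNTRY_{index}."""
    return f"COUNTRY_{index:02d}"


def get_shard_for(table: str, field: str, value: object) -> List[str]:
    """Return the shards that can hold rows of `table` where `field == value`."""
    key_field = ENTITY_PARTITION_MAP[table]
    # Without a filter on the partition key, every shard must be considered.
    if key_field is None or field != key_field:
        return [get_shard_name(i) for i in range(len(COUNTRIES))]
    try:
        index = COUNTRIES.index(value)
    except ValueError:
        raise ValueError(f"no shard is defined for country {value!r}") from None
    return [get_shard_name(index)]
```

A query filtered by country resolves to one shard, while a query filtered by any other field, such as a product identifier, resolves to all of them.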

For this demo, we will run the sample
app locally using the development
fabric.

Regardless of its execution
environment, you can see from the
connection string in the configuration
file that the application uses storage
in SQL Azure. Notice that the Initial
Catalog setting in the connection
string includes a placeholder for the
database name. The application uses
this connection string as a template
and maps each shard to a different
database by replacing the database
name with the shard identifier.
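
As a small illustration of this template approach, the Python sketch below substitutes a shard identifier into a connection string. The {0} placeholder syntax, the server name, and the credentials are placeholders assumed for the example; check the ShardsDB setting in your own configuration file for the actual template.

```python
# Assumed template; the server, user, and password values are placeholders.
TEMPLATE = (
    "Server=tcp:yourserver.database.windows.net;"
    "Initial Catalog={0};"
    "User ID=youruser@yourserver;Password=yourpassword;"
)


def connection_string_for(shard_id: str, template: str = TEMPLATE) -> str:
    """Map a shard identifier to a connection string for its database."""
    return template.replace("{0}", shard_id)
```

For example, connection_string_for("COUNTRY_04") yields a string whose Initial Catalog is COUNTRY_04, while every other setting is shared by all shards.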
Open the
ByCountryPartitionStrategy.cs file in
the Partition folder of the Sharding.Web
project to discuss the implementation of
the GetShardFor method.
7. Open the
ServiceConfiguration.cscfg file in
the CloudService project and locate
the shardsDB setting in the
configuration.
8. Show that the connection string
points to database.windows.net and
that the database name (Initial
Catalog) is specified as a
placeholder.
9. Open the Default.aspx page and
switch to design view to show that
different queries are available.

To complete our review of the sample
application, let’s examine its user
interface. The application is very
simple and contains a main page
where you execute different queries
against each of the partitions.

Each query exposed in the UI is
associated with a method in the data
access class.
Query and Insert Partitioned Data
In this segment, you show how to query and insert data in the partitioned database.
Action
Script
Screenshot
1. Press F5 to run the sample
application.

Let’s start the application and see it in
action. We have already pre-loaded a
sample set of data into SQL Azure
and we are going to use that for this
demo.

First, we will execute a query that
retrieves all orders for a given
country. Given the current partitioning
strategy, "partition by country", this
query only needs to query a single
shard to retrieve its results. Notice the
Shard Name column that shows the
name of the database where each
row is stored and in particular, how in
this query, every row belongs to the
same shard.

If we now choose a different country,
we can see that we are now retrieving
data from a different shard.

We will now see how this is
implemented. We can place a
breakpoint in the data access class
and step through the method that
executes this query.

The GetOrdersByCountry method
first constructs a SqlCommand
object with a query that retrieves all
orders from the SalesOrderHeader
table using the country name as a
parameter to filter the rows and get
back a single country.

2. Select the Orders by Country tab in
the application UI.
3. Pick a country from the drop down list
and click the green arrow icon.
4. Draw attention to the Shard Name
column in the results grid.
5. Select a different country and repeat
the query.
6. Show that the Shard Name displayed
for this new set of results differs from
the previous one.
Note: As shipped, the connection
string template in the configuration
file points to a local (default) instance
of SQLExpress. To use SQL Azure,
you need to update the
ServiceConfiguration.cscfg file in the
CloudService project.
7. Switch to Visual Studio, open the
DataAccess.cs file in the Model
folder, and locate the
GetOrdersByCountry method. Place
a breakpoint at the start of this
method.
8. In the browser window, execute the
query once again to force execution
to stop at the breakpoint.
9. Press F10 to single step through the
code until you reach the line following
the call to
PartitionStrategy.GetShardFor
method.
10. Place the cursor over the shards
variable and press SHIFT+F9 to open
a QuickWatch window for this
variable. Show that the list of shard
identifiers contains a single element.
11. Press F5 to resume execution.

It then calls the GetShardFor method
of the current partitioning strategy to
determine which shards it needs to
query and stores the result in the
shards variable. Notice that we are
passing the value of the selected
country into this method.

If we examine the result stored in the
shards variable, we can see that it
contains a single shard identifier. This
matches our expectation since, given
the current partitioning strategy, we
only need to query one shard to
retrieve the results for a single
country.

The method then calls
ExecuteWithThreadPool to execute
the query using the thread pool. We
will talk more about this method when
we discuss querying in parallel across
shards.

Finally, it returns a DataTable with
the result of the query, which we can
use to bind to the application UI.
12. Select the Orders by Product tab in
the application UI.

Now that we have seen a query that
runs within a single shard, we will turn
our attention to queries that can span
multiple shards.

We will now retrieve all orders that
contain a given product. Because the
partitioning scheme splits rows across
different shards based on their
country, we need to query every
shard to retrieve this information.

Looking at the result of this query,
notice how it contains data from
different countries and in particular,
how the shard name column shows
that the information was retrieved
from multiple shards.

We will now look at the code that was
used for this query. In fact, we can
immediately see that it is very similar
to the code in the
GetOrdersByCountry method that
we saw previously. Naturally, the SQL
statement in the query is different.
The other notable difference is that
we are now passing in a ProductID
value to the GetShardFor method.

If we step through this code past the
call to GetShardFor and examine the
value of the shards variable, we can
see that in this case, it contains a list
with multiple shard identifiers, which
means that we must query each one
of these shards to retrieve the results
of the query.
13. Select a product from the drop down
list and click the green arrow icon.
14. Switch to Visual Studio, open the
DataAccess.cs file in the Model
folder, and locate the
GetOrdersByProduct method. Place
a breakpoint at the start of this
method.
15. In the browser window, execute the
query once again to force execution
to stop at the breakpoint.
16. Press F10 to single step through
each line of code until you reach the
line following the call to
PartitionStrategy.GetShardFor
method.
17. Place the cursor over the shards
variable and press SHIFT+F9 to open
a QuickWatch window for this
variable. Show that the list of shard
identifiers contains multiple elements.
18. Press F11 to step into the
ExecuteWithThreadPool method.

This time we are going to step into
the ExecuteWithThreadPool method
and see how it executes the query in
parallel over multiple shards and
merges the partial results from each
thread to return its result.

The parameters of this method
include a SqlCommand object that
specifies the query to execute, and a
list of shard identifiers.

To process several queries in parallel,
the method iterates over the list of
shard identifiers and queues worker
jobs in the thread pool to execute the
query simultaneously over separate
connections that point to each shard.
To initialize a connection, it generates
a connection string from the template
in the configuration file, which we saw
earlier in the demo.

After queuing all the queries, it waits
for the threads to complete and calls
the MergeResults method to
assemble the partial results from
each query into a single DataTable.
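
The fan-out pattern described here can be sketched as follows. This is a Python approximation, not the sample's ExecuteWithThreadPool: it uses temporary SQLite files as stand-ins for the shard databases so that it runs anywhere, and the table layout, function names, and demo data are assumptions.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def query_shard(directory: str, shard: str, sql: str, params: Tuple) -> List[Tuple]:
    """Run the query on one shard over its own connection; tag rows with the shard name."""
    conn = sqlite3.connect(os.path.join(directory, shard + ".db"))
    try:
        return [(shard, *row) for row in conn.execute(sql, params)]
    finally:
        conn.close()


def merge_results(partials: Sequence[List[Tuple]]) -> List[Tuple]:
    """Assemble the partial results from each shard into a single list."""
    merged: List[Tuple] = []
    for rows in partials:
        merged.extend(rows)
    return merged


def execute_with_thread_pool(directory: str, shards: Sequence[str],
                             sql: str, params: Tuple) -> List[Tuple]:
    """Queue one query per shard, wait for all of them, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(query_shard, directory, s, sql, params) for s in shards]
        partials = [future.result() for future in futures]  # blocks until done
    return merge_results(partials)


def create_demo_shards(directory: str) -> List[str]:
    """Create two small stand-in shard databases with (OrderID, ProductID) rows."""
    data = {"COUNTRY_00": [(1, 10), (2, 20)], "COUNTRY_01": [(3, 10)]}
    for shard, rows in data.items():
        conn = sqlite3.connect(os.path.join(directory, shard + ".db"))
        conn.execute("CREATE TABLE SalesOrderDetail (OrderID INTEGER, ProductID INTEGER)")
        conn.executemany("INSERT INTO SalesOrderDetail VALUES (?, ?)", rows)
        conn.commit()
        conn.close()
    return list(data)


def demo() -> List[Tuple]:
    """Find every order containing product 10, across all shards."""
    with tempfile.TemporaryDirectory() as directory:
        shards = create_demo_shards(directory)
        sql = "SELECT OrderID, ProductID FROM SalesOrderDetail WHERE ProductID = ?"
        return execute_with_thread_pool(directory, shards, sql, (10,))
```

Each worker opens its own connection, so no connection object is shared between threads, and the shard name attached to every row plays the role of the Shard Name column shown in the application UI.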

As you can see, although interacting
with a sharded database adds some
complexity to your application, it’s
manageable and can be worth the
performance benefits of using a
scaled out database solution.
19. Step through the code to show a
single iteration of the loop, briefly
describing how it works.
20. Press F5 to resume execution.
21. Select the Insert Order tab in the
application UI.
22. Enter a new Order Date and select a
Country.

Next, we will see what happens when
we insert new orders into the data
store.

Here, we insert a new order into the
database. For the purposes of this
demo, we are only going to enter the
Order Date and Country fields,
which we assume is the country
where the order is to be shipped.
Note that this can be different from
the country of the contact. The
remaining fields are filled with random
data, except for the ID of the Contact,
which will always be 1.

Let’s choose a country, say
“Venezuela”, and select today as the
order date. This will make it easier to
remember when we have to locate
the inserted order later on.
23. Record the values chosen for order
date and country.
24. Explain that a new order is inserted
with default values (contact id #1 is
automatically assigned).
25. Switch to Visual Studio and locate the
InsertOrder method in the
DataAccess.cs file. Place a
breakpoint at the start of this method.
26. Switch back to the browser window,
and click the green arrow icon to
insert the new order and pause at the
breakpoint.

We can set a breakpoint in the
InsertOrder method of the data
access class to see its
implementation. The method uses
classic ADO.NET techniques to build
a SQL command that inserts the
values of the order into the database.

Once again, the GetShardFor
method of the partitioning strategy
determines the shard where the new
order is written. If we examine its
return value, we can see that it
contains a single shard identifier,
which the method uses to create a
connection string using the template
in the configuration file.

Finally, it inserts the order by opening
a connection to the database using
the mapped connection string and
executing the command.
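
The insert path can be sketched along the same lines. In this Python illustration, in-memory SQLite databases stand in for the shard databases and a dictionary of open connections replaces the connection string lookup; the country list, column names, and function names are assumptions.

```python
import sqlite3
from datetime import date
from typing import Dict

# Hypothetical country list; only the Venezuela -> COUNTRY_04 pairing
# comes from the demo script.
COUNTRIES = ["Argentina", "Brazil", "Chile", "Mexico", "Venezuela"]


def shard_for_country(country: str) -> str:
    """Resolve the single shard that stores orders shipped to `country`."""
    return f"COUNTRY_{COUNTRIES.index(country):02d}"


def open_demo_shards() -> Dict[str, sqlite3.Connection]:
    """Create one empty in-memory stand-in database per shard."""
    connections = {}
    for index in range(len(COUNTRIES)):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE SalesOrderHeader "
            "(OrderDate TEXT, Country TEXT, ContactID INTEGER)"
        )
        connections[f"COUNTRY_{index:02d}"] = conn
    return connections


def insert_order(connections: Dict[str, sqlite3.Connection], order_date: date,
                 country: str, contact_id: int = 1) -> str:
    """Insert the order into the shard chosen by the strategy; return its name."""
    shard = shard_for_country(country)
    conn = connections[shard]
    # Parameterized command, so values are never concatenated into the SQL text.
    conn.execute(
        "INSERT INTO SalesOrderHeader (OrderDate, Country, ContactID) VALUES (?, ?, ?)",
        (order_date.isoformat(), country, contact_id),
    )
    conn.commit()
    return shard
```

Unlike the fan-out queries, an insert always resolves to exactly one shard, because the new row carries a value for the partition key.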
27. Continue execution until the line
following the call to GetShardFor.
28. Place the cursor over the shards
variable and press SHIFT+F9 to open
a QuickWatch window for this
variable. Show that the list contains a
single element.
29. Press F5 to resume execution.
30. Select the Orders by Date tab.

Next, we will locate the order that we
have just inserted. To do this, we will
retrieve the orders for today.
Remember that we used today’s date
when we inserted the order.

Notice that the order has been placed
in the COUNTRY_04 shard, which is
the shard that contains data for the
country that we picked when we
inserted the order, that is
“Venezuela”.
31. Enter the date that you specified
when you previously inserted the
order and click the green arrow icon.
Note: The “Orders by Date” query
retrieves all the orders for a given
date. Given the random contents of
the database, there is a chance that
the results may contain additional
orders. You can identify the correct
order by its contact, which should be
Baldwin Museum of Science (1).
Summary
In this demo, you saw the fundamentals of building an application that uses a database sharding strategy to scale out over multiple machines.
The key points to take away are:
1. Database sharding adds complexity to your application but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance
Making the decision to use database sharding is something that should be carefully considered as it can have deep implications for your solution.
A proof of concept is being developed that shows how to use the Microsoft Entity Framework with database sharding to abstract some
of this complexity from the rest of your application. For more information, see the following link:
http://go.microsoft.com/fwlink/?LinkID=161238&clcid=0x409