Demo Script

Scaling Out SQL Azure with Database Sharding

Lab version: 2.0.0
Last updated: 5/3/2017

CONTENTS

OVERVIEW
    Key Messages
    Key Technologies
    Time Estimates
SETUP AND CONFIGURATION
    Task 1 – Running the Dependency Checker
    Task 2 – Configuring the Database Connection String
    Task 3 – Creating and Populating Shard Databases with Sample Data
DEMO FLOW
OPENING STATEMENT
STEP-BY-STEP WALKTHROUGH
    Load and Partition Data
    Review the Sample Application
    Query and Insert Partitioned Data
SUMMARY

Overview

This document provides setup documentation, step-by-step instructions, and a written script for showing a demo of SQL Azure. This document can also serve as a tutorial or walkthrough of the technology.

In this demo you will see the basics of inserting and querying data from a sharded SQL Azure database. For additional demos of the Windows Azure platform, please visit http://www.azure.com.

Note: In order to run through this demo, you must have a SQL Azure developer account. For more information on how to purchase an account, visit the SQL Azure Portal at http://www.microsoft.com/windowsazure/sqlazure/.

Key Messages

In this demo you will see three key things:

1. Database sharding adds complexity to your application, but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance

Key Technologies

This demo uses the following technologies:

1. SQL Azure
2. .NET Framework 4.0
3. Microsoft Visual Studio 2010
4. Windows Azure Tools for Microsoft Visual Studio

Time Estimates

• Estimated time for setting up and configuring the demo: 10+ minutes (depending on how much sample data you generate)
• Estimated time to complete the demo: 20 minutes

Setup and Configuration

The setup and configuration for this demo involves the following tasks:

1. Run the Dependency Checker
2. Configure the Database Connection String
3. Create and Populate Shard Databases with Sample Data

Task 1 – Running the Dependency Checker

This demo does not have any advanced configuration requirements. You simply need to have the prerequisites installed and have an account for SQL Azure.
For more information on how to purchase an account, visit the SQL Azure Portal at http://www.microsoft.com/windowsazure/sqlazure/.

The following steps describe how to run the Dependency Checker utility included with the demo to verify that you have the prerequisite components. You can skip this exercise if you are confident that the prerequisites are properly installed and configured.

1. Open a Windows Explorer window and browse to the demo’s Source\Script folder.
2. Double-click the Dependencies.dep file in this folder to launch the Dependency Checker tool and install any missing prerequisites.
3. If the User Account Control dialog is shown, confirm the action to proceed.

Note: This process may require elevation. The .dep extension is associated with the Dependency Checker tool during its installation. For additional information about the setup procedure and how to install the Dependency Checker tool, refer to the Setup.docx document in the Assets folder of the training kit.

Note: As shipped, the connection string template in the configuration file points to a local (default) instance of SQLExpress. To load data into SQL Azure, you will need to update the connection string in the configuration file to use your SQL Azure account settings.

Task 2 – Configuring the Database Connection String

To use SQL Azure during the demo, you will need to update the connection string used by the application.

1. Open the ServiceConfiguration.cscfg file in the \code\sharding.cloud\ folder and update the connection string in the ShardsDB setting with your SQL Azure account information.
2. Open the Web.config file in the \code\sharding.web\ folder and update the connection string in the ShardsDB setting with your SQL Azure account information.
3. Open the App.config file in the \assets\ShardDbLoader\ folder and update the connection string in the ShardsDB setting with your SQL Azure account information.
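Later in the walkthrough, the connection string is described as a template whose Initial Catalog contains a placeholder for the database name, which the application replaces with a shard identifier. The following is a minimal Python sketch of that substitution, not the sample's own .NET code. The `{0}` placeholder syntax, the server name, and the credentials are all assumptions made for illustration; check the ShardsDB setting in your copy of the demo for the actual template.

```python
# Hypothetical template: the server, user, and password values are stand-ins,
# and the "{0}" placeholder format is an assumption about the real setting.
TEMPLATE = (
    "Server=tcp:yourserver.database.windows.net;"
    "Initial Catalog={0};"
    "User ID=youruser@yourserver;"
    "Password=yourpassword;"
)


def connection_string_for(shard_id: str, template: str = TEMPLATE) -> str:
    """Map a shard identifier to a connection string by filling in the
    database-name placeholder of the template."""
    return template.format(shard_id)


# Each shard identifier yields a connection string for a different database.
for shard in ("COUNTRY_00", "COUNTRY_01"):
    print(connection_string_for(shard))
```

Keeping a single template means that the three configuration files only need your account details once each; the per-shard database names are filled in at run time.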
Task 3 – Creating and Populating Shard Databases with Sample Data

The sample application in this demo requires its shard data to be stored in SQL Azure. Populating the necessary databases takes time, so make sure that this procedure has completed before you start the demo. Prepare this step well in advance.

The demo assets include an application that generates random data and populates the shard databases in SQL Azure. You can control the amount of data produced by specifying how many contacts and orders the program generates. In addition, the shard loader supports two separate partitioning strategies: by date and by country. Running the program with one partitioning scheme does not overwrite the data generated with the other, which means that you could have two separate sets of databases, one for each partitioning strategy.

You specify program options on the command line. To view the available options, type SHARDDB /help.

Note: Currently, a user can create only five databases on SQL Azure. For this reason, you will not be able to create both partitioning strategies (date and country) on SQL Azure at the same time using the same user. Because this demo script is designed to show sharding on SQL Azure with a country partitioning strategy, run the ShardDB command-line application with the country partitioning strategy when you target SQL Azure.

Figure 1 Shard loader application

To run the application, follow these steps:

1. Open a command prompt and change the current directory to the assets\ShardDbLoader folder of the demo. If you executed the Configuration Wizard, it should contain a SHARDDB.EXE executable that resulted from the successful compilation of the solution; otherwise, build the solution in the code folder of the demo installation directory.
2. To create a new set of sample data, run the loader application specifying the /create option and the number of orders and contacts, as well as the partitioning strategy to use. For example, to generate 5000 orders and 300 contacts using the country partitioning strategy, run the shard loader with the following command line.

    SHARDDB.EXE /create orders:5000 contacts:300 partition:country

3. To delete data generated by this application, run the loader with the /delete option and specify which partitioning strategy you wish to delete. For example, to delete the data generated by the country partitioning scheme, specify the following command line.

    SHARDDB.EXE /delete partition:country

Demo Flow

The following diagram illustrates the high-level flow for this demo and the steps involved:

Figure 2 Demo Flow

Opening Statement

During this demo, we will examine a sample application that was built to highlight some of the fundamentals of building apps that connect to sharded databases. It’s critical to understand when it’s appropriate to shard your database. The decision to do so should not be taken lightly, as it adds an additional level of complexity to your application. For certain scenarios the benefits of database scale-out can be huge. This is particularly true for applications that require massive throughput.

There are many different approaches for scaling out your databases. It’s very important to determine the right strategy, as it will directly impact the complexity and performance of your solution. This demo focuses on a scenario where the workload for a single application is spread across multiple databases, so your application must determine dynamically which database to connect to for a given query and must be able to combine the result sets.
For some scenarios, especially for Independent Software Vendors (ISVs), a much simpler strategy may be to assign each customer their own database, removing the need to allow for queries across multiple databases.

When using a scale-out database strategy you get the added resources of each of the machines processing your workload. Having one 10 GB database is not the same as having ten 1 GB databases. When you have ten 1 GB databases, you have distributed your workload over many more machines, which is particularly important for high throughput.

In future versions of SQL Azure we can expect to see more features that help manage the partitioning of your data among multiple databases and fan-out queries.

In this demo you will specifically see three key things:

1. Database sharding adds complexity to your application, but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance

Step-by-Step Walkthrough

This demo consists of the following segments:

1. Load and Partition Data
2. Review the Sample Application
3. Query and Insert Partitioned Data

Load and Partition Data

In this segment, you load sample data into a local SQLExpress instance to show how the partitioning strategy affects the distribution of data in the shards.

Action

1. Open a command prompt and change the current directory to the assets\ShardDbLoader folder of the demo installation directory.

Script

• To start this demo, we first need to populate the store used by the application with sample data. To do this, we will use a console application that loads random data into the databases. Each database contains a copy of the tables used by the sample application, namely Contacts, Products, SalesOrderHeader, and SalesOrderDetail.
• As you will see later in the demo, the sample application uses SQL Azure for storage, but to illustrate how the partitioning strategy affects the distribution of data in each partition, we will initially use a local SQL Server instance to create two separate sets of shards, each partitioned with a different strategy. Developing locally and deploying to SQL Azure is a pattern we expect to see for building applications.

Action

2. Execute SHARDDB.EXE specifying a country partitioning strategy. You can use the default values for the inserted contact and order count parameters.

    SHARDDB /create partition:country

Note: If you previously executed this demo script, you will already have created the partitioned databases. To delete them, use the /delete parameter in the loader application and specify the partitioning strategy for which to delete data. For example:

    SHARDDB /delete partition:country

Note: As shipped, the connection string template in the configuration file points to a local (default) instance of SQLExpress. If you previously configured the connection string to load data into SQL Azure, you will need to restore its original value to use local storage.

Script

• Note that some of the tables contain reference data and their content is not partitioned; instead, the tables are replicated to each shard. This is the case for both the Contacts and Products tables. On the other hand, the data in the SalesOrderHeader table is partitioned using a configurable strategy. In addition, SalesOrderDetail is considered a child of the SalesOrderHeader table. Hence, each row in this table is stored in the shard of its associated row in the parent table. In other words, all detail rows for an order are stored in the same shard as the order header.
• We will now create our data set. First, we will partition the rows by country. The shard loader application allows us to specify the partitioning strategy in its command line. We can also specify the number of contacts and orders inserted into each shard, but we will use the default values for now. This allows us to insert a limited number of rows into the local SQL Server instance, enough to show the distribution of data based on a partition strategy.

Action

3. Open SQL Server Management Studio and connect to the local SQL Server instance.
4. In the Object Explorer, locate the set of databases that correspond to the country partitioning scheme. They should be named COUNTRY_00 to COUNTRY_0N.
5. Pick one of the databases in this set, expand the Tables node, right-click the SalesOrderHeader table, and choose Select Top 1000 Rows.
6. Repeat the process to show the SalesOrderHeader table in one of the other databases in this set.

Script

• Now that we have created our sample data, we will use SQL Server Management Studio to examine the data inserted by the loader and see how it has distributed the information in the shards.
• First, we will view the contents of the SalesOrderHeader table in one of the databases in the set. You can see from the Country column that the table only contains rows for a single country.
• Looking at the contents of the same table in one of the other databases in this set, we can see that it contains data for a different country.
• Next, we will repeat the loading process, but this time specifying that the data be partitioned by date or, to be precise, by quarter.

Action

7. Execute the SHARDDB application once again, this time specifying a date partitioning strategy.

    SHARDDB /create partition:date

8. Switch back to SQL Server Management Studio.
9. In the Object Explorer, locate the set of databases that correspond to the date partitioning scheme. They should be named QUARTER_01 to QUARTER_4.
10. Show the contents of the SalesOrderHeader table for one of the databases in this new set.

Script

• Once again, we can look at the data inserted by the loader in SQL Server Management Studio.
As you can see, it has created a new set of databases, one for each quarter.
• Examining the contents of the SalesOrderHeader table, we can see that in this new partitioning scheme, the table now contains orders for a single quarter of the year. Note that in this demo, for simplicity, the date partitioning scheme only takes into account the quarter, so orders for different years could potentially be allocated to the same shard.

Review the Sample Application

In this segment, you review the sample application and provide a brief description of its implementation.

Action

1. Start Visual Studio 2010 as an administrator (required to run in the development fabric).
2. Open the Microsoft.Samples.Sharding.sln solution in the code folder of this demo.
3. Open the IShardPartitionStrategy.cs file in the Partition folder of the Sharding.Web project and show its methods.

Script

• I will now give you a quick tour of the sample application used in this demo. We’ve built a simple Windows Azure web application designed to highlight some features of database sharding. Let’s start Visual Studio 2010 and open the solution.
• The solution contains three projects. The first one is a sample Web site that allows us to query and insert orders into a SQL Azure data store. Next, we have a cloud service project that we use to host the site as a Web role in Windows Azure. Finally, the third project in the solution contains the console application that we used earlier to load and partition our sample data.
• As we discussed previously, data is partitioned based on a given strategy. For this application, the data access class is designed to accept pluggable partitioning strategies. A partitioning strategy is simply a class that implements a special interface named IShardPartitionStrategy. Let me briefly describe how it works.
• The GetShardFor method is the core method in this interface.
The purpose of this method is to map any row in a table to a specific shard based on the value of its columns.
• To implement the GetShardFor method, a strategy requires metadata that specifies whether a table needs to be partitioned and, if so, the name of the field to use as the partition key. The interface exposes this metadata through its PartitionMetadata property.
• The Shards property returns a list of identifiers for every shard available in a partitioning scheme. To illustrate with an example, if we use a strategy that partitions by country, it would return a list that contains a shard identifier for each country. An application can map each identifier in the list to a connection string that points to the corresponding target database.
• Finally, we have a Name property that returns a description of the partitioning strategy, which could be used to identify how the data is partitioned.

Action

4. Open the ByDatePartitionStrategy.cs file in the Partition folder of the Sharding.Web project.
5. Highlight the EntityPartitionMap field at the top of the class.
6. Briefly describe the implementation of the GetShardFor method.

Script

• To create a partitioning strategy, you need a class that implements the IShardPartitionStrategy interface. In the sample, we provide two different strategies: the first partitions data based on the country where the order is shipped, while the second partitions based on the quarter of the order. The shard loader application that we used at the start of the demo shares the same partitioning strategies to load its data. Let’s quickly look at the implementation for each of these classes.
• First, we will review the ByDatePartitionStrategy. Here we have the metadata that determines how each table is partitioned. Notice that it contains an entry for each of the tables. Each table has metadata that specifies whether the table needs to be partitioned (or Sharded), and which partition field to use. Tables that are not partitioned are identified as Global.
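To make the shape of a pluggable strategy concrete, here is a minimal Python sketch of the ideas this segment describes: per-table metadata (Global or Sharded plus a partition key), a list of shard identifiers, and a method that maps values to shards. It is not the sample's C# code. The method signature, the fallback to all shards when the partition key is not supplied, the `OrderDate` field name, the three-country list, and the zero-padded shard names are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from datetime import date

GLOBAL, SHARDED = "Global", "Sharded"


class ShardPartitionStrategy(ABC):
    """Python stand-in for the IShardPartitionStrategy interface."""

    name: str                 # description of the strategy
    partition_metadata: dict  # table -> (GLOBAL | SHARDED, key field or None)

    @property
    @abstractmethod
    def shards(self) -> list:
        """Identifiers of every shard in this partitioning scheme."""

    @abstractmethod
    def shard_for_key(self, key) -> str:
        """Map one partition-key value to a shard identifier."""

    def get_shard_for(self, table: str, values: dict) -> list:
        kind, key_field = self.partition_metadata[table]
        # Assumption: replicated tables, and queries that do not supply the
        # partition key, can be served only by considering every shard.
        if kind == GLOBAL or key_field not in values:
            return list(self.shards)
        return [self.shard_for_key(values[key_field])]


class ByCountryPartitionStrategy(ShardPartitionStrategy):
    name = "Partition by country"
    countries = ["Argentina", "Brazil", "Chile"]  # hypothetical list
    partition_metadata = {
        "Contacts": (GLOBAL, None),
        "Products": (GLOBAL, None),
        "SalesOrderHeader": (SHARDED, "Country"),
    }

    @property
    def shards(self):
        return [f"COUNTRY_{i:02d}" for i in range(len(self.countries))]

    def shard_for_key(self, key):
        # The index of the country in the array selects the shard.
        return f"COUNTRY_{self.countries.index(key):02d}"


class ByDatePartitionStrategy(ShardPartitionStrategy):
    name = "Partition by quarter"
    partition_metadata = {
        "Contacts": (GLOBAL, None),
        "Products": (GLOBAL, None),
        "SalesOrderHeader": (SHARDED, "OrderDate"),  # field name assumed
    }

    @property
    def shards(self):
        return [f"QUARTER_{q:02d}" for q in range(1, 5)]

    def shard_for_key(self, key):
        if not isinstance(key, date):
            raise TypeError("partition key must be a date")
        quarter = (key.month - 1) // 3 + 1  # only the quarter matters, not the year
        return f"QUARTER_{quarter:02d}"
```

Because callers only see `get_shard_for`, swapping one strategy for the other does not change the query code, which is the point of abstracting the partitioning strategy from individual queries.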
• Next, we will look at the implementation of the GetShardFor method. The method starts by retrieving the metadata for the chosen table to determine whether it needs to be partitioned.
• For “Global” tables, it returns a list of every available shard. This is for the benefit of the loader application, which can replicate the data to each shard. For queries, any of the shards in this list can be used to retrieve data since they all contain the same data.
• For “Sharded” tables, the method retrieves the value of the partition key field, ensures that it is of DateTime type, and then extracts its quarter.
• Finally, it calls the GetShardName method using the month value as a parameter. This last method simply returns an identifier of the form QUARTER_{0}, where the placeholder is replaced by the quarter value based on the month. The shard identifier is then returned to the application.

Action

Open the ByCountryPartitionStrategy.cs file in the Partition folder of the Sharding.Web project to discuss the implementation of the GetShardFor method.

Script

• The ByCountryPartitionStrategy is very similar. Examining the GetShardFor method in this class, we see that it too retrieves the value of the partition key field and searches an array of countries for this value. It then uses the index of the matching entry to call the GetShardName method, which returns a shard identifier of the form COUNTRY_{0}, where the placeholder is replaced by the index of the country.

Action

7. Open the ServiceConfiguration.cscfg file in the CloudService project and locate the ShardsDB setting in the configuration.
8. Show that the connection string points to database.windows.net and that the database name (Initial Catalog) is specified as a placeholder.

Script

• For this demo, we will run the sample app locally using the development fabric.
• Regardless of its execution environment, you can see from the connection string in the configuration file that the application uses storage in SQL Azure. Notice that the Initial Catalog setting in the connection string includes a placeholder for the database name. The application uses this connection string as a template and maps each shard to a different database by replacing the database name with the shard identifier.

Action

9. Open the Default.aspx page and switch to design view to show that different queries are available.

Script

• To complete our review of the sample application, let’s examine its user interface. The application is very simple and contains a main page where you execute different queries against each of the partitions.
• Each query exposed in the UI is associated with a method in the data access class.

Query and Insert Partitioned Data

In this segment, you show how to query and insert data in the partitioned database.

Action

1. Press F5 to run the sample application.
2. Select the Orders by Country tab in the application UI.
3. Pick a country from the drop down list and click the green arrow icon.
4. Draw attention to the Shard Name column in the results grid.
5. Select a different country and repeat the query.
6. Show that the Shard Name displayed for this new set of results differs from the previous one.

Note: As shipped, the connection string template in the configuration file points to a local (default) instance of SQLExpress. To use SQL Azure, you need to update the ServiceConfiguration.cscfg file in the CloudService project.

Script

• Let’s start the application and see it in action. We have already pre-loaded a sample set of data into SQL Azure and we are going to use that for this demo.
• First, we will execute a query that retrieves all orders for a given country. Given the current partitioning strategy, "partition by country", this query only needs to query a single shard to retrieve its results. Notice the Shard Name column that shows the name of the database where each row is stored and, in particular, how in this query every row belongs to the same shard.
• If we now choose a different country, we can see that we are now retrieving data from a different shard.

Action

7. Switch to Visual Studio, open the DataAccess.cs file in the Model folder, and locate the GetOrdersByCountry method. Place a breakpoint at the start of this method.
8. In the browser window, execute the query once again to force execution to stop at the breakpoint.
9. Press F10 to single step through the code until you reach the line following the call to the PartitionStrategy.GetShardFor method.
10. Place the cursor over the shards variable and press SHIFT+F9 to open a QuickWatch window for this variable. Show that the list of shard identifiers contains a single element.
11. Press F5 to resume execution.

Script

• We will now see how this is implemented. We can place a breakpoint in the data access class and step through the method that executes this query.
• The GetOrdersByCountry method first constructs a SqlCommand object with a query that retrieves orders from the SalesOrderHeader table, using the country name as a parameter to filter the rows down to a single country.
• It then calls the GetShardFor method of the current partitioning strategy to determine which shards it needs to query and stores the result in the shards variable. Notice that we are passing the value of the selected country into this method.
• If we examine the result stored in the shards variable, we can see that it contains a single shard identifier. This matches our expectation since, given the current partitioning strategy, we only need to query one shard to retrieve the results for a single country.
• The method then calls ExecuteWithThreadPool to execute the query using the thread pool. We will talk more about this method when we discuss querying in parallel across shards.
• Finally, it returns a DataTable with the result of the query, which we can use to bind to the application UI.

Action

12. Select the Orders by Product tab in the application UI.
13. Select a product from the drop down list and click the green arrow icon.

Script

• Now that we have seen a query that runs within a single shard, we will turn our attention to queries that can span multiple shards.
• We will now retrieve all orders that contain a given product. Because the partitioning scheme splits rows across different shards based on their country, we need to query every shard to retrieve this information.
• Looking at the result of this query, notice how it contains data from different countries and, in particular, how the Shard Name column shows that the information was retrieved from multiple shards.

Action

14. Switch to Visual Studio, open the DataAccess.cs file in the Model folder, and locate the GetOrdersByProduct method. Place a breakpoint at the start of this method.
15. In the browser window, execute the query once again to force execution to stop at the breakpoint.
16. Press F10 to single step through each line of code until you reach the line following the call to the PartitionStrategy.GetShardFor method.
17. Place the cursor over the shards variable and press SHIFT+F9 to open a QuickWatch window for this variable. Show that the list of shard identifiers contains multiple elements.

Script

• We will now look at the code that was used for this query. In fact, we can immediately see that it is very similar to the code in the GetOrdersByCountry method that we saw previously. Naturally, the SQL statement in the query is different. The other notable difference is that we are now passing in a ProductID value to the GetShardFor method.
• If we step through this code past the call to GetShardFor and examine the value of the shards variable, we can see that in this case it contains a list with multiple shard identifiers, which means that we must query each one of these shards to retrieve the results of the query.

Action

18. Press F11 to step into the ExecuteWithThreadPool method.
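As a rough illustration of the fan-out pattern that this segment attributes to ExecuteWithThreadPool (queue one query per shard on a thread pool, wait for all of them, then merge the partial results), here is a minimal Python sketch. It is not the sample's implementation: the in-memory `SHARD_DATA` dictionary stands in for the shard databases, the rows are invented, and a filter function stands in for the SqlCommand.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for three shard databases.
SHARD_DATA = {
    "COUNTRY_00": [{"OrderID": 1, "ProductID": 7}, {"OrderID": 2, "ProductID": 9}],
    "COUNTRY_01": [{"OrderID": 3, "ProductID": 7}],
    "COUNTRY_02": [{"OrderID": 4, "ProductID": 5}],
}


def query_shard(shard_id, predicate):
    """Run the 'query' against one shard and tag each row with the shard
    name, similar to the Shard Name column shown in the demo UI."""
    return [dict(row, ShardName=shard_id)
            for row in SHARD_DATA[shard_id] if predicate(row)]


def merge_results(partials):
    """Assemble the partial results from each shard into a single list."""
    merged = []
    for rows in partials:
        merged.extend(rows)
    return merged


def execute_with_thread_pool(predicate, shard_ids):
    """Queue one query per shard, wait for all of them, and merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(shard_ids))) as pool:
        futures = [pool.submit(query_shard, s, predicate) for s in shard_ids]
        partials = [f.result() for f in futures]  # blocks until each finishes
    return merge_results(partials)


# A query on ProductID cannot be routed by country, so every shard is queried.
orders = execute_with_thread_pool(lambda r: r["ProductID"] == 7, list(SHARD_DATA))
for order in orders:
    print(order)
```

With real databases, each worker would open its own connection to its shard, so the total time is closer to that of the slowest shard than to the sum over all shards.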
Script

• This time we are going to step into the ExecuteWithThreadPool method and see how it executes the query in parallel over multiple shards and merges the partial results from each thread to return its result.
• The parameters of this method include a SqlCommand object that specifies the query to execute, and a list of shard identifiers.
• To process several queries in parallel, the method iterates over the list of shard identifiers and queues worker jobs in the thread pool to execute the query simultaneously over separate connections that point to each shard. To initialize a connection, it generates a connection string from the template in the configuration file, which we saw earlier in the demo.
• After queuing all the queries, it waits for the threads to complete and calls the MergeResults method to assemble the partial results from each query into a single DataTable.
• As you can see, although interacting with a sharded database adds some complexity to your application, it’s manageable and can be worth the performance benefits of using a scaled-out database solution.

Action

19. Step through the code to show a single iteration of the loop, briefly describing how it works.
20. Press F5 to resume execution.
21. Select the Insert Order tab in the application UI.
22. Enter a new Order Date and select a Country.
23. Record the values chosen for order date and country.
24. Explain that a new order is inserted with default values (contact id #1 is automatically assigned).

Script

• Next, we will see what happens when we insert new orders into the data store.
• Here, we insert a new order into the database. For the purposes of this demo, we are only going to enter the Order Date and Country fields; we assume the latter is the country where the order is to be shipped. Note that this can be different from the country of the contact. The remaining fields are filled with random data, except for the ID of the Contact, which will always be 1.
• Let’s choose a country, say “Venezuela”, and select today as the order date. This will make it easier to remember when we have to locate the inserted order later on.

Action

25. Switch to Visual Studio and locate the InsertOrder method in the DataAccess.cs file. Place a breakpoint at the start of this method.
26. Switch back to the browser window, and click the green arrow icon to insert the new order and pause at the breakpoint.
27. Continue execution until the line following the call to GetShardFor.
28. Place the cursor over the shards variable and press SHIFT+F9 to open a QuickWatch window for this variable. Show that the list contains a single element.
29. Press F5 to resume execution.

Script

• We can set a breakpoint in the InsertOrder method of the data access class to see its implementation. The method uses classic ADO.NET techniques to build a SQL command that inserts the values of the order into the database.
• Once again, the GetShardFor method of the partitioning strategy determines the shard where the new order is written. If we examine its return value, we can see that it contains a single shard identifier, which the method uses to create a connection string using the template in the configuration file.
• Finally, it inserts the order by opening a connection to the database using the mapped connection string and executing the command.

Action

30. Select the Orders by Date tab.
31. Enter the date that you specified when you previously inserted the order and click the green arrow icon.

Note: The “Orders by Date” query retrieves all the orders for a given date. Given the random contents of the database, there is a chance that the results may contain additional orders. You can identify the correct order by its contact, which should be Baldwin Museum of Science (1).

Script

• Next, we will locate the order that we have just inserted. To do this, we will retrieve the orders for today. Remember that we used today’s date when we inserted the order.
• Notice that the order has been placed in the COUNTRY_04 shard, which is the shard that contains data for the country that we picked when we inserted the order, that is, “Venezuela”.

Summary

In this demo, you saw the fundamentals of building an application that uses a database sharding strategy to scale out over multiple machines. The key points to take away are:

1. Database sharding adds complexity to your application, but it’s manageable
2. You can abstract the partitioning strategy from your individual queries
3. Running queries in parallel is important for performance

The decision to use database sharding should be carefully considered, as it can have deep implications for your solution. A proof of concept is being developed that shows how to use the Microsoft Entity Framework with database sharding to manage some of this complexity and abstract it from the rest of your application. For more information, see http://go.microsoft.com/fwlink/?LinkID=161238&clcid=0x409
