
DATA WAREHOUSING: SQL SERVER PARALLEL DATA WAREHOUSE AU3 UPDATE

Dandy Weyn, Sr. Technical Product Manager, Microsoft Corporation
@ilikesql

This document has been prepared for limited distribution within Microsoft. This document contains materials and information that Microsoft considers confidential, proprietary, and significant for the protection of its business. The distribution of this document is limited to those solely involved with the program described within.

Confidential and Proprietary © 2011 Microsoft
Last Updated: Monday, May 22, 2017

FAST GROWING INDUSTRY AND ENTERPRISE DATA

Problem:
• Data warehousing systems continue to grow at a fast pace
• New types of large data sets and sources have emerged
• Data is not in a uniform format and shape

What is needed? A solution that:
• Scales from a few TBs to PBs of data
• Allows adding capacity/power as needed
• Offers a variety of choices tailored towards custom needs
• Handles all the data:
  o Structured, semi-structured and unstructured
  o Unicode and non-Unicode

MICROSOFT DATA WAREHOUSE OFFERINGS

Values per offering, left to right:
Effort to Build:   Very High | Very Low | Moderate | Moderate | Moderate | Moderate | Very Low
Capacity:          Variable | 5 TB | 14 TB | 20 TB | 40 TB | 80 TB | 500 TB+
Concurrency:       Variable | Light | Light | Medium | Medium | High | Very High
Query Complexity:  Variable | Medium | Medium | Medium | Medium | High | Very High

SQL SERVER | APPLIANCES

SQL SERVER PARALLEL DATA WAREHOUSE

• Tier-1 Enterprise Data Warehouse Appliance Offering
  o High scalability from tens to hundreds of terabytes
  o High performance through the MPP system
• Flexibility and Choice
  o Choice of deployment options through distributed architecture
• Most Comprehensive Solution
  o Complete data warehouse solution spanning desktop, enterprise data warehouse, and data marts

PDW – CLIENT CONNECTIVITY

[Diagram labels: Client Drivers; Support/Patching; ETL Load Interface; Corporate Backup Solution; Control Rack; Data Rack]

MICROSOFT PDW APPLIANCE – POWERED BY DELL

[Diagram labels: PowerEdge R610; MD3620f; Storage Nodes; Database Servers; Control Nodes (R710), Active / Passive; Client Drivers; Landing Zone (R510); Dual Fibre Channel; Data Center Monitoring; Dual InfiniBand; Management Servers (R610); ETL Load Interface; Backup Node (R710 and MD3600f w/ MD1200s); Corporate Backup Solution; Corporate Network; Spare Database Server; Private Network]

PDW – QUERY PROCESSING

[Diagram label: Query]

CONTROL NODE

• Client connections always go through the control node
• Contains no persistent user data
• Parallel Data Warehouse advantages:
  o Processes SQL requests
  o Prepares execution plan
  o Orchestrates distributed execution
• Local SQL Server processes final query plan and aggregates results
• Provided by DataDirect
  o Open Database Connectivity (ODBC), Object Linking and Embedding Database (OLE DB), Java Database Connectivity (JDBC), and ActiveX® Data Objects (ADO.NET) client drivers
  o Wire protocol (SequeLink)
  o Drivers are available for 32 bits and 64 bits

MANAGEMENT NODE

• Provides support and patching for the appliance
• Holds image for re-deployment of compute node
• Holds Active Directory

LANDING ZONE

• Provides high-capacity storage for data files from ETL processes
• Is available as a sandbox for other applications and scripts that run on the internal network
• Provides SQL Server Integration Services

[Diagram labels: Source; Landing Zone; Files; DWLoader or SQL Server Integration Services; Data Loader; Compute Nodes]

DATA RACK

• Data rack servers: 10 active + 1 passive
• InfiniBand, FC and Ethernet switching
• Expansion: grow from 1–4 data racks, storage options, test/dev system
• Consists of COMPUTE NODES and STORAGE NODES

COMPUTE NODE

• Each MPP node is a highly tuned symmetric multiprocessing (SMP) node with standard interfaces
• Provides dedicated hardware, database, and storage
• Runs SQL Server
• Spare node provides failover in case of node failure
• Drives are configured as RAID 1

BACKUP NODE

• Provides integrated backup solution
• Integrates with 3rd-party backup options
• Orderable in different sizes

DATA LAYOUT APPROACHES

Replicated: A table structure exists as a full copy within each discrete Parallel Data Warehouse node.

Distributed: A table structure is hashed on a single column and uniformly distributed across all nodes on the appliance. Each distribution is a separate physical table in the database management system (DBMS).

Ultra Shared-Nothing: Provides the ability to design a schema of both distributed and replicated tables to minimize data movement between nodes.
• Small sets of data can be more efficiently stored in full (replicated).
• Certain set operations (such as single-node operations) are more efficient against full sets of data.
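
The replicated and distributed layouts described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not PDW's implementation: the hash function (MD5 here) is a stand-in because the slides do not say which hash PDW uses, rows are assigned per node rather than per distribution, and every name (`node_for`, `distribute`, `replicate`, `local_join`, the `sales` and `stores` tables) is hypothetical. The node count of 10 is taken from the "10 active" servers in a data rack.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 10  # assumption: the 10 active compute nodes of one data rack


def node_for(key, num_nodes=NUM_NODES):
    """Map a distribution-column value to a node index (illustrative hash, not PDW's)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(rows, column, num_nodes=NUM_NODES):
    """Hash-distribute rows on one column: each row is stored on exactly one node."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(row[column], num_nodes)].append(row)
    return placement


def replicate(rows, num_nodes=NUM_NODES):
    """Replicate rows: every node stores a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


def local_join(fact_placement, dim_copies, fact_key, dim_key):
    """Join each node's fact rows against its local dimension copy; no rows cross nodes."""
    joined = []
    for node, fact_rows in fact_placement.items():
        lookup = {d[dim_key]: d for d in dim_copies[node]}
        for f in fact_rows:
            if f[fact_key] in lookup:
                joined.append({**f, **lookup[f[fact_key]]})
    return joined


# Hypothetical data: a large fact table and a small dimension table.
sales = [{"sale_id": i, "store_id": i % 3, "amount": 10 * i} for i in range(1000)]
stores = [{"store_id": s, "city": c} for s, c in enumerate(["Seattle", "Austin", "Boston"])]

sales_by_node = distribute(sales, "sale_id")   # large table: distributed
stores_by_node = replicate(stores)             # small table: replicated
result = local_join(sales_by_node, stores_by_node, "store_id", "store_id")
```

Because the small table is present in full on every node, each node can complete its part of the join using only local data, which is the data-movement saving the slide attributes to mixing distributed and replicated tables.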
ULTRA SHARED-NOTHING ARCHITECTURE

Extends Traditional Shared-Nothing Design
• Pushes shared-nothing architecture into the SMP node; there is I/O and CPU affinity within SMP nodes
  o Eliminates contention for user queries
  o Uses full resources for each user query
• Provides multiple physical instances of tables
  o Distributes large tables
  o Replicates small tables
• Redistributes rows as needed

Provides Fault Tolerance
• All hardware components have redundancy (including CPUs, disks, networks, power, and storage processors)
• Control and compute nodes use failover clustering
• Management nodes have active and standby states

SQL SERVER 2008 R2 PARALLEL DATA WAREHOUSE APPLIANCE UPDATE 3

• Improve Performance: Cost-Based Optimizer
• Broaden Functionality: Collations and Stored Procedures
• Expand Flexibility: Entry Appliances

THEME: PERFORMANCE AT SCALE
COST-BASED OPTIMIZER

Goal:
• Generate better execution plans

Functionality:
• Large space of execution alternatives explored
• Best alternative picked based on the costing
• Cost model that is sensitive to the amount of data to be moved

Benefits:
• Leverages existing SQL Server optimizer and years of development
• 10X or more performance improvement compared to AU2
• Plan adaptable to heuristics change

[Chart: TPC-H Power Metric]                AU2: 19,711   AU3: 54,602
[Chart: TPC-H Total Elapsed Time (s)]      AU2: 59,314   AU3: 9,969

THEME: PERFORMANCE AT SCALE
ZERO DATA CONVERSIONS

Goal:
– Eliminate CPU utilization spent on data conversions

Functionality:
– Using ODBC instead of ADO.NET for reading and writing data
– Minimizing appliance resource utilization for data moves

Benefits:
– Better resource (CPU) utilization
– Move operations 6x or more faster than in AU2

[Chart: DMS CPU Utilization - TPC-H; CPU (%) for queries Q1–Q22, AU2 vs. AU3]
[Chart: Improvement Factor (axis 0–7) for Replicated table load, Shuffle, Replicate, Trim, Broadcast]
* Improvement factor calculated based on PDW PGQL

THEME: PERFORMANCE AT SCALE
PDW ENTRY APPLIANCE ("… FOR THE RIGHT PRICE …")

Goal:
– Appliance for the lower end of the market

Functionality:
– ~40% less processing power (4+1 Compute Nodes)
– Up to 50 TB disk capacity (4 Storage Arrays)
– Dell-based hardware reference architecture
– Complete PDW functionality (no less, no more)

Benefits:
– ~40% cheaper than 1 rack appliance
– The lowest cost/TB on the market
– Increased flexibility and choice (appliances for different needs)

THEME: SQL SERVER COMPATIBILITY
STORED PROCEDURES

Goal:
– Common code encapsulation and reuse

Functionality:
– System and user-defined stored procedures
– Invocation using RPC or EXECUTE
– Support for: control flow logic, input parameters

Benefits:
– Enables common logic re-use
– Allows porting existing scripts
– Increases compatibility with SQL Server

Syntax:

    CREATE { PROC | PROCEDURE } [dbo.]procedure_name
        [ { @parameter data_type } [ = default ] ] [ ,...n ]
    AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] }
    [;]

    ALTER { PROC | PROCEDURE } [dbo.]procedure_name
        [ { @parameter data_type } [ = default ] ] [ ,...n ]
    AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] }
    [;]

    DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;]

    [ { EXEC | EXECUTE } ]
        { { [database_name.][schema_name.]procedure_name }
          [{ value | @variable }] [ ,...n ] }
    [;]

    { EXEC | EXECUTE } ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] )
    [;]

THEME: IMPROVED INTEGRATION
HADOOP CONNECTOR

Goal:
– Handle both structured and unstructured data

Functionality:
– Bi-directional (import/export) interface between Microsoft Hadoop and PDW
– Delimited file support
– Adapter uses existing PDW tools (bulk loader, dwsql)
– Data transfer to/from PDW Landing Zone node over FTP channel

Benefits:
– Low-cost solution that handles all the data
– Additional agility, flexibility and choice

[Diagram labels: Hadoop; SQOOP based adapter; Config file; HDFS; Landing Zone Node; Bulk Data Loader; PDW agent; dwsql; PDW]
THEME: IMPROVED INTEGRATION
COLLATIONS

Goal:
– Support local and international customers / data

Functionality:
– Fixed server-level collation
– User-defined column-level collation
– Supporting all Windows collations
– Allow COLLATE clauses in queries and DML

Benefits:
– Store all the data in PDW w/ additional querying flexibility
– Existing DDLs and query scripts
– SQL Server alignment and functionality

Examples:

    CREATE TABLE T ( c1 varchar(3) COLLATE traditional_Spanish_ci_ai, c2 varchar(10) COLLATE …)
    SELECT c1 COLLATE Latin1_General_Bin2 FROM T
    SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2

DISTRIBUTED ARCHITECTURE / HUB - SPOKE

[Diagram labels: SSRS; Excel/Excel Services; SharePoint; SSIS; PowerPivot]

FLEXIBLE BUSINESS ALIGNMENT

Parallel database copy technology enables rapid data movement and consistency between EDW and data marts.

Supports user groups with very different service-level agreements (SLAs):
• Performance
• Capacity
• Loading
• Concurrency

Create SQL Server 2012, Fast Track Data Warehouse for SQL 2012, and SQL Server Analysis Services data marts.

A distributed architecture gives you the flexibility to add or change diverse workloads or user groups while maintaining data consistency across the enterprise.

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
