* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download IBM
Survey
Document related concepts
Transcript
Scalable Data Management with DB2 Matthias Nicola IBM Silicon Valley Lab [email protected] © 2009 IBM Corporation Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 2 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2 Data Server Editions DB2 for z/OS DB2 Enterprise Edition / IBM InfoSphere Warehouse DB2 DB2 Workgroup Edition DB2 Express-C (free!) DB2 Everyplace 3 © 2009 IBM Corporation Information Management – DB2 Information Management Software Business Value of Scalability More historical data = more precise forecasts – Data mining needs a lot of data for pattern accuracy – OLAP needs a lot of data for forecast accuracy Predictable costs when growth occurs – Often the budget is the controlling factor, not technology – Low maintenance cost is important No forced migrations from technology limitations – Enabling very large databases 4 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2 Scalability for OLTP and Data Warehousing 5 Database Partitioning Feature (DPF) DB2 pureScale Range partitioning Multi-Dimensional Clustering (MDC) Compression Self-Tuning Memory Management (STMM) Automatic Storage Workload Management High Availability, Disaster Recovery Recovery Security and Compliance Utilities: Load, Backup & Restore, Redistribute Archiving etc. © 2009 IBM Corporation Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 6 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2's Database Partitioning Feature (DPF) select … from table Tables FCM network Engine Engine Engine Engine … data+log data+log data+log data+log Partition 1 Partition 2 Partition 3 Partition n Database Database is divided into multiple database partitions Database partitions run on same or separate servers (shared-nothing) Each partition has its own table spaces, log, configuration, etc. Data is spread over N database partitions Queries are executed in parallel on all database partitions 7 © 2009 IBM Corporation Information Management – DB2 Information Management Software Flexible configuration options Possible hardware configurations All database partitions on a single machine (logical partitions) easy exploitation of multi-core systems All database partitions on separate machines (physical partitions) Hybrid: multiple machines with several logical partitions on each FCM (Fast Communication Manager) DB2 Partition DB2 Partition DB2 Partition SMP server DB2 Partition SMP server DB2 Partition DB2 Partition DB2 Partition SMP server DB2 Partition SMP server I/O Channels I/O Channels Storage server Storage server Example: 4 physical machines, 2 database partitions per machine http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.partition.doc/doc/c0004569.html 8 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2's Database Partitioning Feature (DPF) …. …. 9 © 2009 IBM Corporation Information Management – DB2 Information Management Software The Distribution Map • Distribution key can consist of one or multiple columns. • Avoid low cardinality columns, such as "gender", "state", etc. • Unique indexes must contain all columns of the distribution key Distribution key C1 000120 column name column value DB2 hash algorithm 5 Distribution map i 0 1 2 3 4 5 6 7 … … p(i) 1 2 3 4 1 2 3 4 … … Partition1 10 Partition2 Partition3 32k Partition4 © 2009 IBM Corporation Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 11 © 2009 IBM Corporation Information Management – DB2 Information Management Software Single Server 12 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2 Database Partitioning Feature = Divide Work Database Partition 1 13 Database Partition 2 Database Partition 3 © 2009 IBM Corporation Information Management – DB2 Information Management Software Range Partitioning Further Reduces I/O Database Partition 1 Database Partition 2 Database Partition 3 January February CREATE TABLE sales (recordID INT, salesdate DATE, ... details XML) DISTRIBUTE BY HASH (recordID) PARTITION BY RANGE (salesdate) EVERY 1 MONTHS ; March 14 © 2009 IBM Corporation Information Management – DB2 Information Management Software Multi-Dimensional Clustering to Further Reduce I/O Database Partition 1 January February Database Partition 2 Database Partition 3 CREATE TABLE sales (recordID INT, salesdate DATE, productID INTEGER, storeID INTEGER, ... details XML) DISTRIBUTE BY HASH (recordID) PARTITION BY RANGE (salesdate) EVERY 1 MONTHS ORGANIZE BY (productID, storeID) ; March 15 © 2009 IBM Corporation Information Management – DB2 Information Management Software Compression Reduces I/O by a Factor of 3x to 4x Database Partition 1 Database Partition 2 Database Partition 3 January February March 16 © 2009 IBM Corporation Information Management – DB2 Information Management Software Data Partitioning and Placement Options Can distribute a table across some or all database partitions. Can replicate a table to have an identical copy on each partition. Database Partitions Part. 1 Part. 2 Part. 3 Part. 4 Part. 5 Part. 6 Part. 7 Part. 8 Table 1: Sales Table 2: Customer Table 3: Product 19 Table 3: Product Table 3: Product Table 3: Product Table 3: Product Table 3: Product Table 3: Product Table 3: Product (copy) (copy) (copy) (copy) (copy) (copy) (copy) © 2009 IBM Corporation Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 20 © 2009 IBM Corporation Information Management – DB2 Information Management Software Join Processing - Example create table tab1(pk1 int, c1 int,...) distribute by hash (pk1); create table tab2(pk2 int, c2 int,...) distribute by hash (pk2); Logical data in the tables: tab1 pk1 c1 1 3 2 3 3 4 7 7 8 12 11 10 12 15 Physical data distribution: database partition 1 tab2 pk2 3 4 5 7 8 10 12 15 c2 2 8 3 4 15 10 12 7 distribute by hash* tab1 pk1 c1 1 3 3 4 7 7 11 10 tab2 pk2 c2 3 2 5 3 7 4 15 7 database partition 2 tab1 pk1 c1 2 3 8 12 12 15 tab2 pk2 c2 4 8 8 15 10 10 12 12 *For simplicity, this example hashes odd key values to partition 1 and even key values to partition 2 21 © 2009 IBM Corporation Information Management – DB2 Information Management Software Collocated Join create table tab1(pk1 int, c1 int,...) distribute by hash (pk1); create table tab2(pk2 int, c2 int,...) distribute by hash (pk2); select * from tab1, tab2 where tab1.pk1 = tab2.pk2; • Both tables are partitioned by the join key • Any join matches are guaranteed to be within any given partition ("co-located") • No join matches across partitions partition 1 tab1 pk1 1 3 7 11 tab2 pk2 3 5 7 15 partition 2 tab1 pk1 2 8 12 tab2 pk2 4 8 10 12 • Allows local joins within each partition, no data movement • Best case, best performance 22 © 2009 IBM Corporation Information Management – DB2 Information Management Software Directed Join select * from tab1, tab2 where tab1.c1 = tab2.pk2; permanent storage partition 1 tab1 pk1 c1 1 3 3 4 7 7 11 10 tab2 pk2 3 5 7 15 on the fly / in memory partition 2 tab1 pk1 c1 2 3 8 12 12 15 partition 1 tab2 pk2 4 8 10 12 DTQ tab1' pk1 c1 1 3 2 3 7 7 12 15 tab2 pk2 3 5 7 15 partition 2 tab1' pk1 c1 3 4 11 10 8 12 tab2 pk2 4 8 10 12 Send rows from tab1 to those partitions where they can find join matches in tab2, i.e. redistribution of tab1, based on hashing of the join key c1. 23 © 2009 IBM Corporation Information Management – DB2 Information Management Software Single Partition Directed Join select * from tab1, tab2 where tab1.c1 = 3 and tab1.c1 = tab2.pk2; partition 1 tab1 pk1 c1 1 3 3 4 7 7 11 10 tab2 pk2 3 5 7 15 partition 2 tab1 pk1 c1 2 3 8 12 12 15 partition 1 tab2 pk2 4 8 10 12 DTQ tab1' pk1 c1 1 3 2 3 partition 2 tab2' pk2 3 Value predicates are used to optimize (reduce) the data flow and eliminate irrelevant partitions from the join processing. 24 © 2009 IBM Corporation Information Management – DB2 Information Management Software Repartitioned Join select * from tab1, tab2 where tab1.c1 = tab2.c2; partition 1 tab1 pk1 c1 1 3 3 4 7 7 11 10 tab2 pk2 c2 3 2 5 3 7 4 15 7 partition 2 tab1 pk1 c1 2 3 8 12 12 15 pk2 4 8 10 12 partition 1 tab2 c2 8 15 10 12 DTQ tab1' pk1 c1 1 3 2 3 7 7 12 15 tab2' pk2 c2 5 3 15 7 4 11 8 15 partition 2 tab1' pk1 c1 3 4 11 10 8 12 tab2' pk2 c2 3 2 7 4 10 10 12 12 DTQ Redistribute both tables by hashing on their join keys so that matching rows end up on the same partition. 25 © 2009 IBM Corporation Information Management – DB2 Information Management Software Broadcast Join select * from tab1, tab2 partition 1 tab1 pk1 1 3 7 11 tab2 pk2 3 5 7 15 partition 2 tab1 pk1 2 8 12 tab2 pk2 4 8 10 12 partition 1 tab1' BTQ pk1 1 3 7 11 2 8 12 tab2 pk2 3 5 7 15 partition 2 tab1' pk1 1 3 7 11 2 8 12 tab2 pk2 4 8 10 12 Broadcast a copy of one table to all database partitions. 26 © 2009 IBM Corporation Information Management – DB2 Information Management Software Data Placement Option: Replicated Table permanent storage partition 1 tab1 pk1 1 3 7 11 2 8 12 tab2 pk2 3 5 7 15 partition 2 tab1(copy) pk1 1 3 7 11 2 8 12 tab2 pk2 4 8 10 12 Good choice for small tables with infrequent insert/update/delete activity, such as dimension tables in a star schema. 27 © 2009 IBM Corporation Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 28 © 2009 IBM Corporation Information Management – DB2 Information Management Software Scalability vs. Performance Performance: Time to complete a given task with given resources Scalability: Ability to add resources to – complete the same task more quickly – handle a bigger task in about the same time Example: Mowing the lawn… – Peter does it alone in 8 hours – Peter and Bob work together and take 4 hours Scalability is perfect, performance is poor! – Jim does it alone in 1 hour – Jim and John together do it in 1hrs20min Performance is great, scalability is awful ! – Mary mows the lawn in 30 minutes – Mary and Susan together need 15 minutes Performance is great, scalability is also great ! 29 © 2009 IBM Corporation Fixed Database Size Mathematically, these two approaches are equivalent…. # of partitions Make queries against a DB of a fixed size faster by adding partitions (“speedup”). Amount of data per partition shrinks. Query elapsed time Query elapsed time Scalability Metrics Increasing Database Size Database size & # of partitions Hold response time constant for a growing database by adding partitions in proportion (“scaleup”/"scale-out"). Amount of data per partition remains constant. Basic assumption: Queries executed against a bigger database examine more data 30 Our Test Design Query elapsed time Increasing database size: 250GB / 500GB / 1TB Increasing number of database partitions Fixed ratio of data volume to number of partitions n partitions 250GB 31 n*2 partitions 500GB n*4 partitions 1 TB Show constant query elapsed times to prove scalability TPoX Benchmark TPoX = Transaction Processing over XML Data Open Source Benchmark: http://tpox.sourceforge.net/ Financial transaction processing scenario: “online brokerage” Realistic test for XML databases Custacc Customers Customer 1 n Account 1 11 n Order Brokerage House DB CustAcc.xsd 4 – 20 kb FIXML (41 XSD files) 1 – 2 kb n Holding n 1 n 1 Security Security.xsd 2 – 9 kb FIXML: Standardized Financial XML Schema for Securities Trading ! 32 Document structures and join relationships ID Name DateOfBirth Address Phone … Account Order CustAcc ID Currency OpeningDate Balance … Holding Symbol Name Type Quantity Holding ID OrignDt TrdDt Acct Side Qty Sym … Symbol Name Type Quantity 33 ID Currency OpeningDate Balance … Holding Symbol Name Type Quantity Sector Industry Category OutstShares FundInformation Holding… Account ID Security Symbol Name SecurityType SecurityInformation StockInformation FundFamily Sector Industry AssetGroup FixedIncome ExpenseRatio TotalAssets MinInitialInvestment MinSubsequentInvest. Price/LastTrade Ask/Bid 50DayAvg 200DayAvg … TPoX Data & Schema Customer 1 n Account 11 n Holding n 1 FIXML: financial industry XML Schema CustAcc: modeled after a real banking system that uses XML Security: information similar to investment web sites 34 n Order CustAcc.xsd 1 n FIXML (41 XSD files) 1 Security Security.xsd Database schema for a non-DPF DB2 database: create table custacc ( cadoc XML ) create table security ( sdoc XML ) create table order ( odoc XML ) Scale Factor “M”, 1 TB raw data 500M Order documents, 50M CustAcc documents 20,833 Securities, independent of scale factor 3 Simple Tables + XML Indexes TPoX Database Schema for DPF - Extract certain XML element values into relational cols as distribution keys - Goal: enable partitioning of both tables by a common key ! ID Name DateOfBirth Address Phone … Account Order custid secsym integer varchar 35 ID OrignDt TrdDt Acct Side Qty Sym … odoc XML order table (500M rows) custid integer CustAcc ID Currency OpeningDate Balance … Holding Symbol Name Type Quantity Holding Holding… Account Symbol Name Type Quantity ID Currency OpeningDate Balance … Holding Symbol Name Type Quantity cdoc XML custacc table (50M rows) What is TPoX-DSS*? • Decision Support workload on top of the regular XML data of the TPoX benchmark • A set of complex SQL/XML queries – Includes massive table scans, aggregation, grouping, OLAP functions, etc. • Focus on single-user query response time 36 * we might come up with a better name in the near future Business Questions Complex SQL/XML Queries Q1: Popular Securities Find securities that have more shares bought than sold across all orders. List their order quantities grouped by year. Q2: Top 10 Most Popular Trading Weeks, Ranked by Order Volume For each year, find the ten most active weeks and return the buy, sell, and total order volumes for each week. Q3: Average Account Balance of Premiun Customers Calculate the average account balance of all premium customers, grouped by their number of accounts. Q4: Average Balance per Number Of Accounts Calculate the average account balance of all customers, grouped by their number of accounts. Q5: Percentage of buy orders per sector and gender For each stock in a given sector of securities, find the percentage of buy orders placed by male vs. female clients. 37 Business Questions Complex SQL/XML Queries Q6: Max Stock Orders for an Industry List the 20% (or: x%) most expensive orders for customer in a given state and for a given industry (subset of securities). Q7: Order Amounts for Two Major Currencies Calculate the min, max and avg order amount for all orders in a given timeframe grouped by buy/sell for two major currencies. Q8: Order Amounts for All Currencies Calculate the min, max and avg order amount for all orders in a given timeframe grouped by buy/sell and the order’s currency. Q9: Balance per Currency Each account is in a specific currency. Calculate the average account balance for each currency. Q10: Sleeping Customers Find all customers having less than x orders in a given timeframe. 38 TPoX DSS: Query Characteristics Query Tables Characteristics Q1 Popular Securities O, S 2 x XMLTABLE, Group By, Order By Q2 Top 10 Most Popular Trading Weeks O Full scan of all orders, OLAP Function rank() Q3 Average Account Balance of Premiun Customers C Indexed access to premium customers, Group By, Order By Q4 Average Balance per Number Of Accounts C Full scan of all customers Q5 Percentage of buy orders per sector and gender C, O, S Aggregation, SQL OLAP Functions, 3 x XMLTABLE, 2 x XMLEXISTS Q6 Max Stock Orders for an Industry C, O, S 2 x XMLTABLE, 2 x XMLEXISTS Q7 Order Amounts for Two Major Currencies O Several predicates, CASE expression Q8 Order Amounts for All Currencies O 4 aggregation functions, Group By two XML attributes Q9 Balance per Currency C Full scan of all accounts, aggregation and grouping Q10 Sleeping Customers C, O Common table expression All queries available upon request, in SQL/XML notation. 39 Q5: Percentage of buy orders per sector and gender SELECT DISTINCT secsector, gender, SUM(ordqty) OVER (PARTITION BY secsector, gender) AS orderqty, SUM(ordqty) OVER (PARTITION BY secsector, gender) * 100 / SUM(ordqty) OVER (PARTITION BY secsector) AS percentage FROM security, order, custacc, XMLTABLE(' declare namespace s="http://tpox-benchmark.com/security"; $SDOC/s:Security' COLUMNS secsector VARCHAR(30) PATH '*:SecurityInformation//*:Sector', secname VARCHAR(50) PATH '*:Name') AS T1, XMLTABLE(' declare default element namespace "http://www.fixprotocol.org/FIXML-4-4"; $ODOC/FIXML/Order' COLUMNS ordqty BIGINT PATH '*:OrdQty/@Qty') AS T2, XMLTABLE(' declare namespace c="http://tpox-benchmark.com/custacc"; $CADOC/c:Customer' COLUMNS gender VARCHAR(10) PATH '*:Gender') AS T3 WHERE order.secsym = security.secsym AND order.custid = custacc.custid AND XMLEXISTS(' declare namespace s="http://tpox-benchmark.com/security"; $SDOC/s:Security/s:SecurityInformation/*[s:Industry="OfficeSupplies" and s:MinInitialInvestment=5000]') AND XMLEXISTS(' declare default element namespace "http://www.fixprotocol.org/FIXML-4-4"; $ODOC/FIXML/Order[@Side = "2"]') ORDER BY secsector, gender; 40 Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 41 © 2009 IBM Corporation Data Partitioning in a Cluster Each node has 2 Intel Xeon 5169 dual-core CPUs, and 32GB RAM. 4 cores per node we use 4 database partitions per node. 8 processing nodes Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 8 database partitions, 250GB 16 database partitions, 500 GB 32 database partitions, 1TB 45 Node 7 Node 8 Scalability Results: Cluster TPoX/DSS Query Response Times (Cluster) Elapsed time (seconds) 250GB / 8 partitions 500GB / 16 partitions 1TB / 32 partitions Q1 Q2 Q3 Q4 Source: IBM internally measured results, September 2009 Q5 Q6 Q7 Q8 Q9 Q10 Query Query response times for 500GB and 1TB are close to the 250GB results! 47 Information Management – DB2 Information Management Software Agenda Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF) – Overview – Data partitioning, clustering, placement – Join Methods TPoX Scalability in a DPF database – Scalability vs. Performance – Benchmark configuration & results pureScale Overview Summary 48 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2 pureScale Goals Unlimited Capacity Any transaction processing or ERP workload Start small Grow easily, with your business Application Transparency Avoid the risk and cost of tuning your applications to the database topology Continuous Availability Maintain service across planned and unplanned events Webcast: http://www.channeldb2.com/video/db2-purescale-a-technology Web site: http://www.ibm.com/software/data/db2/linux-unix-windows/editions-features-purescale.html 49 © 2009 IBM Corporation Information Management – DB2 Information Management Software DB2 pureScale : Technology Overview Clients connect anywhere,… … see single database Clients Clients connect into any member Automatic load balancing and client reroute may change underlying physical member to which client is connected Single Database View DB2 engine runs on several host computers Member CS Member CS Integrated cluster services Member Member CS Co-operate with each other to provide coherent access to the database from any member CS Failure detection, recovery automation, cluster file system In partnership with STG (GPFS,RSCT) and Tivoli (SA MP) Low latency, high speed interconnect Cluster Interconnect Special optimizations provide significant advantages on RDMAcapable interconnects (eg. Infiniband) PowerHA pureScale technology CS 2nd-ary CS Log Log Log Shared Storage Access Database 50 Log Primary Efficient global locking and buffer management Synchronous duplexing to secondary ensures availability Data sharing architecture Shared access to database Members write to their own logs Logs accessible from another host (used during recovery) © 2009 IBM Corporation Information Management – DB2 Information Management Software Scale with Ease Without changing applications Efficient coherency protocols designed to scale without application change Applications automatically and transparently workload balanced across members Without administrative complexity Single Database View DB2 DB2 DB2 DB2 DB2 No data redistribution required Log Log Log Log Log To 128 members in initial release Limited by testing resources 51 © 2009 IBM Corporation Information Management – DB2 Information Management Software What is a PowerHA pureScale ? Software technology that assists in global buffer coherency management and global locking db2 agents & other threads Derived from System z Parallel Sysplex & Coupling Facility technology Software based Services provided include db2 agents & other threads log buffer, dbheap, & other heaps bufferpool(s) log buffer, dbheap, & other heaps bufferpool(s) Group Bufferpool (GBP) Global Lock Management (GLM) Shared Communication Area (SCA) Primary Log Log Members duplex GBP, GLM, SCA state to both a primary and secondary GBP GLM SCA Secondary Done synchronously Duplexing is optional (but recommended) Set up automatically, by default Shared database (Single database partition) 54 © 2009 IBM Corporation Information Management – DB2 Information Management Software The Role of the GBP GBP acts as fast disk cache Client A : Select from T1 Client C : where C2=Y Select from T1 where C2=Y Client B : Update T1 set C1=X where C2=Y Commit Member 1 Member 2 Dirty pages stored in GBP, then later, written to disk Provides fast retrieval of such pages when needed by other members bufferpool(s) bufferpool(s) GBP includes a “Page Registry” Keeps track of what pages are buffered in each member and at what memory address Used for fast invalidation of such pages when they are written to the GBP Force-at-Commit (FAC) protocol ensures coherent access to data across members DB2 “forces” (writes) updated pages to GBP at COMMIT (or before) GBP synchronously invalidates any copies of such pages on other members 55 New references to the page on other members will retrieve new copy from GBP In-progress references to page can continue GBP GLM SCA Page Registry M1 M2 © 2009 IBM Corporation Information Management – DB2 Information Management Software Stealth System Maintenance Goal: allow DBAs to apply system maintenance without negotiating an outage window Procedure: Single Database View DB2 DB2 DB2 DB2 1. Drain (aka Quiesce) 2. Remove & Maintain 3. Re-integrate 4. Repeat until done 57 Log Log Log Log Enables continuous availability © 2009 IBM Corporation Information Management – DB2 Information Management Software Achieving Efficient Scaling : Key Design Points Deep RDMA exploitation over low latency fabric Enables round-trip response time ~10-15 microseconds Lock Mgr Lock Mgr Lock Mgr Lock Mgr Buffer Mgr Silent Invalidation Informs members of page updates requires no CPU cycles on those members No interrupt or other message processing required Increasingly important as cluster grows GBP GLM SCA Hot pages available without disk I/O from GBP memory RDMA and dedicated threads enable read page operations in ~10s of microseconds 58 © 2009 IBM Corporation Information Management – DB2 Information Management Software of Transaction Throughput 61 © 2009 IBM Corporation Information Management – DB2 Information Management Software Questions / Discussion [email protected] 63 © 2009 IBM Corporation Information Management – DB2 Information Management Software Backup Slides 64 © 2009 IBM Corporation Information Management – DB2 Information Management Software Features to Minimize Planned Outages Backup: Fast, scalable, granular – – – – – – – Online or offline Fully parallel and scalable Can be throttled Partition-level backup Table space-level backup Full, Incremental, or Delta Volume snapshot support Load: Fast, scalable and granular – – – – Fully parallel and scalable Partition-level Online load Online index rebuild Automatic log management 65 Other utilities – Online statistics collection – Online index create and reorganization – Online reorganization – Online inspect Dynamic operations – Configuration parameters – Buffer pool operations – Container operations Space management – Online container management – Automatic storage – Online index reorganization © 2009 IBM Corporation Information Management – DB2 Information Management Software Features to Minimize Unplanned Outages Hardware failures – Integration with TSA cluster manager – Built-in redundancy can't be turned off – Consistency bits – Log mirroring – Automatic mirroring of critical data files – Support for RAID Fast recovery – – – – – 66 High availability – Clustering / failover support – Integrated with TSM – Automatic client reroute Human and Application Errors – Point-in-Time (POT) recovery – Drop table recovery Miscellaneous – Infinite active logging – Online container operations Continuous check pointing Parallel recovery Automatic recovery tuning Filtered recovery Dynamic debugging capability © 2009 IBM Corporation Information Management – DB2 Information Management Software OLAP Optimization Advisor InfoSphere Warehouse will design the aggregates to support dimensional analysis for you using: – Hybrid line – Statistics – Meta-data that describes the cubes • Hierarchies, dimensions, measures, etc. – Optimizes to understand impact to load times and performance trade-off 67 © 2009 IBM Corporation Information Management – DB2 Information Management Software Universal Cubing Services Access Portals, Web Applications, Dashboards, Interactive Reports, Ad Hoc Analysis, Common Desktop Tools IBM Cognos 8 BI IBM DataQuant & DB2 QMF Microsoft Excel Cubeware Cockpit Universal Cube Access (ODBO, XMLA) InfoSphere Warehouse 68 © 2009 IBM Corporation Information Management – DB2 Information Management Software InfoSphere Warehouse Data Mining Data Mining Embedded into Applications and Processes SOA Processes BI Analytical Tools Web Analytical Apps Mining Visualizer SQL Interface DB2 InfoSphere Warehouse • Enterprise-Level Data Mining • High-Speed, In-Database Scoring SQL Structured & Unstructured Data 69 Model Results Modeling In-Database Data Mining Scoring Functions © 2009 IBM Corporation Information Management – DB2 Information Management Software InfoSphere Warehouse Text Analytics Analyze and extract structured data from text – Makes data available to normal reporting and analysis tools – From customer call center records, claim forms, etc. Benefits – Target specific information hidden within text – Competitive edge by driving further business insight – Drives a greater ROI for your applications Business value examples – – – – – 70 Better product categorization Early warning on customer attrition Fraud detection Product defect analysis Better customer profiling Simple text analysis capabilities for text columns stored in warehouse tables Pattern matching rules and simple linguistics Enhance existing reports and data mining with insights gleaned from text Simple rules and dictionary editor © 2009 IBM Corporation Information Management – DB2 Information Management Software InfoSphere Warehouse Design Studio Leverage and extend InfoSphere Data Architect: – Design and modify database physical models (schema & storage design, etc) – Design and model OLAP objects – Design and model warehouse transformation and mining flows Key Features: Database design, or reverse engineer an existing database or DDL (RDA) – View/Modify the schema – Compare/Sync DB objects – Analyze design (best practices and dependencies), Validation – DB2 Storage Modeling: Table Space, Buffer Pool, Partition Generate script & Deploy: on data models, and flow models Impact Analysis: on data models and flow models 71 © 2009 IBM Corporation71 What’s new in TPoX 2.0 TPoX 2.0 includes pervasive change to the benchmark TPoX 2.0 test results not comparable to previous versions of TPoX Data Generator Workload and WorkloadDriver TPoX V1.3 and Earlier TPoX 2.0 TPoX V1.3 and Earlier TPoX 2.0 Based on Toxgene A single java based program 3rd party tool, lack of support Complete rewrite Workload description file in proprietary format, hard to read Workload description file in XML format, easy to read and create Slow (> 5 days for 1TB data) Fast (6 hours for 1TB data) Can’t generate dense account IDs for CUSTACC Account IDs are now dense WorkloadDriver reads input documents from large amount of small files Large amount of small XML files Small amount of larger files, each contains 50K XML documents WorkloadDriver reads input documents from smaller amount of larger files, improved performance for reading XML input documents Update transaction U1, U5 and U6 select account for update based on customer ID Update transaction U1, U5 and U6 select account for update based on account ID Data Distribution TPoX V1.3 and Earlier TPoX 2.0 # of CUSTACC vs # of ORDER 1:5 1:10 XML document size range 1-20KB 1-23KB ACCOUNT IDs of customer Not dense Dense Total XML document size of “100GB” scale Slightly less than 100GB Slight larger than 100GB avg # of accounts per customer 1.5 2.0 Changes have improved performance of generating and consuming TPoX XML data in large scale TPoX benchmarks ! NOTE: please refer to TPoX V2.0 Release Note at http://sourceforge.net/projects/tpox for more detail 72 More information on XML data management in DB2 for Linux, UNIX, Windows and DB2 for z/OS http://tinyurl.com/pureXML 73