Large Databases: Do's & Don'ts
• Angelo Sironi
• Executive IT Architect – IBM
Padoa, 11 June 2008
© 2008 IBM Italia S.p.A.

Managing large databases, or ... volumes put theories to the test ...
What to do, and what not to do, to ensure performance and scalability for applications and architectures that must manage terabytes of data.

Contents
• Asilomar, 1998-2008 XLDB
• Large Databases: How Large?
• Concerns
• Do's & Don'ts
• A look into the future …
• Concluding remarks

Asilomar, August 1998
Database system research agenda for the next decade
• Plug & Play Data Base Management Systems
• Federate Millions of Database Systems
• Rethink Traditional Database System Architecture
• Smart-Data: Unify Process and Data in Database Systems
• Integrate Structured and Semi-structured Data
The Information Utility: Make it easy for everyone to store, organize, access, and analyze the majority of human information online.

XLDB, 2007 - SciDBMS, 2008 - CW, 2008
• 1st Workshop on Extremely Large Databases (SLAC, Oct. 2007)
  - All of the industry representatives had more than 10 petabytes of data, and their largest individual systems were all at least 1 petabyte in size.
• SciDBMS Meeting 2008 (Asilomar, Apr. 2008)
  - SciDBMS must be able to scale to databases of hundreds of petabytes, with individual tables measured in trillions of rows.
• CW - May 22, 2008
  - "Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest" (Computerworld)
  - Largest table: "multiple trillions of rows"
  - "The 2 Pb database requires fewer than 1,000 PC servers"

What's an XLDB for us?
• Total size of all DBs?
• Size of largest DB?
• Size of largest table / file?
• Larger than one petabyte? NO!
• So what?
• Or is it an XLDB when it concerns us?
• But … why are we concerned?

Concerns related to VLDB / XLDB
• Sailing unknown oceans …
  - Technology
  - Experience
  - Skills
  - Methods
• Past negative experience
  - Fear it might happen again

Concerns related to VLDB / XLDB
• Feasibility / cost
• Sizing
• Engineering
• Performance
• Throughput
• Availability
• Maintenance
• Administration
• …
Concerns apply to DW systems as well as to OLTP ones.

Do's & Don'ts (1/3)
• Define / identify non-functional requirements
  - Be sound with definitions
  - Identify most critical requirements
  - Don't confuse business requirements with technical implementations
• Size the environment
• Don't follow opinions
  - Base your decisions on proven facts
• Don't be dogmatic – always understand the impact of your decisions
  - "Everything must be in Boyce-Codd Normal Form" … or …
  - "Must always follow Kimball's Dimensional Model using Surrogate Keys"

Do's & Don'ts (2/3)
• Don't consider non-functional requirements as an afterthought (1)
  - Set validation points
  - Compare with sizing critical specs
  - Prototype the unknown
  - Measure
• Don't rely on the first solution that comes to mind
  - On most critical issues, be ready with alternatives
  - Lateral thinking may help
  - Prototype, prototype, prototype …
  - Measure, compare and contrast
(1) Afterthought = an addition to something already completed

Do's & Don'ts (3/3)
• Consider parallelizing everything
  - But don't forget Amdahl's Law …
  - … and Hot Spots …
• Automate everything
  - Reduce complexity
  - Reduce operating time
• Document accurately
  - With all relevant … details

Current Data Warehouse Challenges
• Total cost of ownership
  - Large amount of data, large storage cost
  - Huge hardware management cost
• Complex BI queries
  - Planning for reporting queries
  - Build index/MQT (materialized query table) in advance
  - Unexpected ad-hoc query performance
  - Hinders interactive data analysis

Current Data Warehouse Research Goals
• Dramatic TCO reduction
  - Extreme compression
  - Ride the wave of commodity hardware
• Constant ad-hoc query response time
  - Exploit enormous parallelism (multi-core)
  - Exploit large memories (in-memory database)
  - Scan-based query processing

Blink – a Data Warehouse Accelerator Prototype
• Research prototype underway by IBM Almaden and Boeblingen labs
• Achieve consistent response times for ad hoc queries
• Exploit modern hardware
  - Parallelism in multicore commodity processors
  - Exploit large memory (in-memory DB)
• DBA relief (no database tuning: no indexes, MQTs, etc.)
• Goal: run ad hoc BI queries with consistent response times
• Target: query 1 billion tuples in 1 second for $10K worth of 2007 hardware

Blink – a Data Warehouse Accelerator Prototype
Today: Blink Query Engine
• Most queries in about 3 secs
• 1 $4000 box, 2x4 cores, 16GB
• Google-like experience for BI queries!
• Near optimal compression of relational data
  - Exploits data skew, column correlations and lack of ordering
  - Between 8 and 40x compression
• Querying compressed row-store
  - Directly perform projections and selections on compressed data
  - Efficient hash based aggregation
• Constant query response time: 3 sec/billion tuple (Today)
[Figure: chart plotted against size of query results]

DW/BI Topology with Data Warehouse Accelerator
[Diagram components: OLTP Applications; Operational Data (General Purpose RDBMS, e.g. DB2); ETL; Data Warehouse (General Purpose RDBMS, e.g. DB2); DW Accelerator; BI Applications]

IBM solidDB Acquisition: Key Capabilities
• In-memory, relational database
  - solidDB optimizes data for in-memory access as well, not only for disk
  - Applications can take advantage of its capability through standard ODBC, JDBC, SQL interfaces
• Instant failover
  - solidDB maintains two copies of the data synchronized at all times
  - In case of system failure, applications can recover access to solidDB in well under 1 second without loss of data
• Embeddable
  - solidDB can be deployed in a client/server configuration, or as a library linked into the application process
• Front-end cache for RDBMS
  - June 2008: available as a front-end cache for IBM DB2 & IDS

Concluding Remarks
• Technology advances don't stop!
• However, establishing and enforcing methods based on Best Practices will always be the best protection against failures, whatever the technology may be!

References
[1] Z. Czech et al., An optimal algorithm for generating minimal perfect hash functions, IPL, 43(5), 1992
[2] V. Raman et al., Constant-time Query Processing, IEEE International Conference on Data Engineering, 2008
[3] V. Raman and Garret Swart, How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations, VLDB '06, September 12-15, 2006
[4] Eric Lai, Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest, Computerworld, May 22, 2008
[5] T. Westmann et al., The Implementation and Performance of Compressed Databases, SIGMOD Record, 29(3), 2000

Bi-temporal Model and Surrogate Keys: A Ref. Integrity Issue
Late arrival of Dimension Updates: a Referential Integrity issue

CDRs transformed and loaded on 21.08.2001

Calling Phone Nr | Valid Date | Valid Time | Trx Date   | Called Phone Nr | Cost | Phone ID
02-444-111       | 19.08.2001 | 20:23:06   | 21.08.2001 | 06-132-4326     | 120  | 100
02-444-111       | 20.08.2001 | 09:10:45   | 21.08.2001 | 0331-239-325    | 320  | 100

Phone Dimension: state on 21.08.2001

Phone ID | Phone Nr.  | Init Valid Date | End Valid Date | Init Trx Date | End Trx Date | Cust Key
100      | 02-444-111 | 01.01.2000      | 31.12.9999     | 01.01.2000    | 31.12.9999   | C3

Phone Dimension: state after 23.08.2001

Phone ID | Phone Nr.  | Init Valid Date | End Valid Date | Init Trx Date | End Trx Date | Cust Key
100      | 02-444-111 | 01.01.2000      | 31.12.9999     | 01.01.2000    | 22.08.2001   | C3
101      | 02-444-111 | 01.01.2000      | 19.08.2001     | 23.08.2001    | 31.12.9999   | C3
102      | 02-444-111 | 20.08.2001      | 31.12.9999     | 23.08.2001    | 31.12.9999   | C6

Where are Referential Integrity issues coming from?
(CDR and Phone Dimension tables as on the previous slide.)

    SELECT P.*
    FROM   CDR C, PHONE P
    WHERE  C.CALLING_PHONE_NR = P.PHONE_NR
    AND    '22.08.2001' >= P.INIT_TRX_DATE
    AND    '22.08.2001' <= P.END_TRX_DATE
    AND    C.VALID_DATE >= P.INIT_VALID_DATE
    AND    C.VALID_DATE <= P.END_VALID_DATE

• With transaction date '22.08.2001' the query returns PHONE_ID = 100 for both phone calls.
• With transaction date '23.08.2001' in place of '22.08.2001' it returns PHONE_ID = 101 for the first phone call and PHONE_ID = 102 for the second phone call.
• Both CDRs were loaded with the surrogate key PHONE_ID = 100, so after the late dimension update the stored key no longer matches the row that is valid for either call.

Amdahl's Law (a simplified view)
[Diagram: STEP A to STEP E shown along the elapsed-time axis, first processed serially and then with parallel steps, highlighting the time savings]

$$S_p = \frac{1}{\alpha + \frac{1 - \alpha}{p}}, \qquad S_p \to p \ \text{when}\ \alpha \to 0$$

where \alpha is the serial fraction of the work and p is the number of parallel streams.

"Load" Process (Everything Parallel & Automated)
[Diagram: parallel streams of Input Records, each going through Sort and then Insert or Load (one stream also through Unload), feeding partitions labelled Current Month-6, Current Month-1, Current Day-2, Current Day-1 and Current Month]
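
As a rough illustration of the Amdahl's Law slide and the advice to "consider parallelizing everything", the sketch below computes the speedup for a given serial fraction and number of parallel streams. The function names, the parameter names `serial_fraction` and `streams`, and the 10% figure are assumptions made for this example only; they do not come from the presentation.

```python
def amdahl_speedup(serial_fraction: float, streams: int) -> float:
    """Speedup 1 / (a + (1 - a) / p) for serial fraction a and p parallel streams."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / streams)


def speedup_ceiling(serial_fraction: float) -> float:
    """Upper bound on the speedup as the number of streams grows without limit."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical load job: 10% of the elapsed time cannot be parallelized.
    for p in (1, 2, 4, 8, 16, 64):
        print(f"{p:>3} streams -> speedup {amdahl_speedup(0.10, p):.2f}")
    print(f"ceiling: {speedup_ceiling(0.10):.1f}")
```

Under these assumed numbers, 64 streams give a speedup of roughly 8.8 and no number of streams can exceed 10, which is why shrinking the serial steps (and hot spots) tends to matter more than adding streams.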