Very Large Databases
And
Data Warehousing
Copyright, Harris Corporation & Ophir Frieder, 1998
Objectives
• Learn how space is used within a database.
• Learn about different types of database usage patterns.
• Define a variety of terms and phrases commonly used in the context of
Very Large Databases (VLDBs).
• Discuss vendor capabilities with regard to VLDBs.
Space Usage
• Consider the following all too typical sentences:
“Our system has been successfully demonstrated on database systems as large
as 1TB”
“We at <XYZ company> have extensive experience with very large databases,
i.e., greater than 500GB.”
“The <XYZ system> shall be capable of executing insert queries into a
database of size at least 10GB within 1.5 seconds.”
Space Usage, Cont.
• Such sentences are imprecise and misleading for a variety of reasons.
• What do the following sentence fragments mean?
“...database systems as large as 1TB.”
“...with very large databases, i.e., greater than 500GB.”
“...database of size at least 10GB.”
Space Usage, Cont.
• When interpreting such numbers, the following must be considered:
Platform Size vs. Data Size: It is typical for the supporting platform to have
significantly more storage space (5x or more) than that required by the actual
data set. The extra space is used for a variety of purposes: mirroring, logs,
temporary storage, or performance.
Index Size: An index on a table may take considerable space, even more than
the table itself, depending on characteristics of the index: the number and
types of columns, and padding in index pages.
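As a back-of-envelope illustration of the platform-vs-data gap, the sketch below multiplies a core data size by factors for indices, mirroring, temporary space, and logs; all of the multipliers are illustrative assumptions, not vendor figures.

```python
# Rough estimate of the platform storage behind a quoted "database size".
# Every factor below is an illustrative assumption, not a measured value.

def platform_storage_gb(core_data_gb,
                        index_factor=1.0,   # indices can rival the table size
                        mirror_copies=1,    # extra full copies for integrity
                        temp_factor=0.5,    # scratch space for joins/sorts
                        log_factor=0.2):    # transaction logs
    data_and_indices = core_data_gb * (1 + index_factor)
    mirrored = data_and_indices * (1 + mirror_copies)
    return mirrored + core_data_gb * (temp_factor + log_factor)

# Under these assumptions, 200GB of core data needs roughly 5x the space:
print(round(platform_storage_gb(200)))  # -> 940
```

This is exactly why a vendor's "1TB database" claim may describe far less core data than the number suggests.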
Space Usage, Cont.
Use of Extra Space: Padding disk pages can reduce lock contention, thereby
improving performance. Joins and sorts typically require “large” temporary
storage areas.
Redundant Data: De-normalizing (for performance) will typically introduce
redundant data. Mirroring data (for integrity) will duplicate data.
Space Usage, Cont.
• Note that these are all legitimate uses of space.
• However, it is common to want a measure of the “core” data size, which
excludes indices, free space, and redundancy.
• Lesson => define your terms, and ask questions!
Database Types
& Usage Patterns
– On-Line Transaction Processing (OLTP)
– Decision Support System (DSS)
– Data Warehousing
– Data Mart
– On-Line Analytical Processing (OLAP)
– Data Mining
– Very Large Databases (VLDBs)
OLTP & DSS
On-Line Transaction Processing: OLTP applications are update-intensive and
generally consist of shorter transactions that access a small portion of a
database, often through a primary key or index.
Decision Support System: DSS applications typically consist of long and
often complex read-only queries that access large portions of the database.
– Transaction Processing Performance Council
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile
collection of data in support of management’s decision needs. Data entering a
data warehouse comes from an operational environment in almost every case. The
data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. In addition, there are four
levels of data in the warehouse: old detail, current detail, lightly summarized data,
and highly summarized data.
– W.H. Inmon, the “father” of the data warehousing concept
Data Warehousing, Cont.
[Diagram: several Sources feed an Integration layer, which loads the Data
Warehouse; DSS Users query the warehouse.]
Data Warehousing, Cont.
• Purpose:
– To separate OLTP and DSS operations into different systems, thereby
allowing each to be tuned appropriately.
• Data Sources:
– A data warehouse typically has several sources of data, often OLTP
systems.
– Sources do not have to be homogeneous (different/same data,
different/same formats, etc.)
Data Warehousing, Cont.
• Data Integration:
– Integration is done periodically, not continuously.
– Data is filtered; not all data is used by the DSS analyst.
– Data is put into a consistent format.
– Data is often simply appended.
– Data is typically given a time-stamp: when the data was integrated, or
when it was valid.
– Once in the warehouse, data is typically not modified.
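A minimal sketch of the integration steps above: periodic load, filtering, consistent formatting, time-stamping, and append-only storage. The field names and the filtering rule are hypothetical.

```python
# Periodic (not continuous) integration into an append-only warehouse.
# Field names and the "complete" filter are hypothetical examples.
from datetime import datetime, timezone

warehouse = []  # rows are appended and, once loaded, not modified

def integrate(source_rows):
    loaded_at = datetime.now(timezone.utc).isoformat()  # integration time-stamp
    for row in source_rows:
        if row.get("status") != "complete":   # filter: not all data is used
            continue
        warehouse.append({
            "customer": row["cust_name"].strip().upper(),  # consistent format
            "amount_usd": float(row["amt"]),
            "loaded_at": loaded_at,
        })

integrate([{"status": "complete", "cust_name": " acme ", "amt": "19.99"},
           {"status": "pending",  "cust_name": "beta",   "amt": "5"}])
print(len(warehouse), warehouse[0]["customer"])  # -> 1 ACME
```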
Data Warehousing, Cont.
• Internal Components:
– Current Detail Data: The most recently made copy of data from the
operational environment (may in fact be relatively old).
– Old Detail Data: As new data is copied to the current detail data segment,
the previous contents of the current detail data are copied to this segment,
perhaps on a cheaper storage medium.
– Lightly Summarized Data: In anticipation of queries by DSS analysts,
current detail data is summarized (e.g., regional sales by week).
– Highly Summarized Data: A further refinement of lightly summarized
data (e.g., national sales by month).
• Note that these are basically forms of de-normalization.
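These levels can be mimicked with successive rollups over current detail data; the region/week schema and figures below are hypothetical.

```python
# Lightly vs highly summarized data as successive rollups of current detail.
# The region/week schema and figures are hypothetical.
from collections import defaultdict

current_detail = [
    {"region": "East", "week": 1, "sales": 100},
    {"region": "East", "week": 2, "sales": 150},
    {"region": "West", "week": 1, "sales": 200},
]

lightly = defaultdict(int)          # e.g., regional sales by week
for row in current_detail:
    lightly[(row["region"], row["week"])] += row["sales"]

highly = defaultdict(int)           # a further rollup: national sales by week
for (region, week), total in lightly.items():
    highly[week] += total

print(dict(highly))  # -> {1: 300, 2: 150}
```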
Data Warehousing, Cont.
• Subject Orientation:
– Many operational systems are structured around functions or transactions.
– A system for bank tellers might be structured around major functions such
as opening & closing accounts, granting loans, adding & deleting
customers, withdrawing and depositing money, etc.
– A system for bank managers might be structured around assets and
liabilities; accounts, loans, investments, customers.
Data Mart
A data mart is a body of DSS data for a department that has an
architectural foundation of a data warehouse.
– W.H. Inmon
Data Mart, Cont.
• Purpose:
– To eliminate many of the problems resulting from the size, and from the
number of uses, of many data warehouses.
• Data Sources:
– A data warehouse is typically the source of data for a data mart.
• Data Integration:
– Data is filtered, customized, and refined for use by a specific
“department.”
Data Mart, Cont.
• Relative To A Data Warehouse, A Data Mart Typically:
– is smaller
– has fewer users
– is easier to manipulate and analyze
On-Line Analytical Processing
On-Line Analytical Processing (OLAP) is a category of software
technology that enables analysts, managers and executives to gain
insight into data through fast, consistent, interactive access to a wide
variety of possible views of information that has been transformed from
raw data to reflect the real dimensionality of the enterprise as
understood by the user.
– The OLAP Council
On-Line Analytical
Processing, Cont.
• OLAP Functionality Is Characterized By:
– Multi-dimensional analysis of data.
– Support for end-user analytical and navigational activities.
– Calculations and modeling applied across dimensions, through hierarchies
and/or across members.
– Trend analysis over sequential time periods.
– Analysis of historical and projected data in various “what-if” scenarios.
– Implies, at least to a certain extent, a large or very large repository that
has been configured for DSS applications.
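A minimal sketch of multi-dimensional analysis, assuming a toy fact table with two dimensions (product, quarter) and one measure (sales); the data and names are hypothetical.

```python
# Minimal multi-dimensional analysis: one measure (sales) rolled up along
# either of two dimensions (product, quarter). Data is hypothetical.
from itertools import groupby

facts = [("widgets", "Q1", 10), ("widgets", "Q2", 30),
         ("gadgets", "Q1", 5),  ("gadgets", "Q2", 25)]

def rollup(facts, dim):  # dim: 0 = product, 1 = quarter
    keyed = sorted(facts, key=lambda f: f[dim])   # groupby needs sorted input
    return {key: sum(f[2] for f in grp)
            for key, grp in groupby(keyed, key=lambda f: f[dim])}

print(rollup(facts, 0))  # by product -> {'gadgets': 30, 'widgets': 40}
print(rollup(facts, 1))  # by quarter -> {'Q1': 15, 'Q2': 55}
```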
Data Mining
Historically, also known as Knowledge Discovery in Databases
(KDD), which is defined as the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data.
– Frawley, Piatetsky-Shapiro, and Matheus
Data Mining, Cont.
Given a set of facts (data) F, a language L, and some measure of certainty C,
we define a pattern as a statement S in L that describes relationships among a
subset Fs of F with a certainty c, such that S is simpler (in some sense) than
the enumeration of all facts in Fs. A pattern that is interesting (according to a
user-imposed interest measure) and certain enough (again according to the
user’s criteria) is called knowledge. The output of a program that monitors the
set of facts in a database and produces patterns in this sense is discovered
knowledge.
– Frawley, Piatetsky-Shapiro, and Matheus
Data Mining, Cont.
• Numerous definitions exist; most share the following similarities:
– Data is too large to manually analyze; typically data is in a warehouse or
similar large repository.
– Data is being analyzed in a DSS or OLAP manner.
– The goal of data mining is to extract rules or patterns that are easily
understood by a person.
– Patterns are deduced from data, and then presented to a user for
evaluation.
– Assumes that specialized tools are used to extract the patterns.
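A toy rendering of the Frawley et al. definition, where the "pattern" is a rule "X implies Y" over a set of facts and the certainty c is the rule's confidence; the weather data and the 0.8 interest threshold are made up.

```python
# Toy rendering of the Frawley et al. definition: the pattern is a rule
# "X implies Y" over facts F, and the certainty c is the rule's confidence.
# The weather data and the 0.8 interest threshold are made up.

facts = [{"rain": True,  "umbrella": True},
         {"rain": True,  "umbrella": True},
         {"rain": True,  "umbrella": False},
         {"rain": False, "umbrella": False}]

def certainty(facts, x, y):
    covered = [f for f in facts if f[x]]   # the subset Fs where X holds
    return sum(f[y] for f in covered) / len(covered)

c = certainty(facts, "rain", "umbrella")
print(round(c, 3))   # -> 0.667
print(c >= 0.8)      # certain enough to count as "knowledge"? -> False
```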
Very Large Databases
(VLDBs)
• Question:
– What constitutes a very large database?
– For that matter, what constitutes a small, medium or large database?
• Surveys And Studies:
– The Gartner Group (‘96)
– Transaction Processing Performance Council (TPC)
– Winter Corporation
The Gartner Group
                 Small   Medium    Large     Very Large
OLTP             0-10    10-50     50-200    200+
Data Warehouse   0-30    30-100    100-500   500+
– Oracle Magazine, November/December 1996 (sizes given in gigabytes).
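Classifying a database against these thresholds is straightforward; the function name and structure below are my own.

```python
# Classify a database size (GB) against the Gartner ('96) thresholds above.
# Function name and structure are my own.

def gartner_class(size_gb, kind="OLTP"):
    thresholds = {"OLTP": (10, 50, 200), "DW": (30, 100, 500)}
    small, medium, large = thresholds[kind]
    if size_gb <= small:
        return "Small"
    if size_gb <= medium:
        return "Medium"
    if size_gb <= large:
        return "Large"
    return "Very Large"

# The same 75GB system is "Large" as OLTP but only "Medium" as a warehouse:
print(gartner_class(75, "OLTP"))  # -> Large
print(gartner_class(75, "DW"))    # -> Medium
```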
Winter Corporation
Survey - 1998
• OLTP Systems:
– IBM S/390 running DB2 dominates in database size, most rows/records,
and, to a lesser extent, peak on-line activity for OLTP systems.
– The largest OLTP database, which is 16.8TB in total size (11.3TB for data
and 5.5TB for indices), belongs to United Parcel Service (UPS). This
system resides on an IBM S/390 running DB2.
– Compared with other environments, OLTP systems in Unix environments
were significantly smaller, the largest being 600GB.
Winter Corporation
Survey - 1998, Cont.
• DSS Systems:
– The largest database, in terms of both total database size and number of
rows/records, belongs to The Dialog Corporation. This system resides on
Hitachi and Sun platforms running a proprietary database system. The
database is 6.3TB in size and has 1.5 trillion rows/records.
– For DSS databases, nearly all of the ten largest were in Unix
environments.
– The database with the top peak on-line activity is owned by JC Penney.
The system resides on an NCR 5100M running a Teradata database
management system, and executes as many as 784 concurrent queries.
Winter Corporation
Survey - 1998, Cont.
• No Shows:
– There were no reports of a database over 542GB on NT.
– None of the top-ten contenders in any category made use of Sybase.
– No databases were reported that used object-oriented systems.
Winter Corporation
Survey - 1998, Cont.
• Oracle & Informix:
– Outside of Unix, systems using Oracle and Informix were significantly
smaller - they never appeared in the top three of any category.
– Oracle and Informix do much better in all three categories (size,
rows/records, and peak activity) for Unix systems.
Transaction Processing
Performance Council (TPC)
• TPC is an organization whose purpose is to develop and maintain database
benchmarking materials.
• TPC also enforces a relatively strict process for using their materials.
• TPC materials are used and referenced heavily by virtually all major database
hardware and software vendors.
• Results of the benchmarks are published by TPC, and updated regularly as
new results are submitted.
• Not surprisingly, TPC benchmark results also appear regularly in numerous
vendor press releases.
Transaction Processing
Performance Council (TPC)
• TPC has developed several different sets of benchmark materials. Two of the
most frequently referenced are TPC-C and TPC-D, which are OLTP and DSS
benchmarks, respectively.
• Currently, results reported for the TPC-D benchmark include databases
consisting of 30GB, 100GB, 300GB, 1TB, and 3TB of data.
• Historically, TPC has not interpreted or ranked the results. However, such a
ranking is published by Ideas International, and can be accessed at
www.ideasinternational.com.
Transaction Processing
Performance Council (TPC)
• TPC Metrics:
– Power Metric: A measure of the raw query execution power of the system
when connected with a single active user.
– Throughput Metric: A measure of the ability of the system to process the
most queries in the least amount of time.
– Price/Performance Metric: The ratio between total system price and the
query-per-hour-rating, which is a composite of the power and throughput
metrics.
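The composite above can be computed directly: for TPC-D, the query-per-hour rating was the geometric mean of the power and throughput metrics (my reading of the benchmark; the scores and price below are invented).

```python
# The "query-per-hour rating" composite: for TPC-D it was the geometric
# mean of the power and throughput metrics (my reading of the benchmark;
# the prices and scores below are invented).
from math import sqrt

def qphd(power, throughput):
    return sqrt(power * throughput)

def price_performance(system_price, power, throughput):
    return system_price / qphd(power, throughput)

print(qphd(400.0, 100.0))                          # -> 200.0
print(price_performance(1_000_000, 400.0, 100.0))  # $/QphD -> 5000.0
```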
Highlights From The
TPC-D Results As Of 4/21/98
• TPC-D 100GB:
– NCR WorldMark 4700 running a Teradata database had the highest
ranking on power and throughput.
– IBM RS/6000 running DB2 had the three highest scores in price/performance.
– IBM RS/6000 running DB2 also came in third and fourth, for both power and
throughput.
– Although it did not capture the top spot in any of the three categories,
Oracle, running on a variety of platforms, captured six of the top ten spots
on power, five of the top ten spots on throughput, and five of the top ten
spots on price/performance.
– Informix ranked within the top ten in each category, appearing in the sixth
position on power, the ninth position on throughput, and the sixth and
tenth positions on price/performance.
Transaction Processing
Performance Council (TPC)
• TPC-D 300GB:
– NCR WorldMark 5150 running a Teradata database had the highest
ranking on power and throughput.
– DG AViiON AV20000 running Oracle came in first on price/performance.
– As with the 100GB category, Oracle appeared very strong in all three
categories, capturing six of the top ten spots in power, seven of the top ten
spots on throughput, and six of the top ten spots in price/performance.
– Informix appeared slightly stronger in the 300GB category, capturing
three of the top ten spots in power, throughput, and price/performance.
Transaction Processing
Performance Council (TPC)
• TPC-D 1TB:
– Sun Ultra Enterprise 6000 running an Informix Dynamic Server
captured the top spot in all three metrics: power, throughput, and
price/performance.
– Although neither captured the top spot on any of the three metrics,
both NCR WorldMark 5150 running Teradata and IBM RS/6000 running
DB2 appeared within the top four in all three metrics.
Transaction Processing
Performance Council (TPC)
• Other Notes:
– For these TPC-D results, twenty competitors submitted at the 100GB
level, twelve submitted at 300GB, and four submitted at 1TB.
– Although several entries using Sybase were submitted to the TPC-C
benchmark, none were submitted for TPC-D.
– No results were submitted, for either TPC-C or TPC-D, using purely
object-oriented systems such as ObjectStore.
Space Related Terminology
Kilobyte   (KB)   2^10 or 1024 bytes
Megabyte   (MB)   2^20 or 1024 kilobytes
Gigabyte   (GB)   2^30 or 1024 megabytes
Terabyte   (TB)   2^40 or 1024 gigabytes
Petabyte   (PB)   2^50 or 1024 terabytes
Exabyte    (EB)   2^60 or 1024 petabytes
Zettabyte  (ZB)   2^70 or 1024 exabytes
Yottabyte  (YB)   2^80 or 1024 zettabytes
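The table translates directly into a conversion helper using powers of two:

```python
# The powers-of-two units from the table, as a conversion helper.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    return value * 2 ** (10 * (UNITS.index(unit) + 1))

print(to_bytes(1, "TB"))                              # -> 1099511627776
print(to_bytes(1, "TB") == 1024 * to_bytes(1, "GB"))  # -> True
```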