Very Large Databases
And
Data Warehousing
Copyright, Harris Corporation & Ophir Frieder, 1998
Objectives
• Learn how space is used within a database.
• Learn about different types of database usage patterns.
• Define a variety of terms and phrases commonly used in the context of
Very Large Databases (VLDBs).
• Discuss vendor capabilities with regard to VLDBs.
Space Usage
• Consider the following all too typical sentences:
“Our system has been successfully demonstrated on database systems as large
as 1TB”
“We at <XYZ company> have extensive experience with very large databases,
i.e., greater than 500GB.”
“The <XYZ system> shall be capable of executing insert queries into a
database of size at least 10GB within 1.5 seconds.”
Space Usage, Cont.
• Such sentences are imprecise and misleading for a variety of reasons.
• What do the following sentence fragments mean?
“...database systems as large as 1TB.”
“...with very large databases, i.e., greater than 500GB.”
“...database of size at least 10GB.”
Space Usage, Cont.
• When interpreting such numbers, the following must be considered:
Platform Size vs. Data Size: It is typical for the supporting platform to have
significantly more storage space (5x or more) than that required by the actual
data set. The extra space is used for a variety of purposes: mirroring, logs,
temporary storage, or performance.
Index Size: An index on a table may take considerable space, even more than
the table itself, depending on characteristics of the index: the number and
types of columns, and padding in index pages.
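As a back-of-envelope illustration of the platform-vs-data gap, the sketch below multiplies a core data size by factors for indices, mirroring, temporary space, and logs; all of the multipliers are illustrative assumptions, not vendor figures.

```python
# Rough estimate of the platform storage behind a quoted "database size".
# Every factor below is an illustrative assumption, not a measured value.

def platform_storage_gb(core_data_gb,
                        index_factor=1.0,   # indices can rival the table size
                        mirror_copies=1,    # extra full copies for integrity
                        temp_factor=0.5,    # scratch space for joins/sorts
                        log_factor=0.2):    # transaction logs
    data_and_indices = core_data_gb * (1 + index_factor)
    mirrored = data_and_indices * (1 + mirror_copies)
    return mirrored + core_data_gb * (temp_factor + log_factor)

# Under these assumptions, 200GB of core data needs roughly 5x the space:
print(round(platform_storage_gb(200)))  # -> 940
```

This is exactly why a vendor's "1TB database" claim may describe far less core data than the number suggests.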
Space Usage, Cont.
Use of Extra Space: Padding disk pages can reduce lock contention, thereby
improving performance. Joins and sorts typically require “large” temporary
storage areas.
Redundant Data: De-normalizing (for performance) will typically introduce
redundant data. Mirroring data (for integrity) will duplicate data.
Space Usage, Cont.
• Note that these are all legitimate uses of space.
• However, it is common to want a measure of the “core” data size, which
excludes indices, free space, and redundancy.
• Lesson => define your terms, and ask questions!
Database Types
& Usage Patterns
– On-Line Transaction Processing (OLTP)
– Decision Support System (DSS)
– Data Warehousing
– Data Mart
– On-Line Analytical Processing (OLAP)
– Data Mining
– Very Large Databases (VLDBs)
OLTP & DSS
On-Line Transaction Processing: OLTP applications are update-intensive and
generally consist of shorter transactions that access a small portion of a
database, often through a primary key or index.
Decision Support System: DSS applications typically consist of long and
often complex read-only queries that access large portions of the database.
– Transaction Processing Performance Council
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile
collection of data in support of management’s decision needs. Data entering a
data warehouse comes from an operational environment in almost every case. The
data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. In addition, there are four
levels of data in the warehouse: old detail, current detail, lightly summarized data,
and highly summarized data.
– W.H. Inmon, the “father” of the data warehousing concept
Data Warehousing, Cont.
[Diagram: several Sources feed an Integration layer, which loads the Data
Warehouse; DSS Users query the warehouse.]
Data Warehousing, Cont.
• Purpose:
– To separate OLTP and DSS operations into different systems, thereby
allowing each to be tuned appropriately.
• Data Sources:
– A data warehouse typically has several sources of data, often OLTP
systems.
– Sources do not have to be homogeneous (different/same data,
different/same formats, etc.)
Data Warehousing, Cont.
• Data Integration:
– Integration is done periodically, not continuously.
– Data is filtered; not all data is used by the DSS analyst.
– Data is put into a consistent format.
– Data is often simply appended.
– Data is typically given a time-stamp: when the data was integrated, or
when it was valid.
– Once in the warehouse, data is typically not modified.
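A minimal sketch of the integration steps above: periodic load, filtering, consistent formatting, time-stamping, and append-only storage. The field names and the filtering rule are hypothetical.

```python
# Periodic (not continuous) integration into an append-only warehouse.
# Field names and the "complete" filter are hypothetical examples.
from datetime import datetime, timezone

warehouse = []  # rows are appended and, once loaded, not modified

def integrate(source_rows):
    loaded_at = datetime.now(timezone.utc).isoformat()  # integration time-stamp
    for row in source_rows:
        if row.get("status") != "complete":   # filter: not all data is used
            continue
        warehouse.append({
            "customer": row["cust_name"].strip().upper(),  # consistent format
            "amount_usd": float(row["amt"]),
            "loaded_at": loaded_at,
        })

integrate([{"status": "complete", "cust_name": " acme ", "amt": "19.99"},
           {"status": "pending",  "cust_name": "beta",   "amt": "5"}])
print(len(warehouse), warehouse[0]["customer"])  # -> 1 ACME
```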
Data Warehousing, Cont.
• Internal Components:
– Current Detail Data: The most recently made copy of data from the
operational environment (may in fact be relatively old).
– Old Detail Data: As new data is copied to the current detail data segment,
the previous contents of the current detail data are copied to this segment,
perhaps on a cheaper storage medium.
– Lightly Summarized Data: In anticipation of queries by DSS analysts,
current detail data is summarized (e.g., regional sales by week).
– Highly Summarized Data: A further refinement of lightly summarized
data (e.g., national sales by month).
• Note that these are basically forms of de-normalization.
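These levels can be mimicked with successive rollups over current detail data; the region/week schema and figures below are hypothetical.

```python
# Lightly vs highly summarized data as successive rollups of current detail.
# The region/week schema and figures are hypothetical.
from collections import defaultdict

current_detail = [
    {"region": "East", "week": 1, "sales": 100},
    {"region": "East", "week": 2, "sales": 150},
    {"region": "West", "week": 1, "sales": 200},
]

lightly = defaultdict(int)          # e.g., regional sales by week
for row in current_detail:
    lightly[(row["region"], row["week"])] += row["sales"]

highly = defaultdict(int)           # a further rollup: national sales by week
for (region, week), total in lightly.items():
    highly[week] += total

print(dict(highly))  # -> {1: 300, 2: 150}
```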
Data Warehousing, Cont.
• Subject Orientation:
– Many operational systems are structured around functions or transactions.
– A system for bank tellers might be structured around major functions such
as opening & closing accounts, granting loans, adding & deleting
customers, withdrawing and depositing money, etc.
– A system for bank managers might be structured around assets and
liabilities; accounts, loans, investments, customers.
Data Mart
A data mart is a body of DSS data for a department that has an
architectural foundation of a data warehouse.
– W.H. Inmon
Data Mart, Cont.
• Purpose:
– To eliminate many of the problems resulting from the size, and from the
number of uses, of many data warehouses.
• Data Sources:
– A data warehouse is typically the source of data for a data mart.
• Data Integration:
– Data is filtered, customized, and refined for use by a specific
“department.”
Data Mart, Cont.
• Relative To A Data Warehouse, A Data Mart Typically:
– is smaller
– has fewer users
– is easier to manipulate and analyze
On-Line Analytical Processing
On-Line Analytical Processing (OLAP) is a category of software
technology that enables analysts, managers and executives to gain
insight into data through fast, consistent, interactive access to a wide
variety of possible views of information that has been transformed from
raw data to reflect the real dimensionality of the enterprise as
understood by the user.
– The OLAP Council
On-Line Analytical
Processing, Cont.
• OLAP Functionality Is Characterized By:
– Multi-dimensional analysis of data.
– Support for end-user analytical and navigational activities.
– Calculations and modeling applied across dimensions, through hierarchies
and/or across members.
– Trend analysis over sequential time periods.
– Analysis of historical and projected data in various “what-if” scenarios.
– Implies, at least to a certain extent, a large or very large repository that
has been configured for DSS applications.
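A minimal sketch of multi-dimensional analysis, assuming a toy fact table with two dimensions (product, quarter) and one measure (sales); the data and names are hypothetical.

```python
# Minimal multi-dimensional analysis: one measure (sales) rolled up along
# either of two dimensions (product, quarter). Data is hypothetical.
from itertools import groupby

facts = [("widgets", "Q1", 10), ("widgets", "Q2", 30),
         ("gadgets", "Q1", 5),  ("gadgets", "Q2", 25)]

def rollup(facts, dim):  # dim: 0 = product, 1 = quarter
    keyed = sorted(facts, key=lambda f: f[dim])   # groupby needs sorted input
    return {key: sum(f[2] for f in grp)
            for key, grp in groupby(keyed, key=lambda f: f[dim])}

print(rollup(facts, 0))  # by product -> {'gadgets': 30, 'widgets': 40}
print(rollup(facts, 1))  # by quarter -> {'Q1': 15, 'Q2': 55}
```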
Data Mining
Historically, also known as Knowledge Discovery in Databases
(KDD), which is defined as the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data.
– Frawley, Piatetsky-Shapiro, and Matheus
Data Mining, Cont.
Given a set of facts (data) F, a language L, and some measure of certainty C,
we define a pattern as a statement S in L that describes relationships among a
subset Fs of F with a certainty c, such that S is simpler (in some sense) than
the enumeration of all facts in Fs. A pattern that is interesting (according to a
user-imposed interest measure) and certain enough (again according to the
user’s criteria) is called knowledge. The output of a program that monitors the
set of facts in a database and produces patterns in this sense is discovered
knowledge.
– Frawley, Piatetsky-Shapiro, and Matheus
Data Mining, Cont.
• Numerous definitions exist; most share the following similarities:
– Data is too large to manually analyze; typically data is in a warehouse or
similar large repository.
– Data is being analyzed in a DSS or OLAP manner.
– The goal of data mining is to extract rules or patterns that are easily
understood by a person.
– Patterns are deduced from data, and then presented to a user for
evaluation.
– Assumes that specialized tools are used to extract the patterns.
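A toy rendering of the Frawley et al. definition, where the "pattern" is a rule "X implies Y" over a set of facts and the certainty c is the rule's confidence; the weather data and the 0.8 interest threshold are made up.

```python
# Toy rendering of the Frawley et al. definition: the pattern is a rule
# "X implies Y" over facts F, and the certainty c is the rule's confidence.
# The weather data and the 0.8 interest threshold are made up.

facts = [{"rain": True,  "umbrella": True},
         {"rain": True,  "umbrella": True},
         {"rain": True,  "umbrella": False},
         {"rain": False, "umbrella": False}]

def certainty(facts, x, y):
    covered = [f for f in facts if f[x]]   # the subset Fs where X holds
    return sum(f[y] for f in covered) / len(covered)

c = certainty(facts, "rain", "umbrella")
print(round(c, 3))   # -> 0.667
print(c >= 0.8)      # certain enough to count as "knowledge"? -> False
```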
Very Large Databases
(VLDBs)
• Question:
– What constitutes a very large database?
– For that matter, what constitutes a small, medium or large database?
• Surveys And Studies:
– The Gartner Group (‘96)
– Transaction Processing Performance Council (TPC)
– Winter Corporation
The Gartner Group
                 Small   Medium    Large     Very Large
OLTP             0-10    10-50     50-200    200+
Data Warehouse   0-30    30-100    100-500   500+
– Oracle Magazine, November/December 1996 (sizes given in gigabytes).
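Classifying a database against these thresholds is straightforward; the function name and structure below are my own.

```python
# Classify a database size (GB) against the Gartner ('96) thresholds above.
# Function name and structure are my own.

def gartner_class(size_gb, kind="OLTP"):
    thresholds = {"OLTP": (10, 50, 200), "DW": (30, 100, 500)}
    small, medium, large = thresholds[kind]
    if size_gb <= small:
        return "Small"
    if size_gb <= medium:
        return "Medium"
    if size_gb <= large:
        return "Large"
    return "Very Large"

# The same 75GB system is "Large" as OLTP but only "Medium" as a warehouse:
print(gartner_class(75, "OLTP"))  # -> Large
print(gartner_class(75, "DW"))    # -> Medium
```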
Winter Corporation
Survey - 1998
• OLTP Systems:
– IBM S/390 running DB2 dominates in database size, most rows/records,
and, to a lesser extent, peak on-line activity for OLTP systems.
– The largest OLTP database, which is 16.8TB in total size (11.3TB for data
and 5.5TB for indices), belongs to United Parcel Service (UPS). This
system resides on an IBM S/390 running DB2.
– Compared with other environments, OLTP systems in Unix environments
were significantly smaller, the largest being 600GB.
Winter Corporation
Survey - 1998, Cont.
• DSS Systems:
– The largest database, in terms of both total database size and number of
rows/records, belongs to The Dialog Corporation. This system resides on
Hitachi and Sun platforms running a proprietary database system. The
database is 6.3TB in size and has 1.5 trillion rows/records.
– For DSS databases, nearly all of the ten largest were in Unix
environments.
– The database with the top peak on-line activity is owned by JC Penney.
The system resides on an NCR 5100M running a Teradata database
management system, and executes as many as 784 concurrent queries.
Winter Corporation
Survey - 1998, Cont.
• No Shows:
– There were no reports of a database over 542GB on NT.
– None of the top-ten contenders in any category made use of Sybase.
– No databases were reported that used object-oriented systems.
Winter Corporation
Survey - 1998, Cont.
• Oracle & Informix:
– Outside of Unix, systems using Oracle and Informix were significantly
smaller - they never appeared in the top three of any category.
– Oracle and Informix do much better in all three categories (size,
rows/records, and peak activity) for Unix systems.
Transaction Processing
Performance Council (TPC)
• TPC is an organization whose purpose is to develop and maintain database
benchmarking materials.
• TPC also enforces a relatively strict process for using their materials.
• TPC materials are used and referenced heavily by virtually all major database
hardware and software vendors.
• Results of the benchmarks are published by TPC, and updated regularly as
new results are submitted.
• Not surprisingly, TPC benchmark results also appear regularly in numerous
vendor press releases.
Transaction Processing
Performance Council (TPC)
• TPC has developed several different sets of benchmark materials. Two of the
most frequently referenced are TPC-C and TPC-D, which are OLTP and DSS
benchmarks, respectively.
• Currently, results reported for the TPC-D benchmark include databases
consisting of 30GB, 100GB, 300GB, 1TB, and 3TB of data.
• Historically, TPC has not interpreted or ranked the results. However, such a
ranking is published by Ideas International, and can be accessed at
www.ideasinternational.com.
Transaction Processing
Performance Council (TPC)
• TPC Metrics:
– Power Metric: A measure of the raw query execution power of the system
when connected with a single active user.
– Throughput Metric: A measure of the ability of the system to process the
most queries in the least amount of time.
– Price/Performance Metric: The ratio between total system price and the
query-per-hour-rating, which is a composite of the power and throughput
metrics.
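The composite above can be computed directly: for TPC-D, the query-per-hour rating was the geometric mean of the power and throughput metrics (my reading of the benchmark; the scores and price below are invented).

```python
# The "query-per-hour rating" composite: for TPC-D it was the geometric
# mean of the power and throughput metrics (my reading of the benchmark;
# the prices and scores below are invented).
from math import sqrt

def qphd(power, throughput):
    return sqrt(power * throughput)

def price_performance(system_price, power, throughput):
    return system_price / qphd(power, throughput)

print(qphd(400.0, 100.0))                          # -> 200.0
print(price_performance(1_000_000, 400.0, 100.0))  # $/QphD -> 5000.0
```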
Highlights From The
TPC-D Results As Of 4/21/98
• TPC-D 100GB:
– NCR WorldMark 4700 running a Teradata database had the highest
ranking on power and throughput.
– IBM RS/6000 running DB2 had the three highest scores in price/performance.
– IBM RS/6000 running DB2 also came in third and fourth, for both power and
throughput.
– Although it did not capture the top spot in any of the three categories,
Oracle, running on a variety of platforms, captured six of the top ten spots
on power, five of the top ten spots on throughput, and five of the top ten
spots on price/performance.
– Informix ranked within the top ten in each category, appearing in the sixth
position on power, the ninth position on throughput, and the sixth and
tenth positions on price/performance.
Transaction Processing
Performance Council (TPC)
• TPC-D 300GB:
– NCR WorldMark 5150 running a Teradata database had the highest
ranking on power and throughput.
– DG AViiON AV20000 running Oracle came in first on price/performance.
– As with the 100GB category, Oracle appeared very strong in all three
categories, capturing six of the top ten spots in power, seven of the top ten
spots on throughput, and six of the top ten spots in price/performance.
– Informix appeared slightly stronger in the 300GB category, capturing
three of the top ten spots in power, throughput, and price/performance.
Transaction Processing
Performance Council (TPC)
• TPC-D 1TB:
– Sun Ultra Enterprise 6000 running an Informix Dynamic Server
captured the top spot in all three metrics: power, throughput, and
price/performance.
– Although neither captured the top spot on any of the three metrics,
both NCR WorldMark 5150 running Teradata and IBM RS/6000 running
DB2 appeared within the top four in all three metrics.
Transaction Processing
Performance Council (TPC)
• Other Notes:
– For these TPC-D results, twenty competitors submitted at the 100GB
level, twelve submitted at 300GB, and four submitted at 1TB.
– Although several entries using Sybase were submitted to the TPC-C
benchmark, none were submitted for TPC-D.
– No results were submitted, for either TPC-C or TPC-D, using purely
object-oriented systems such as ObjectStore.
Space Related Terminology
Kilobyte   (KB)   2^10 or 1024 bytes
Megabyte   (MB)   2^20 or 1024 kilobytes
Gigabyte   (GB)   2^30 or 1024 megabytes
Terabyte   (TB)   2^40 or 1024 gigabytes
Petabyte   (PB)   2^50 or 1024 terabytes
Exabyte    (EB)   2^60 or 1024 petabytes
Zettabyte  (ZB)   2^70 or 1024 exabytes
Yottabyte  (YB)   2^80 or 1024 zettabytes
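The table translates directly into a conversion helper using powers of two:

```python
# The powers-of-two units from the table, as a conversion helper.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    return value * 2 ** (10 * (UNITS.index(unit) + 1))

print(to_bytes(1, "TB"))                              # -> 1099511627776
print(to_bytes(1, "TB") == 1024 * to_bytes(1, "GB"))  # -> True
```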