Data Warehousing
Fundamentals
Volume 2 • Student Guide
.......................................................................................
50102GC20
Production 2.0
May 1999
M08762
Authors: Chon S. Chua, Richard Green
Technical Contributors and Reviewers: Jackie Collins, Jennifer Jacoby, Mike Schmitz,
John Haydu, Russ Pitts, Lauran Serhal, Brian Pottle, Donna Corrigan, Patricia Moll,
Harry Penbert, SuiWah Chan, Joel Barkin, Steve Dressler
Publisher: Tony McGettigan
Copyright © Oracle Corporation, 1999. All rights reserved.
This documentation contains proprietary information of Oracle Corporation. It is
provided under a license agreement containing restrictions on use and disclosure
and is also protected by copyright law. Reverse engineering of the software is
prohibited. If this documentation is delivered to a U.S. Government Agency of the
Department of Defense, then it is delivered with Restricted Rights and the
following legend is applicable:
Restricted Rights Legend
Use, duplication or disclosure by the Government is subject to restrictions for
commercial computer software and shall be deemed to be Restricted Rights
software under Federal law, as set forth in subparagraph (c) (1) (ii) of DFARS
252.227-7013, Rights in Technical Data and Computer Software (October 1988).
This material or any portion of it may not be copied in any form or by any means
without the express prior written permission of Oracle Corporation. Any other
copying is a violation of copyright law and may result in civil and/or criminal
penalties.
If this documentation is delivered to a U.S. Government Agency not within the
Department of Defense, then it is delivered with “Restricted Rights,” as defined in
FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).
The information in this document is subject to change without notice. If you find
any problems in the documentation, please report them in writing to Education
Products, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores,
CA 94065. Oracle Corporation does not warrant that this document is error-free.
Data Warehouse Method—A Methodology for Designing Data Warehouse,
SQL*Loader, PL/SQL, Pro*C, Oracle7, Oracle8, and Oracle8i, Distributed Option,
Parallel Query Option, Parallel Server Option, Media Server, Spatial Data Option,
ConText Option, Video Server, Text Server, WebServer, Oracle Universal Server
ROLAP Option, Express Server, Web-enabled Express Server, SQL*Net,
Developer/2000, Relational Access Manager, Discoverer, Designer/2000,
SQL*Bridge, Transparent Gateway Developer’s Kit, Procedural Gateway
Developer’s Kit, Express, Express Analyzer, Express Objects, Sales Analyzer,
and Financial Analyzer are product names, trademarks, or registered trademarks
of Oracle Corporation.
All other products or company names are used for identification purposes only
and may be trademarks of their respective owners.
Contents
.....................................................................................................................................................
Preface
Profile xi
Related Publications xiv
Typographic Conventions xv
Lesson 1: Introduction
Course Objectives 1-3
Agenda 1-5
Questions About You 1-9
Lesson 2: Meeting a Business Need
Overview 2-3
Unsuitability of OLTP Systems for Complex Analysis 2-5
Management Information Systems and Decision Support 2-7
Data Extract Processing 2-9
Business Drivers for Data Warehouses 2-15
Current Situation and Growth of Data Warehousing 2-19
Typical Uses of a Data Warehouse 2-21
Summary 2-23
Practice 2-1 2-25
Lesson 3: Defining Data Warehouse Concepts and Terminology
Overview 3-3
Data Warehouse Definition 3-5
Data Warehouse Properties 3-7
Data Warehouse Terminology 3-21
Components of a Data Warehouse 3-25
Oracle Warehouse Vision, Products, and Services 3-31
Summary 3-41
Practice 3-1 3-43
Lesson 4: Driving Implementation Through a Methodology
Overview 4-3
Warehouse Development Approaches 4-5
The Need for an Iterative and Incremental Methodology 4-13
Oracle Data Warehouse Method 4-15
DWM Fundamental Elements 4-19
Oracle Warehouse Technology Initiative (WTI) 4-57
Summary 4-61
Practice 4-1 4-63
Lesson 5: Planning for a Successful Warehouse
Overview 5-3
Managing Financial Issues 5-5
Obtaining Business Commitment 5-9
Managing a Warehouse Project 5-15
Identifying Planning Phases 5-29
Identifying Warehouse Strategy Phase Deliverables 5-31
Identifying Project Scope Phase Deliverables 5-35
Summary 5-41
Practice 5-1 5-43
Lesson 6: Analyzing User Query Needs
Overview 6-3
Types of Users 6-5
Gathering User Requirements 6-7
Managing User Data Access 6-9
Security 6-21
OLAP 6-25
Query Access Architectures 6-47
Summary 6-51
Practice 6-1 6-53
Lesson 7: Modeling the Data Warehouse
Overview 7-3
Data Warehouse Database Design Phases 7-5
Phase One: Defining the Business Model 7-7
Phase Two: Creating the Dimensional Model 7-17
Data Modeling Tools 7-39
Summary 7-41
Practice 7-1 7-43
Lesson 8: Choosing a Computing Architecture
Overview 8-3
Architecture Requirements 8-5
The Hardware Architecture 8-7
Database Server Requirements 8-29
Parallel Processing 8-33
Summary 8-39
Practice 8-1 8-41
Lesson 9: Planning Warehouse Storage
Overview 9-3
The Server Data Architecture 9-5
Protecting the Database 9-17
Summary 9-27
Practice 9-1 9-29
Lesson 10: Building the Warehouse
Overview 10-3
Extracting, Transforming, and Transporting Data 10-5
Extracting Data 10-13
Examining Data Sources 10-15
Extraction Techniques 10-23
Extraction Tools 10-35
Summary 10-39
Practice 10-1 10-41
Lesson 11: Transforming Data
Overview 11-3
Importance of Data Quality 11-5
Transformation 11-13
Transforming Data: Problems and Solutions 11-17
Transformation Techniques 11-33
Transformation Tools 11-53
Summary 11-57
Practice 11-1 11-59
Lesson 12: Transportation: Loading Warehouse Data
Overview 12-3
Transporting Data into the Warehouse 12-5
Building the Transportation Process 12-11
Transporting the Data 12-15
Postprocessing of Loaded Data 12-25
Summary 12-39
Practice 12-1 12-41
Lesson 13: Transportation: Refreshing Warehouse Data
Overview 13-3
Capturing Changed Data 13-5
Limitations of Methods for Applying Changes 13-25
Purging and Archiving Data 13-33
Final Tasks 13-39
Selecting ETT Tools 13-43
Summary 13-51
Practice 13-1 13-53
Lesson 14: Leaving a Metadata Trail
Overview 14-3
Defining Warehouse Metadata 14-5
Developing a Metadata Strategy 14-11
Examining Types of Metadata 14-19
Metadata Management Tools 14-33
Common Warehouse Metadata 14-35
Summary 14-37
Practice 14-1 14-39
Lesson 15: Supporting End-User Access
Overview 15-3
Business Intelligence 15-5
Multidimensional Query Techniques 15-7
Categories of Business Intelligence Tools 15-9
Data Mining in a Warehouse Environment 15-19
Oracle Data Mining Partners 15-33
Summary 15-35
Practice 15-1 15-37
Lesson 16: Web-Enabling the Warehouse
Overview 16-3
Accessing the Warehouse Over the Web 16-5
Common Web Data Warehouse Architecture 16-9
Issues in Deploying a Data Warehouse on the Web 16-11
Evaluating Web-Based Tools 16-19
Summary 16-23
Practice 16-1 16-25
Lesson 17: Managing the Data Warehouse
Overview 17-3
Managing the Transition to Production 17-5
Managing Growth 17-19
Managing Backup and Recovery 17-33
Identifying Data Warehouse Performance Issues 17-45
Summary 17-51
Appendix A: Practice Solutions
Practice 2-1 A-2
Practice 3-1 A-4
Practice 4-1 A-7
Practice 5-1 A-11
Practice 6-1 A-12
Practice 7-1 A-13
Practice 8-1 A-14
Practice 9-1 A-15
Practice 10-1 A-18
Practice 11-1 A-20
Practice 12-1 A-21
Practice 13-1 A-23
Practice 14-1 A-24
Practice 15-1 A-26
Practice 16-1 A-28
Glossary
Lesson 10: Building the Warehouse
Slide: Overview
[Diagram: course roadmap with the blocks Meeting a Business Need; Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Modeling the Data Warehouse; Choosing a Computing Architecture; Planning Warehouse Storage; ETT (Building the Warehouse); Supporting End User Access; and Managing the Data Warehouse, plus Project Management (Methodology, Maintaining Metadata).]
Overview
In this lesson, you explore the sources of data for the data warehouse. You
consider how the extraction and transformation processes take data from source
systems and change it into data that is acceptable to the users of the data warehouse.
The lesson also describes typical data anomalies and looks at ways to eliminate them.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview
slide.
Objectives
After completing this lesson, you should be able to do the following:
• Outline the extraction, transformation, and transportation processes for building a
data warehouse.
• Identify extraction issues.
• Explain how to examine data sources.
• Identify extraction techniques.
• List tools that can be used to extract data from sources.
Slide: Extraction/Transformation/Transportation Processes (ETT)
• Extract source data
• Transform/clean data
• Index and summarize
• Load data into WH
• Detect changes
• Refresh data
[Diagram: programs, gateways, and tools carry out ETT between the operational systems and the warehouse.]
Extracting, Transforming, and Transporting Data
Extraction, Transformation, and Transportation Tasks
Although this lesson focuses on extraction, you should be aware that extraction,
transformation, and transportation (sometimes called ETT) describes the series of
processes that:
• Extract data from source systems
• Transform and clean up the data
• Index the data
• Summarize the data
• Load data into the warehouse
• Detect the changes made to source data required for the warehouse
• Restructure keys
• Maintain the metadata
• Refresh the warehouse with updated data
You can use custom programming, gateways between database systems, and internally
developed tools or vendor tools to carry out the ETT processes.
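The order of the core tasks listed above can be illustrated with a minimal Python sketch. It is not taken from the course: the record layout, the function names, and the in-memory list standing in for the warehouse are all assumptions chosen for illustration, and only the extract, transform, summarize, and load steps are shown.

```python
from dataclasses import dataclass


@dataclass
class SourceRow:
    """One record as it might exist in a hypothetical operational system."""
    customer_id: str
    name: str
    amount: str  # assumed to arrive as text from the source


def extract(source_rows, wanted_ids=None):
    """Extract: pull only the rows the warehouse needs from the source."""
    return [r for r in source_rows
            if wanted_ids is None or r.customer_id in wanted_ids]


def transform(rows):
    """Transform and clean up: trim text, normalize case, convert amounts."""
    return [{"customer_id": r.customer_id.strip(),
             "name": r.name.strip().title(),
             "amount": float(r.amount)} for r in rows]


def load(warehouse, rows):
    """Load: append transformed rows to the warehouse (a plain list here)."""
    warehouse.extend(rows)
    return len(rows)


def summarize(rows):
    """Summarize: total amount per customer."""
    totals = {}
    for row in rows:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


source = [
    SourceRow("C1", "  abc co ", "100.50"),
    SourceRow("C2", "mcd co", "20.00"),
    SourceRow("C1", "abc co", "9.50"),
]
warehouse = []
loaded = load(warehouse, transform(extract(source)))
totals = summarize(warehouse)
print(loaded, totals)
```

In a real implementation each function would be replaced by a program, gateway call, or tool step; the sketch only shows how the stages hand data to one another.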
Slide: ETT Processes
• Must result in data that is relevant, useful, high-quality, accurate, and accessible
• Require a large proportion of warehouse development time and resources
[Diagram: ETT cleans up, consolidates, and restructures data from operational systems so that the warehouse data is relevant, useful, high-quality, accurate, and accessible.]
ETT Processes
ETT Importance: The extraction, transformation, and transportation processes are
fundamental to ensuring that the data resident in the warehouse is:
• Relevant and useful to the business users
• High quality
• Accurate
• Easy to access so that the warehouse is used efficiently and effectively by the
business users
ETT Cost: Building the ETT process is potentially one of the biggest tasks in
building a warehouse; it is complex and time-consuming. In some implementations, it
can take more than half of the total warehouse implementation effort.
Note: Extraction is covered by this lesson; transformation and transportation are
considered in the next two lessons.
Slide: Data Staging Area
• The construction site for the warehouse
• Required by most implementations
• Composed of ODS, flat files, or relational server tables
• Frequently configured as multitier staging
[Diagram: data is extracted from the operational system into the data staging area, transformed there, and then transported (loaded) into the warehouse.]
The Data Staging Area
Ralph Kimball is one of the most widely recognized experts in the field of data
warehousing. Kimball calls the data staging area the construction site for the
warehouse. This is where much of the data transformation and cleansing takes place.
A staging area is a typical requirement of warehouse implementations. It may be an
operational data store environment, a set of flat files, a series of tables in a relational
database server, or proprietary data structures used by data staging tools.
You may employ multitier staging that reconciles data before and after the
transformation process and before data is loaded into the warehouse. As many as three
tiers are possible, from the operational server to the staging area and then to the
warehouse server.
Note: Some ETT tools stage data internally and do not require a separate staging area.
If you are using the Oracle server and in-house developed tools, data is typically
transformed after it is bulk-loaded (using SQL*Loader) into the staging area—the
database tables. PL/SQL is often used to transform the data. You may also use
gateways and replication techniques.
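The load-then-transform pattern described above can be sketched as follows. This is an illustration only: Python's built-in sqlite3 module stands in for the Oracle server, a plain executemany call stands in for a SQL*Loader bulk load, and a single SQL statement stands in for a PL/SQL transformation routine. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the staging/warehouse server
conn.executescript("""
    CREATE TABLE stg_sales (cust_name TEXT, sale_date TEXT, amount TEXT);
    CREATE TABLE wh_sales  (cust_name TEXT, sale_date TEXT, amount REAL);
""")

# Step 1: bulk-load the raw records into the staging table unchanged.
raw_records = [
    (" abc co ", "1999-05-01", "12345.00"),
    ("MCD CO", "1999-05-02", "5678.00"),
    ("gbuk inc", "1999-05-02", "not-a-number"),  # deliberately bad row
]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_records)

# Step 2: transform inside the database and move only clean rows
# into the warehouse table. Rows with a non-numeric amount stay in staging.
conn.execute("""
    INSERT INTO wh_sales (cust_name, sale_date, amount)
    SELECT UPPER(TRIM(cust_name)), sale_date, CAST(amount AS REAL)
    FROM stg_sales
    WHERE amount GLOB '[0-9]*' AND amount NOT GLOB '*[^0-9.]*'
""")
conn.commit()

loaded = conn.execute(
    "SELECT cust_name, amount FROM wh_sales ORDER BY cust_name").fetchall()
staged_count = conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
rejected = staged_count - len(loaded)
print(loaded, rejected)
```

Keeping the raw rows in the staging table means rejected records can be inspected and corrected without going back to the operational system.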
Slide: Remote Staging Model
• Data staging area within the warehouse environment
[Diagram: data moves from the operational system in the operational environment (extract, transform, transport) to the data staging area in the warehouse environment, where it is transformed and then transported (loaded) into the warehouse.]
• Data staging area in its own environment, avoiding negative impact on the warehouse environment
[Diagram: data moves from the operational system in the operational environment (extract, transform, transport) to the data staging area in a separate staging environment, where it is transformed and then transported (loaded) into the warehouse in the warehouse environment.]
Slide: Onsite Staging Model
• Data staging area within the operational environment, possibly affecting the operational system
[Diagram: data is extracted from the operational system into a data staging area inside the operational environment, transformed there, and then transported (loaded) into the warehouse in the warehouse environment.]
Possible Staging Models
Choosing a Model: The model you choose depends on operational and warehouse
requirements, system availability, connectivity bandwidth, gateway access, and the
volume of data to be moved or transformed.
Remote Staging Model: You may choose to extract the data from the operational
environment and transport it into the warehouse environment for transformation
processing. You may optionally execute some transformation processing during the
extraction and transportation from the operational to the warehouse environment. You
would then execute the bulk of transformation processing in the warehouse
environment’s staging area.
On-site Staging Model: Alternatively, you may choose to perform the cleansing,
transformation, and summarization processes locally in the operational environment
and then extract to the staging area. This model may conflict with the day-to-day
working of the operational system. If you choose it, run its processes when the
operational system is idle or less heavily used.
Slide: Extracting Data
• Routines developed to select fields from source
• Various data formats
• Rules, audit trails, error correction facilities
[Diagram: data mapping from the operational databases into the data staging area, where data is transformed before it reaches the warehouse database.]
Slide: Source Systems
• Production
• Archive
• Internal
• External
Extracting Data
Data extraction selects the data fields that pertain to the subject area maintained
by the data warehouse. The data may come from a variety of source systems and may
exist in a variety of formats.
Source Systems: The source data may exist in:
• Production operational systems
• Archives
• Internal files not directly associated with company operational systems, such as
individual spreadsheets and workbooks
• External data from outside the company
Extraction Routines: The routines created for extraction are developed specifically
to account for the variety of systems from which data is taken. The routines contain
data or business rules, audit trails, and error correction facilities. They also take into
account the frequency with which data is to be extracted.
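As a rough illustration of what such a routine might contain, the Python sketch below selects the wanted fields, applies one business rule, corrects one kind of error, and returns a simple audit record. The field names, the "cancelled orders are excluded" rule, and the thousands-separator correction are assumptions made for the example, not rules stated in this course.

```python
from datetime import datetime, timezone

# Hypothetical fields that pertain to the warehouse subject area.
WANTED_FIELDS = ("order_id", "customer", "amount")


def extract_orders(source_records):
    """Select wanted fields, apply rules, and keep an audit trail of the run."""
    extracted, rejects, skipped = [], [], 0
    for position, record in enumerate(source_records, start=1):
        missing = [f for f in WANTED_FIELDS if f not in record]
        if missing:
            rejects.append({"position": position, "reason": f"missing {missing}"})
            continue
        row = {f: record[f] for f in WANTED_FIELDS}
        # Error correction (assumed): strip thousands separators from amounts.
        try:
            row["amount"] = float(str(row["amount"]).replace(",", ""))
        except ValueError:
            rejects.append({"position": position, "reason": "amount is not numeric"})
            continue
        # Business rule (assumed): cancelled orders are not extracted.
        if record.get("status") == "CANCELLED":
            skipped += 1
            continue
        extracted.append(row)
    audit = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "read": len(source_records),
        "extracted": len(extracted),
        "rejected": len(rejects),
        "skipped": skipped,
    }
    return extracted, rejects, audit


records = [
    {"order_id": 1, "customer": "ABC CO", "amount": "12,345.00", "status": "OPEN"},
    {"order_id": 2, "customer": "MCD CO", "amount": "n/a"},
    {"order_id": 3, "customer": "GBUK INC"},
    {"order_id": 4, "customer": "FFR ASSOC", "amount": 5678, "status": "CANCELLED"},
]
extracted, rejects, audit = extract_orders(records)
print(extracted, rejects, audit)
```

The audit record makes it possible to reconcile each run: the number of records read should equal extracted plus rejected plus skipped.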
Slide: Production Data
• Operating system platforms
• Hardware platforms
• File systems
• Database systems and vertical applications
[Diagram: example production sources, including IMS, SAP, DB2, Shared Medical Systems, VSAM, Dun and Bradstreet Financials, NonStop SQL, Oracle, Hogan Financials, Sybase, Oracle Financials, and Rdb.]
Slide: Archive Data
• Historical data
• Useful for analysis over long periods of time
• Useful for first-time load
• May require unique transformations
[Diagram: operational databases and the warehouse database.]
Examining Data Sources
Production Data: Production data may come from a multitude of different sources:
• Operating system platforms
• Hardware platforms
• File systems (flat files)
• Database systems, for example, Oracle, DB2, dBase, Informix, ISAM, NonStop
SQL, Rdb, and TurboImage
• Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun
and Bradstreet
Archive Data: Archive data may be useful to the enterprise in supplying historical
data. Historical data is needed for analysis over long periods of time.
Archive data is not a regular source for the warehouse; for example, it would not be
used for regular data refreshes. However, for the initial implementation of a data
warehouse (and the first-time load), archived data is an important source of
historical data.
You need to consider this carefully when planning the data warehouse. How much
historical data do you have available for the data warehouse? How much effort is
necessary to transform it into an acceptable format?
Archived data may need careful and unique transformations, and clear details of
the changes must be maintained in metadata.
Slide: Internal Data
• Planning, sales, and marketing organization data
• Maintained by:
– Spreadsheets (structured)
– Documents (unstructured)
• Treated like any other source data
[Diagram: planning, marketing, and accounting spreadsheets (sample rows of company names, amounts, and percentages) feeding the warehouse database.]
Internal Data
Internal data may be information prepared by planning, sales, or marketing
organizations that contains data such as budgets, forecasts, or sales quotas. The data
contains figures (numbers) that are used across the enterprise for comparison
purposes. The data is maintained using software packages such as spreadsheets and
word processors and uploaded into the warehouse.
Internal data is treated like any other source system data. It must be transformed,
documented in metadata, and mapped between the source and target databases.
Slide: External Data
• Information from outside the organization
• Issues of frequency, format, and predictability
• Described and tracked using metadata
[Diagram: external sources feeding the warehousing databases, including A.C. Nielsen, IRI, IMS, Walsh America, purchased databases, competitive information, economic forecasts, Dun and Bradstreet, Barron’s, and the Wall Street Journal.]
External Data
External data is important if you want to compare the performance of your business
against others. There are many sources for external data:
• Periodicals and reports
• External syndicated data feeds (some warehouses rely on these as a regular
source)
• Competitive analysis information
• Newspapers
• Purchased marketing, competitive, and customer-related data
• Free data from the Web
Issues: You must consider the following issues with external data:
• Frequency: Unlike internal data, external data arrives with no regular pattern.
Constant monitoring is required to determine when it is available.
• Format: The data may differ in format from internal data, and the granularity
of the data may be an issue. To make it useful to the warehouse, a certain
amount of reformatting may be required. In addition, you may find that external
data, particularly that available on the Web, comes with digital audio data, picture
image data, and digital video data. These present a challenge in storage and speed
of access.
• Predictability: External data is not predictable; it can come from any source at any
time, in any format, on any medium.
Tracked Using Metadata: Metadata (described earlier as descriptive data about
data) plays an invaluable role in the registration, access, and control of external data.
The metadata should provide the warehouse manager with as much information about
the external data as possible, averting the need to examine the data closely.
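One way to picture the registration of external data is a small metadata registry, sketched below in Python. The class names, the recorded fields (source, receipt date, format, medium, granularity), and the sample feed are all hypothetical; a real warehouse would hold this information in its metadata repository.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ExternalFeedMetadata:
    """Descriptive data about one delivery of external data (assumed fields)."""
    source_name: str
    received_on: date
    data_format: str
    medium: str
    granularity: str
    notes: list = field(default_factory=list)


class ExternalDataRegistry:
    """Registers deliveries so they can be located without opening the data."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def by_source(self, source_name):
        return [e for e in self._entries if e.source_name == source_name]

    def delivery_gaps(self, source_name):
        """Days between consecutive deliveries, to expose irregular frequency."""
        dates = sorted(e.received_on for e in self.by_source(source_name))
        return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


registry = ExternalDataRegistry()
feed = "Syndicated market feed"  # hypothetical source name
registry.register(ExternalFeedMetadata(feed, date(1999, 3, 1), "CSV", "tape",
                                       "weekly by region"))
registry.register(ExternalFeedMetadata(feed, date(1999, 4, 12), "CSV", "file transfer",
                                       "weekly by region", ["column order changed"]))
registry.register(ExternalFeedMetadata(feed, date(1999, 4, 20), "fixed-width",
                                       "file transfer", "daily by store"))
gaps = registry.delivery_gaps(feed)
formats = sorted({e.data_format for e in registry.by_source(feed)})
print(gaps, formats)
```

Recording each delivery this way makes the frequency, format, and predictability issues visible from the metadata alone.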
Note: ETT decisions and strategies can evolve over time throughout the life of the
warehouse. It may be prudent to track those strategies and decisions, so that you can
always explain the algorithmic logic or business rules used at different times with
current, recent, or archived data.
Slide: Mapping
• Defines which operational attributes to use
• Defines how to transform the attributes for the warehouse
• Defines where the attributes exist in the warehouse
• Mapping tools are available
[Diagram: metadata maps the fields of File A to the fields of Staging File One.]
File A field   Source value   Staging File One field   Staged value
F1             123            Number                   USA123
F2             Bloggs         Name                     Mr. Bloggs
F3             10/12/56       DOB                      10-Dec-56
Mapping Data
Once you have determined your business subjects for the warehouse, you need to
determine the required attributes from the source systems.
On an attribute-by-attribute basis you must determine how the source data maps into
the data warehouse, and what, if any, transformation rules to apply. This is known as
mapping. There are mapping tools available.
Mapping information should be maintained in metadata that is server (RDBMS)
resident, for ease of access, maintenance, and clarity.
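The attribute-by-attribute mapping described above can be sketched in a few lines of Python. This is only an illustration, not the output of any mapping tool: the field names F1 to F3 and the sample values come from the Mapping slide, while the rule functions, the MAPPING list, and the assumption that the source date is in day/month/year order are hypothetical.

```python
# Month abbreviations spelled out so the output does not depend on locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def add_prefix(prefix):
    """Return a rule that prepends a fixed prefix to the source value."""
    return lambda value: prefix + value


def reformat_date(value):
    """Convert DD/MM/YY (assumed order) to DD-Mon-YY, e.g. 10/12/56 to 10-Dec-56."""
    day, month, year = value.split("/")
    return f"{int(day):02d}-{MONTHS[int(month) - 1]}-{year}"


# Hypothetical mapping metadata: (source field, target attribute, rule).
MAPPING = [
    ("F1", "Number", add_prefix("USA")),
    ("F2", "Name", add_prefix("Mr. ")),
    ("F3", "DOB", reformat_date),
]


def apply_mapping(source_record, mapping=MAPPING):
    """Build one staging record from one source record."""
    return {target: rule(source_record[source])
            for source, target, rule in mapping}


staged = apply_mapping({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
print(staged)
```

Keeping the rules in a data structure rather than in program logic is one way to hold the mapping as metadata, so that it can be stored and reviewed separately from the code that applies it.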
Extraction Techniques
• Programs: C, COBOL, PL/SQL
• Gateways: transparent database access
• In-house development is popular
• Tools
  – High initial cost
  – Ongoing automation
  – Data cleanup
Extraction Techniques
You can extract data from different source systems to the warehouse in different ways:
• Programmatically, using procedural languages such as COBOL, C, C++, or
Procedural SQL
• Using a gateway to access data sources. This method is acceptable only for small
amounts of data; otherwise, the network traffic becomes unacceptably high.
• Using in-house developed tools that:
– Store a physical definition of the source and warehouse data
– Create data dictionaries
– Generate data conversion programs
– Clean and transform the data
– Allow selective retrieval
– Maintain metadata
Note: In-house development is an ongoing activity that may become a resource black
hole. You need local knowledge to support all of the file formats.
• Using a vendor’s data extraction tool
Although it is expensive, an extraction tool:
– Provides ongoing automation of the data extraction process
– Supports data cleanup
More than 50% of companies use their own in-house development teams to develop
data extraction programs. The extraction process may need to read several kinds of
host system media, such as fiche, optical, tape, CD, and disk formats.
Sources and Targets
[Figure: Sources, ODS, Warehouse, and Access (data marts, data analysis, data mining, OLAP)]
Sources and Targets
To summarize, the data for the warehouse is a complex mixture of structured and
unstructured data from different source systems. It all needs to be moved in a clean
and integrated state into the warehouse.
Note: The same process is performed for current data that is to reside in an operational
data store.
Designing Extraction Processes
• Analysis:
  – Sources, technologies
  – Data types, quality, owners
• Design options:
  – Manual, custom, gateway, third-party
  – Replication, full, or delta refresh
• Design issues:
  – Batch window, volumes, data currency
  – Automation, skills needed, resources
Designing Extraction Processes
When designing your extraction processes, consider the analysis issues, the design
options available to you, and the design issues.
Analysis
• The sources and technologies used
• Existing data feeds and redo logs
• Data types (EBCDIC or ASCII)
• Data quality and ownership
• Data volumes
• Operational schedule in the source environment
• Spare processing capacity in the source environment
Design Options
• Manual data entry
• Custom programs
• Gateway technologies
• Replication techniques
• Third-party tools
• Full refresh or delta changes
Design Issues
• Batch window
• Data volumes
• Data currency (how up-to-date the data is to be)
• Degree of automation required
• Technology skills needed
• Time and money available
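The choice between a full refresh and delta changes listed under Design Options can be illustrated with a small Python sketch. It assumes that every source row has a unique key and that the previous extract is still available for comparison. The function name and the sample rows are hypothetical.

```python
def delta_changes(previous, current):
    """Compare two extracts keyed by a unique identifier.

    Returns (inserts, updates, deletes) needed to bring a copy of
    `previous` in line with `current`.
    """
    inserts = {key: row for key, row in current.items() if key not in previous}
    updates = {key: row for key, row in current.items()
               if key in previous and previous[key] != row}
    deletes = sorted(key for key in previous if key not in current)
    return inserts, updates, deletes


# Hypothetical customer extracts from two consecutive nights.
last_night = {101: ("Bloggs", "Tampa"),
              102: ("Smith", "Key West"),
              103: ("Jones", "Miami")}
tonight = {101: ("Bloggs", "Tampa"),
           102: ("Smith", "Orlando"),
           104: ("Brown", "Tampa")}

inserts, updates, deletes = delta_changes(last_night, tonight)
print(len(tonight), "rows to move in a full refresh")
print(len(inserts) + len(updates) + len(deletes), "changes to move in a delta refresh")
```

A full refresh moves every current row regardless of change, which is simpler but consumes more of the batch window. A delta refresh moves less data but depends on reliable keys and on keeping the previous extract.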
Maintaining Extraction Metadata
• Source location, type, structure
• Access method
• Privilege information
• Temporary storage
• Failure procedures
• Validity checks
• Handlers for missing data
Maintaining Extraction Metadata
It is essential to maintain a “metadata trail” of information about all ETT processes,
including the extraction process. This information is important for warehouse
enhancement and performance improvements.
The quality of metadata is critical for every aspect of the warehouse; attention must be
paid to its control, management, and change.
Extraction metadata includes:
• The source location, type, contact, and structure information
• The access method
• The privilege information
• The temporary storage used during extraction
• The extraction failure procedures and validity checks
• Information about how to handle missing data
Extraction metadata also contains information about the frequency of program
execution and maps the source data to the target database.
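One way to picture the items above is as a single record per extraction program. The following Python sketch is an assumption made for illustration only: the class, its field names, and the sample values are hypothetical and do not describe any particular repository format.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionMetadata:
    """One metadata entry per extraction program (field names are illustrative)."""
    source_location: str
    source_type: str
    source_contact: str
    source_structure: list            # field names found in the source
    access_method: str
    privileges: str
    temporary_storage: str
    failure_procedure: str
    validity_checks: list = field(default_factory=list)
    missing_data_rule: str = "reject record"
    frequency: str = "daily"
    target_mapping: dict = field(default_factory=dict)  # source field -> target column

    def unmapped_fields(self):
        """Source fields that have no target column recorded."""
        return [name for name in self.source_structure
                if name not in self.target_mapping]


# Hypothetical entry for the File A example.
entry = ExtractionMetadata(
    source_location="/data/source/file_a.dat",
    source_type="flat file",
    source_contact="operations desk",
    source_structure=["F1", "F2", "F3"],
    access_method="file transfer",
    privileges="read-only",
    temporary_storage="/staging/tmp",
    failure_procedure="log the error and rerun from the last checkpoint",
    validity_checks=["F1 is numeric"],
    target_mapping={"F1": "Number", "F2": "Name"},
)
print(entry.unmapped_fields())
```

A check such as unmapped_fields is a small example of how maintained metadata can reveal a mapping gap before an extraction run rather than during it.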
Possible ETT Failures
• A missing source file
• A system failure
• Inadequate metadata
• Poor mapping information
• Inadequate storage planning
• A source structural change
• No contingency plan
• Inadequate data validation
Possible ETT Failures
ETT processes are vital to the warehouse, and they must succeed. ETT may fail for
any of the following reasons:
• Extraction routines must specify the name and location of the source data. A
missing file may cause the extraction to fail. You must therefore ensure that
exception and error handling routines are included.
• If there is a system or media failure during the process, the process may fail
entirely. You must start again or you may, depending upon system settings, be able
to continue from the point of failure.
• Metadata that inadequately describes the source-to-destination mapping and rules
will cause ETT to fail; for example, when an unexpected value is found.
• Without the space for temporary data, staging data, and sorting operations, ETT
fails.
• Any changes to the source systems that are not documented in metadata will cause
extraction to fail.
• Contingency plans are needed, including mechanisms for correcting or reapplying
processing.
• If data is not validated correctly, the quality of extraction and the success of
transformation cannot be guaranteed. This translates to a data warehouse that may
contain dirty data at the end of the load.
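The first point above, that extraction routines need exception and error handling for a missing source file, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the exception class, the function names, and the error log kept as a list are all hypothetical.

```python
from pathlib import Path


class ExtractionError(Exception):
    """Raised when a source cannot be extracted."""


def extract_lines(source_path):
    """Read every line of a source file, failing clearly if it is missing."""
    path = Path(source_path)
    if not path.is_file():
        raise ExtractionError(f"source file not found: {path}")
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def run_extraction(source_path, error_log):
    """Run one extraction; record the failure instead of stopping the whole job."""
    try:
        return extract_lines(source_path)
    except ExtractionError as exc:
        error_log.append(str(exc))
        return None


errors = []
rows = run_extraction("no_such_source_file.dat", errors)
print(rows, errors)
```

Recording the failure and returning control to the caller leaves room for a contingency step, such as retrying later or alerting an operator.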
Maintaining ETT Quality
• ETT must be:
  – Tested
  – Documented
  – Monitored and reviewed
• Disparate metadata must be coordinated
Maintaining ETT Quality
Any failure of the ETT processes affects data quality, the importance of which cannot
be overstated. Inaccurate data leads to inaccurate analysis results, which lead to
bad business decisions. The result of poor data quality is a lack of confidence in the
system to deliver the solution.
Testing the Process: You should test the proposed ETT techniques to ensure that
volumes can be physically moved within the load window constraints and network
capabilities.
Documenting the Process: You must document the proposed load processes and
communicate them to the operations organization to ensure their agreement and
commitment to this important process.
Monitoring and Reviewing the Process: You should ensure that the load is
constantly monitored and reviewed, and revise metrics where needed. Warehouse data
volumes grow rapidly, and metrics for load and data granularity need regular revision.
The grain of the warehouse affects query capabilities and the warehouse size.
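Whether volumes fit the load window is a matter of simple arithmetic, which the following Python sketch illustrates. All figures (120 GB per load, 20 GB per hour, an 8-hour window, 10% monthly growth) are hypothetical, and the function names are assumptions for the example.

```python
def load_hours(volume_gb, throughput_gb_per_hour):
    """Hours needed to move the given volume at a sustained throughput."""
    return volume_gb / throughput_gb_per_hour


def fits_window(volume_gb, throughput_gb_per_hour, window_hours):
    """True when the load completes inside the load window."""
    return load_hours(volume_gb, throughput_gb_per_hour) <= window_hours


def months_until_overrun(volume_gb, monthly_growth, throughput_gb_per_hour, window_hours):
    """Months of compound growth before the load no longer fits the window."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    months = 0
    while fits_window(volume_gb, throughput_gb_per_hour, window_hours):
        volume_gb *= 1 + monthly_growth
        months += 1
    return months


print(load_hours(120, 20))                     # hours for today's volume
print(fits_window(120, 20, 8))                 # does it fit an 8-hour window?
print(months_until_overrun(120, 0.10, 20, 8))  # months before the window is exceeded
```

Repeating this calculation as part of the regular review shows in advance when growing volumes will exceed the window, before a load actually overruns.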
Extraction Tools
[Figure: Map Source Data to Intermediate File Store. Labels: Sales and Marketing; Customer Name; Varchar; Char; 20; Mapping information; Unique name; JCL files; Update metadata]

Selection Criteria
• Base functionality
• Interface features
• Metadata repository
• Open API
• Metadata access
• Repository utilities
• Input and output processing
• Cleansing, reformatting, and auditing
• References
• Training requirements
Extraction Tools
Extraction tools normally have a GUI front end that allows you to enter the individual
field mappings from source to target systems. The tools normally:
• Generate the required code for the mapping, whether COBOL, C, or any other
language
• Create the necessary job control and scheduling files for the specific platform
• Create and manage changes to the metadata
Selection Criteria
The warehouse uses a host of different tools for extraction, modeling, management,
and access. A tools selection committee must ensure that every tool selected meets
identified requirements. This is usually a rigorous process.
If you decide to buy an extraction tool, consider the following fundamental issues:
• Base functionality
• Interface features and functionality
• The metadata repository and the attributes stored in the repository
• Open API
• Access to metadata by end users
• The effectiveness of the way that the tool presents the information
• Repository utilities, such as scheduling and name and address management
• Data extraction inputs and outputs
• Data cleansing, reformatting, and auditing features
Ask the tool vendor for customer references, so that you can ask those customers to
describe their goals, successes, and failures with the product.
Consider the training required for the extraction tool. The complexity of the available
extraction products varies, as does the ability of your staff. Training may be required
for a few days or weeks.
WTI Partner ETT Tools
• Carleton
• Constellar
• Evolutionary Technologies
• Informatica
• Information Builders
• Oracle EDMS, Toolkits, OADW
• Prism Solutions
• Sagent
• Vality Technology
WTI Partner ETT Tools
WTI Partner                  Product
Carleton Corp                Carleton Passport, Carleton Passport Development Workbench
Constellar                   Constellar Hub
Evolutionary Technologies    ETI Development Workbench, ETI Extract Tool Suite
Informatica Corporation      PowerMart (Designer, Server, and Manager)
Information Builders, Inc.   EDA Copy Manager
Oracle                       EDMS (Extraction and Transformation Template), Toolkits, OADW
Prism Solutions, Inc.        Prism Change Manager, Prism Development Workbench, Prism Warehouse Manager
Sagent                       Data Mart Suites
Vality Technology, Inc.      Integrity Data Re-engineering Tool
The choice of ETT techniques and tools is often driven by the quality of the source
data.
Summary
This lesson discussed the following topics:
• ETT processes are essential and consume a large proportion of warehouse
resources and time
• The extraction process acquires source data
• You may encounter many data sources
• There are many data extraction issues
• ETT tools should be considered
Practice 10-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Specifying true or false to a series of statements
Practice 10-1
Please answer the following questions.
1 The acronym ETT stands for _________________________________________.
2 Name at least four potential sources of production data for the warehouse.
_____________________
_____________________
_____________________
_____________________
3 Name at least five potential sources of external data for the warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
4 Identify whether the following statements are true or false.
Question                                                                              True   False
Archive data is never used in a data warehouse; it is too old.                        ____   ____
External data is one of the easiest types of data to incorporate into the warehouse.  ____   ____
Mapping data is a process whereby you eliminate data inconsistencies.                 ____   ____
Gateways are great mechanisms for transferring large volumes of data into the
warehouse.                                                                            ____   ____
Extraction tools are expensive.                                                       ____   ____
Transforming data occurs only in the staging area.                                    ____   ____
Lesson 11: Transforming Data
Overview
[Figure: course roadmap. Blocks: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Planning Warehouse Storage; Choosing a Computing Architecture; Meeting a Business Need; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Managing the Data Warehouse; Supporting End User Access; Project Management (Methodology, Maintaining Metadata)]
Objectives
After completing this lesson, you should be able to do the following:
• Explain the importance of quality data
• Define the term “transformation”
• Identify transformation issues
• Describe techniques for transforming data
• List tools that can be used to transform data
Overview
The last lesson introduced extraction, transformation, and transportation. The lesson
then focused on extraction issues.
In this lesson, you explore how the transformation process transforms data from
source systems into data suitable for end user query and analysis applications.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview
slide.
Objectives
At the end of this lesson, you should be able to:
• Explain the importance of quality data
• Define the term “transformation”
• Identify transformation issues
• Describe techniques for transforming data
• List tools that can be used to transform data
Importance of Data Quality
[Figure: browser windows showing customer records, labeled Hollywood, Speedy Pizza, and Summit Sports]
Benefits of Quality Data
• Clean data is essential for:
  – Targeting customers
  – Determining buying patterns
  – Identifying householders: private and commercial
  – Matching customers
  – Identifying historical data
• Dirty data must be removed.
Importance of Data Quality
Importance of Quality Data
The importance of quality data in the data warehouse cannot be overemphasized.
Although data anomalies are bound to exist in source systems, allowing them into the
data warehouse leads to inaccurate information, which in turn leads to inaccurate
reports and bad business decisions. The overall result is a lack of confidence in the
system to deliver the solution and a data warehouse that either is not used or requires
substantial improvement and management buy-in.
Quality data is the key to a successful warehouse; it is better to have no data at all than
bad data.
Benefits of Quality Data
All dirty data must be eliminated from the staging area, to ensure you can query the
warehouse to:
• Target the right audience for marketing communication
• Determine that a particular customer buys related products
• Determine that a group of people form a family, each of whom is a potential
customer (householding)
• Identify that an organization is part of a larger enterprise (commercial
householding)
• Identify that a customer is now part of another organization, because of an
acquisition or takeover
• Match customers where there are many different records for the same customer.
(For example, the different components of health care, such as the hospital, the
pharmacy, and the doctor have their own records, or a patient may be treated by
different physicians in the same hospital.)
• Identify the age of data and its history
Note: The terms scrubbing, cleaning, cleansing, and data reengineering are used
interchangeably.
Standards
• Define a quality strategy
• Decide on optimal data-quality level

Quality Improvements
• Consider modifying rules for operational data
• Document the sources
• Create a data stewardship program
• Design the cleanup process carefully
• Initial cleanup and refresh routines may differ
Standards
A data-quality strategy must be defined early on in the development cycle. It is
imperative that you have one in place.
The strategy defines the optimal level of data quality that provides the value required
for the business. For example, there is little point in seeking a low data inconsistency
rate at great expense if the benefit to the business is not tangible.
Improving Operational Data Quality
You may need to consider making changes over time to the operational system in order
to improve the quality of data for the warehouse:
• Some of the validation and integrity rules that are applied to current operational
data may need to be modified or enhanced.
• You may need to document previously undocumented sources, enlist the help of
users who know the business data, and consider creating a “data stewardship”
program.
• You should carefully examine the cleanup processes that you employ in
transforming the extracted data.
• The initial data cleanup routines may be different from the routines applied to
subsequent data refreshes.
Correcting data can be tedious, time-consuming, and expensive. Consider any
modifications in a phased approach rather than fixing all problems in one attempt.
Guidelines
• Operational data should not be used directly in the warehouse
• Operational data must be cleaned for each increment
• Operational data is not simply fixed by modifying applications
Guidelines
Do not assume that because the data in the operational system suits you at the
operational level, it is going to be appropriate, suitable, and of a sufficiently high
quality for the data warehouse.
• The operational system contains no aging information.
• There are many examples of disparity in the data.
• There are many different meanings applied to data.
• Good operational data when merged may become poor data warehouse data.
Do not assume it is acceptable to clean up data after the pilot run of the first increment
or implementation.
• The credibility of the data warehouse or data mart suffers.
• Postimplementation cleanups are more costly and the risk is higher than during the
pilot run.
• The programs needed to handle the multitude of problems are very complex and
would need to be rewritten after cleanup.
Do not assume that fixing applications at the point of entry (operational system) is
going to satisfy quality and clean up the data for the future.
• It is often too time-consuming and costly to continually implement changes at that
level.
• Changes cannot be implemented quickly enough to keep up with constantly
changing operational requirements.
The cost in time and resources in reengineering the existing legacy data may be too
high.
Solutions
• Conventional COBOL, 4GL
• Specialized tools
• Customized conversion process
• Business experts
[Figure: process steps Investigation, Conditioning, Standardization, Integration]

Management
Poor data quality
• Own
• Take responsibility
• Resolve problems
• Data quality manager
Solutions
Use conventional COBOL or 4GL programs or purchase a specialized tool to capture
and eradicate anomalies prior to data load. It is often very difficult to predict all
possible variants.
You may consider designing a process in-house to assure the quality of the data
entering the data warehouse. The process must involve:
• Data investigation: Parsing, lexical analysis, and pattern investigation
• Data conditioning and standardization: Moving the data into fixed fields,
standardizing names and addresses
• Data integration: Building unique keys and integrating the data
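The conditioning and standardization step can be illustrated with a small Python sketch that reduces differently written addresses to one standard form. The sample strings are address variants of the kind shown on the Source Data Anomalies slide. The replacement table and the function name are hypothetical, and a production process would need a far larger rule set.

```python
import re

# Hypothetical replacement table; keys are lowercase variants, values the standard form.
ABBREVIATIONS = {
    "north east": "ne",
    "n.e.": "ne",
    "ne.": "ne",
    "first": "1st",
    "street": "st",
    "st.": "st",
}


def standardize_address(raw):
    """Lowercase the address, apply the replacement table, and drop punctuation."""
    text = raw.lower()
    for variant, standard in ABBREVIATIONS.items():
        # Match the variant only when it is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        text = re.sub(pattern, standard, text)
    text = re.sub(r"[^\w\s]", " ", text)   # remaining punctuation becomes spaces
    return " ".join(text.split())          # collapse repeated spaces


for address in ("100 NE 1st Street, Tampa",
                "100 NE. First St., Tampa",
                "100 N.E. 1st St."):
    print(standardize_address(address))
```

The first two variants reduce to the same string, which makes them candidates for the same integration key. The third lacks a city, so it stays distinct and would need investigation by someone who knows the business data.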
You should involve the business experts in the entire warehouse ETT process.
Management
You must manage the quality of the data, processes, and rules, and put people in place
to manage them. Someone must own, be directly responsible for, and resolve the issue
of poor data quality. This person is often known as the data quality manager.
Note: At some sites there is a person or a group responsible for name and address
management alone.
Transformation
[Figure: Operational system, Extract, Data staging area (Transform: clean up, consolidate, restructure), Transport (Load), Warehouse]
Transformation eliminates operational data anomalies
• Cleans
• Standardizes
• Presents subject-oriented data
Source Data Anomalies
• No unique key
• Data naming and coding anomalies
• Data meaning anomalies between groups
• Spelling and text inconsistencies

CUSNUM     NAME                 ADDRESS
90328575   Oracle Corp          100 NE 1st Street, Tampa
90328575   Oracle               100 NE. First St., Tampa
90238475   Oracle Services      100 North East 1st St., FLA
90233479   Oracle Limited       100 N.E. 1st St.
90233489   Oracle Computing     15 Main Road, Ft. Lauderdale
90234889   Oracle Corp. UK      15 Main Road, Ft. Lauderdale, FLA
90345672   Oracle Corp UK Ltd   181 North Street, Key West, FLA
Copyright  Oracle Corporation, 1999. All rights reserved.
.....................................................................................................................................................
11-12
Data Warehousing Fundamentals
Transformation
.....................................................................................................................................................
Transformation
Transformation involves a number of tasks, the most important being to eliminate all
anomalies. Cleaning also includes eliminating formatting differences, assigning data
types, defining consistent units of measure, and determining encoded structures.
Along with these tasks, another objective is to ensure that the data is presented in a
subject-oriented fashion.
Reasons for Data Anomalies
One of the causes of inconsistencies within internal data is that in-house system
development takes place over many years, often with different software and
development standards for each implementation.
There may be no consistent policy for the software used in the corporate environment.
Systems may be upgraded or changed over the years. Each system may represent data
in different ways.
Source Data Anomalies
Many potential problems can exist with source data:
• No unique key for individual records
• Anomalies within data fields, such as differences between naming and coding
(data type) conventions
• Differences in the interpreted meaning of the data by different user groups
• Spelling errors and other textual inconsistencies (this is particularly relevant in the
area of customer names and addresses)
Transformation Routines
• Cleaning data
• Eliminating inconsistencies
• Adding elements
• Merging data
• Integrating data
• Transforming data before load
Transformation Routines
Because internal data accumulates inconsistencies over many years of in-house
development, transformation routines typically include:
• Cleaning the data, also referred to as data cleansing or scrubbing
• Adding an element of time to the data, if it does not already exist
• Translating the formats of external and purchased data into something meaningful
for the warehouse
• Merging rows or records in files
• Integrating all the data into files and formats to be loaded into the warehouse
Transformation should be performed:
• Before the data is loaded into the warehouse
• In parallel (On larger databases, there is not enough time to perform this process as
a single-threaded process.)
The transformation process should be self-documenting, should generate summary
statistics, and should process exceptions.
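As a minimal sketch of a routine that generates summary statistics and processes exceptions, the following Python function applies a transform to each record, counts the outcomes, and sets failing records aside instead of stopping. The function names, the record layout, and the sample rule are illustrative assumptions and do not describe any particular tool.

```python
from collections import Counter

def run_transform(records, transform):
    """Apply transform to each record; collect statistics and exceptions."""
    stats = Counter()
    loaded, exceptions = [], []
    for record in records:
        stats["read"] += 1
        try:
            loaded.append(transform(record))
            stats["transformed"] += 1
        except ValueError as err:
            # Keep the failing record and the reason so it can be reprocessed.
            exceptions.append((record, str(err)))
            stats["rejected"] += 1
    return loaded, exceptions, stats

def to_cents(record):
    """Hypothetical rule: convert an amount string to integer cents."""
    return {"id": record["id"], "cents": round(float(record["amount"]) * 100)}

# Sample rows are invented for illustration.
rows = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "n/a"}]
loaded, exceptions, stats = run_transform(rows, to_cents)
```

The counters can be written to a log after each run, which gives the process a simple form of self-documentation.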
Transforming Data: Problems and Solutions
Multipart keys
Product code = 12M65431345
[Slide diagram: the key split into country code, sales territory, product number, and salesperson code]
Transforming Data: Problems and Solutions
Multipart Keys Problem
Many older operational systems used record key structures that had a built-in meaning.
To allow for decision support reporting, these keys must be broken down into atomic
values.
In the example, the key contains four atomic values.
Key Code: 12M65431345
Where:
12 is the country code
M is the sales territory
65431 is the product code
345 is the salesperson
Solution The program or tools you use must be capable of identifying on a
character-by-character (or position-by-position) basis the individual values, length of
value, and the meaning of the resulting information. In the example quoted it is
important that the code can extract the M and know that this is a territory code that
identifies “Midwest,” “Manchester,” or “Moscow.”
You may need to build a series of transforms to evaluate the results fully. For example,
these steps may be appropriate:
1 Extract third character position.
2 Evaluate the character against a master lookup table.
3 Evaluate the meaning of M.
4 Store the meaning (Moscow) in a field for insertion into the data warehouse.
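The steps above can be sketched in Python as a position-by-position split followed by a lookup. The field positions follow the example key; the lookup table is hypothetical and holds only the one entry mentioned in the text, and the function and field names are assumptions made for illustration.

```python
# Field positions follow the example key 12M65431345 (0-based slices).
KEY_LAYOUT = {
    "country_code": slice(0, 2),
    "sales_territory": slice(2, 3),
    "product_code": slice(3, 8),
    "salesperson": slice(8, 11),
}

# Hypothetical master lookup table; only the M entry comes from the text.
TERRITORY_LOOKUP = {"M": "Moscow"}

def split_key(key):
    """Break a multipart key into atomic values and decode the territory."""
    if len(key) != 11:
        raise ValueError(f"unexpected key length: {key!r}")
    parts = {name: key[position] for name, position in KEY_LAYOUT.items()}
    code = parts["sales_territory"]
    # None signals a code that is missing from the lookup table.
    parts["sales_territory_name"] = TERRITORY_LOOKUP.get(code)
    return parts

parts = split_key("12M65431345")
```

Keeping the layout in a table rather than in the code makes it easier to document in metadata and to change when a source system uses a different key structure.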
Transforming Data
• Multiple encoding
m, f
1, 0
male, female
• Must pick up erroneous data
mle, female
1, NULL
If field not in ('m', 1, 'male')
then …
else if field is NULL
then …
Transforming Data
• Multiple local standards
• Tools or filters to preprocess
[Slide examples: cm, inches; DD/MM/YY, DD-Mon-YY, MM/DD/YY; 1,000 GBP, USD 600, FF 9,990]
Multiple Encoding Problem
Some systems may represent values in different ways.
For example, some systems may use M to denote “male” and F to denote “female”,
while others use 1 and 0, or even NULL values.
Solution The program must be capable of identifying all the distinct possibilities
and must handle exceptions. For example, your program may treat M, NULL, or Male as
male but fail to take into account spurious and bad entries such as Man, Mle, or N/A.
Your program must be capable of picking up the spurious and bad entries and
changing the values to something appropriate, for example:
1 Select all M, or NULL, or Male.
2 Place all other records into a file for reprocessing.
3 Interpret the records to be reprocessed and determine from other related values in the
record whether the person is male or female.
4 Change the value accordingly, and reprocess the rows, selecting the newly marked records.
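A minimal Python sketch of the first two steps follows. The mapping of source encodings to a single warehouse code is an assumption based on the encodings named in this lesson (m/f, 1/0, male/female). Unlike the example above, this sketch routes NULL (None) to reprocessing rather than assuming a value; that choice is also an assumption.

```python
# Assumed mapping of source encodings to one warehouse code.
GENDER_CODES = {
    "m": "M", "1": "M", "male": "M",
    "f": "F", "0": "F", "female": "F",
}

def standardize_gender(values):
    """Return (clean, reprocess) lists of (position, value) pairs."""
    clean, reprocess = [], []
    for position, value in enumerate(values):
        key = None if value is None else str(value).strip().lower()
        if key in GENDER_CODES:
            clean.append((position, GENDER_CODES[key]))
        else:
            # Spurious, bad, or missing entries are set aside for review.
            reprocess.append((position, value))
    return clean, reprocess

clean, reprocess = standardize_gender(
    ["m", 1, "Male", "mle", None, "female", "N/A"]
)
```

Steps 3 and 4 depend on other values in each record and on business knowledge, so they are not sketched here.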
Multiple Local Standards Problem
This is particularly relevant for values entered in different countries.
For example, some countries use imperial measurements and others metric; currencies
and date formats differ; currency values and character sets may vary; and numeric
precision values may differ.
Currency values are often stored in two formats, a local currency such as sterling,
French francs, or Australian dollars, and a global currency such as U.S. dollars.
Solution Typically, you use tools or filters to preprocess this data into a suitable
format for the database, with the logic needed to interpret and reconstitute a value. You
might employ steps similar to those identified for multiple encoding.
You may consider revising source applications to eliminate these inconsistencies
early on.
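As a sketch of such a preprocessing filter, the following Python functions convert dates and lengths from several local conventions into one format. The source names and the format assigned to each source are assumptions made for illustration; month-name parsing with %b also assumes an English locale.

```python
from datetime import datetime

# Assumed date format used by each (hypothetical) source system.
DATE_FORMATS = {
    "uk": "%d/%m/%y",      # DD/MM/YY
    "us": "%m/%d/%y",      # MM/DD/YY
    "oracle": "%d-%b-%y",  # DD-Mon-YY
}
CM_PER_INCH = 2.54

def to_iso_date(text, source):
    """Parse a date in the source's local format and return YYYY-MM-DD."""
    return datetime.strptime(text, DATE_FORMATS[source]).date().isoformat()

def to_cm(value, unit):
    """Convert a length to centimeters."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")
```

The same string, 01/02/98, becomes 1998-02-01 for the "uk" source and 1998-01-02 for the "us" source, which is why the source of each value must be known before conversion. Currency conversion is omitted because it needs an exchange rate for the relevant date.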
Multiple Files Problem
• Added complexity of multiple source files
• Start simple
[Slide diagram: multiple source files → logic to detect correct source → extracted data]
Transforming Data from Multiple Files
[Slide chart: conflict and integration points (0 to 16) plotted against the number of sources to be incorporated (2 to 6)]
Multiple Files Problem
The source of information may be one file for one condition, and a set of files for
another. Logic (normally procedural) must be in place to detect the right source.
The complexity of integrating data is greatly increased according to the number of
data sources being integrated.
For example, if you are integrating data from two sources, there is a single point of
integration where conflicts must be sorted. Integrate from three sources, and there are
three points of conflict. Four sources provide six conflict points. In general, n
sources give n(n - 1)/2 conflict points, so the problem grows quadratically.
Solution This is a complex problem that requires the use of tools or well-documented transformation mechanisms.
Try not to integrate all the sources in the first instance. Start with two or three and then
enhance the program to incorporate more sources. Build on your learning experiences.
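The conflict counts quoted above (one, three, and six for two, three, and four sources) match the number of distinct pairs of sources. A short Python sketch, assuming every pair of sources is a potential point of conflict:

```python
from math import comb

def conflict_points(sources):
    """Number of distinct pairs among the given number of sources."""
    return comb(sources, 2)

# Conflict points for two to six sources.
counts = {n: conflict_points(n) for n in range(2, 7)}
```

Under this assumption, six sources give 15 pairwise integration points, which illustrates why starting with two or three sources keeps the first iteration manageable.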
Missing Values Problem
Solution
• Ignore
• Wait
• Mark rows
• Extract when time-stamped
If NULL then
field = ‘A’
Duplicate Value Problem
Solution
• SQL self-join techniques
• RDBMS constraint utilities
SELECT …
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT …
FROM table_a, table_b
WHERE table_a.key = table_b.key (+)
[Slide diagram: duplicate ACME Inc records]
Missing Values Problem
Null, missing, and default values are always an issue. NULL values may be valid
entries where NULLs are allowed; otherwise, NULLs indicate missing values.
Solution You must examine each occurrence of the condition to determine validity
and decide whether these occurrences must be transformed; that is, identify whether a
NULL is valid or invalid (missing data). You may choose to:
• Ignore the missing data. If the volume of records is relatively small, it may have
little impact overall.
• Wait to extract the data until you are sure that missing values are entered from the
operational system.
• Mark rows when extracted, so that on the next extract you can select only those
rows not previously extracted. This approach involves the overhead of SELECT and
UPDATE, and if the extracted data forms the basis of summary tables, those tables
must be re-created.
• Extract data only when it is time-stamped as completed, rather than by business
cycle.
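The "mark rows" option can be sketched with Python's built-in sqlite3 module. SQLite stands in for the operational database here, and the table name, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount REAL, extracted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)", [(1, 10.0), (2, 15.0)]
)

def extract_new_rows(conn):
    """Select rows not yet extracted, then mark them as extracted."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE extracted = 0 ORDER BY id"
    ).fetchall()
    # This UPDATE is the overhead mentioned in the text.
    conn.executemany(
        "UPDATE orders SET extracted = 1 WHERE id = ?",
        [(row[0],) for row in rows],
    )
    conn.commit()
    return rows

first = extract_new_rows(conn)
conn.execute("INSERT INTO orders (id, amount) VALUES (3, 11.0)")
second = extract_new_rows(conn)
```

The second call returns only the row added after the first extract.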
Duplicate Value Problem
You need to eliminate duplicate values, which invariably exist. This can be time-consuming, although it is a simple task to perform.
Solution You can use standard SQL self-join techniques or RDBMS constraint
utilities to eliminate duplicates.
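A self-join for finding and removing duplicates can be sketched with Python's built-in sqlite3 module. SQLite stands in for the RDBMS; the table, its columns, and the sample rows are assumptions, and the query matches exact duplicates only, not spelling variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "Widget Co"), (4, "ACME Inc")],
)

# Self-join: a row is a duplicate if another row has the same name
# and a lower id. The lowest id for each name is kept.
duplicates = conn.execute(
    """
    SELECT DISTINCT b.id
    FROM customer a JOIN customer b
      ON a.name = b.name AND a.id < b.id
    ORDER BY b.id
    """
).fetchall()

conn.execute(
    """
    DELETE FROM customer
    WHERE id IN (SELECT b.id
                 FROM customer a JOIN customer b
                   ON a.name = b.name AND a.id < b.id)
    """
)
remaining = conn.execute(
    "SELECT id, name FROM customer ORDER BY id"
).fetchall()
```

A unique constraint on the target table is the complementary approach: it rejects duplicates at load time instead of removing them afterward.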
Element Names Problem
Solution
• CTAS
• SQL*Loader
[Slide diagram: source elements named Client, Customer, Contact, and Name mapped to a single Customer element]
Element Meaning Problem
• Avoid misinterpretation
• Complex solution
• Document meaning in metadata
[Slide diagram: Customer_detail interpreted variously as the customer’s name, all details except name, or all customer details]
Element Names Problem
Individual attributes, columns, or fields may vary in their naming conventions from
one source to another. These variations need to be eliminated to ensure that one naming
convention is applied to the value in the warehouse.
If you are employing independent data marts, then you should ensure that the ETT
solution is mirrored across them, so that if you later employ the data marts
dependently, they all refer to the same object.
Solution You need to obtain agreement from all relevant user groups on renaming
conventions, and rename the elements accordingly. Document the changes in
metadata.
The programs you use determine the solution. For example, if you are using SQL
CREATE TABLE AS SELECT (CTAS), the new column name is used in that statement. If you
use SQL*Loader as an intermediary mechanism prior to load, you create your
destination object with the agreed naming convention applied.
Agreement on the name change and the meaning of the data can become a political
issue between groups and departments in the organization.
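The renaming step in a CTAS statement can be sketched with Python's built-in sqlite3 module, which also supports CREATE TABLE ... AS SELECT. SQLite stands in for the warehouse database, and all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Source table using one group's naming convention (hypothetical).
conn.execute("CREATE TABLE src_contacts (client TEXT, cust_no INTEGER)")
conn.execute("INSERT INTO src_contacts VALUES ('ABC CO', 1)")

# CTAS: the column aliases apply the agreed warehouse names.
conn.execute(
    """
    CREATE TABLE customer AS
    SELECT client AS customer_name, cust_no AS customer_id
    FROM src_contacts
    """
)
# Column 1 of each table_info row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
```

The mapping from old to new names (client to customer_name, and so on) is the part that belongs in metadata.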
Element Meaning Problem
Like the name of an element, the meaning is often interpreted differently by different
user groups. The variations in naming conventions typically drive this
misinterpretation. You need to keep your model independent of naming conventions
that may be popular today, but subject to change.
Solution It is a difficult problem, often political, but you must ensure that the
meaning is clear. By documenting the meaning in metadata you can solve this
problem, especially if the meaning is composed of several elements and algorithms
have been used.
In order to take information from the operational system into the warehouse, you must
know the meaning of the data. This may involve rebuilding the transaction from its
component parts (which are likely in a normalized state). You must know the:
• Business rules
• Processes executed for a type of transaction, such as the tables that are updated
This is a complex task, which may involve merging or separating data components,
extracting values from multipart keys, and much more.
Input Format Problem
[Slide examples: EBCDIC and ASCII encodings; “123-73” and 12373; ACME Co.; ברוכים הבאים; Beer (Pack of 8)]
Referential Integrity Problem
Solution
• SQL anti-join
• Server constraints
• Dedicated tools
Department
10
20
30
40
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
Input Format Problem
Input formats vary considerably.
For example, one entry may accept alphanumeric data, so the format may be “123-73”.
Another entry may accept numeric data only, so the format may be “12373”.
You may also need to convert between ASCII and EBCDIC, or even convert complex
character sets such as Hebrew, Arabic, or Japanese.
Solution First, ensure that you document the original and the resulting formats.
Your program (or tool) must then convert those data types either dynamically or
through a series of transforms into one acceptable format.
You can use Oracle SQL*Loader to perform certain transformations, such as EBCDIC
to ASCII conversions and assigning values to default or NULL values.
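For comparison, the same two conversions can be sketched in Python. The choice of code page cp037 is an assumption; EBCDIC has several code pages, and the correct one depends on the source system. The function names are hypothetical.

```python
def ebcdic_to_text(raw):
    """Decode EBCDIC bytes, assuming code page cp037."""
    return raw.decode("cp037")

def digits_only(value):
    """Reduce a formatted entry such as '123-73' to a number."""
    return int("".join(ch for ch in value if ch.isdigit()))

# The four bytes below are the cp037 encoding of the letters A, C, M, E.
name = ebcdic_to_text(b"\xc1\xc3\xd4\xc5")
code = digits_only("123-73")
```

Recording both the original and the resulting format, as recommended above, makes it possible to trace a converted value back to its source representation.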
Referential Integrity Problem
If the constraints at the application or database level have in the past been less than
accurate, child and parent record relationships can suffer; orphaned records can exist.
You must understand data relationships built into legacy systems. The biggest problem
encountered here is that they are often undocumented. You must gain the support of
users and technicians to help you with analysis and documentation of the source data.
Solution This is a simple cleaning task, but it is time-consuming and requires
business experience to resolve the inconsistencies. You can use SQL anti-join query
techniques, server constraint utilities, or dedicated tools to eliminate these
inconsistencies.
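An anti-join that finds orphaned child records can be sketched with Python's built-in sqlite3 module. The department and employee values are those shown on the slide for this problem; SQLite stands in for the RDBMS, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)]
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1289, "Jones", 20),
     (1234, "Doe", 50), (6786, "Harris", 60)],
)

# Anti-join: employees whose department has no matching parent row.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM emp e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    ORDER BY e.emp_id
    """
).fetchall()
```

The query only identifies the orphans. Deciding whether to correct the department, create the missing parent, or reject the row still requires business knowledge.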
Name and Address Problem
• No unique key
• Missing values
• Personal and commercial names mixed
• Different addresses for same member
• Different names and spelling for same member
• Many names on one line
• One name on two lines
Database 1
NAME                   LOCATION
DIANNE ZIEFELD         N100
HARRY H. ENFIELD       D589
FRED AND SARA MULLEN   M300
Database 2
NAME                   LOCATION
ZIEFLED, DIANNE        100
ENFIELD, HARRY H       589
MULLEN, SARA AND FRED  300
Name and Address Problem
• Single-field format
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
• Multiple-field format
Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565
Name and Address Problem
One of the largest areas of concern, with regard to data quality, is how name and
address information is held, and how to transform it. Name and address information
has historically suffered from a lack of legacy standards. This information has been
stored in many different formats, sometimes dependent upon the software or even the
data processing center used.
Usual Inconsistencies Some of the following data inconsistencies may appear:
• No unique key
• Missing data values (NULLs)
• Personal and commercial names mixed
• Different addresses for same member
• Different names and spelling for same member
• Many names on one line
• One name on two lines
• The data may be in a single field of no fixed format:
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
• Each component of an address may be in a specific field:
Mr. J. Smith
100 Main St.
Bigtown
County Luth
23565
Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques
Name and Address Problem (continued)
Solution Name and address cleanup involves a series of complex processes that
decompose and reassemble data. It can be broken down into a number of steps; those
identified here represent just one example.
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
Steps to Clean and Organize
1 Break the record down into atomic values, each of which has a description.
Description    Value
Title          Mr.
First Initial  J
Last Name      Smith
House Number   100
....           ....
2 Ensure that all elements appear in a standard format, so that St. in this example
becomes Street. This element needs to be recoded, as do other similar elements,
such as Rd and Cres.
3 Verify the accuracy of standard elements using data from external sources.
– Is Bigtown actually associated with this postal code?
– Is Bigtown in County Luth?
– Is County Luth associated with this postal code?
4 Check whether there are any other customers with the name Smith. If there are,
verify whether the addresses are identical; if they are not, then one is probably the
current address and others are old addresses. You probably have to refer to external
data to check this. Mark records with notes such as previous and current.
5 Identify whether there is more than one customer record for any given address.
You may find a Smith, and a Doe, and a Jones all at 100 Main Street. Are they all
resident in the same house or apartment?
6 Document the results of these steps in metadata.
You can see from the complexity of even this simple example that this cleanup
requires sophisticated software techniques, tools, or expert knowledge in coding the
algorithms required to perform each step.
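Steps 1 and 2 can be sketched in Python for the single example record used above. This is a toy that assumes exactly five comma-separated parts and a name of the form title, initial, last name; real name and address data needs far more robust parsing. The function name, the field names, and the expansions for Rd and Cres are assumptions.

```python
# Assumed expansions for the abbreviations mentioned in step 2.
ABBREVIATIONS = {"St.": "Street", "Rd": "Road", "Cres": "Crescent"}

def parse_record(line):
    """Split one single-field record into atomic, standardized values."""
    # Raises ValueError if the record does not have exactly five parts.
    name, street, town, county, code = [p.strip() for p in line.split(",")]
    title, initial, last_name = name.split()
    number, street_name = street.split(" ", 1)
    words = [ABBREVIATIONS.get(w, w) for w in street_name.split()]
    return {
        "title": title,
        "first_initial": initial.rstrip("."),
        "last_name": last_name,
        "house_number": number,
        "street": " ".join(words),
        "town": town,
        "county": county,
        "postal_code": code,
    }

record = parse_record(
    "Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565"
)
```

Steps 3 to 5 depend on external reference data and on matching across records, so they are not covered by this sketch.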
Merging Data
• Operational transactions do not usually map one-to-one with warehouse data
• Data for the warehouse is merged to provide information for analysis
Pizza sales/returns by day, hour, seconds
Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
Return  1/2/98  12:00:03  Anchovy Pizza  - $12.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Merging Data
[Slide diagram: the five operational transactions above are merged for the warehouse; the Anchovy Pizza sale and its return drop out, leaving the three rows below]
Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Transformation Techniques
Merging Data
An operational transaction does not usually have a one-to-one mapping with data in
the warehouse, even if the data in the warehouse is maintained at the transaction level.
For example, consider a sales transaction in a store. The logical transaction comprises
a number of components such as date of sale, charge amount, number of items,
discount amount, and payment method. The transaction may even be a return.
A customer purchase and a customer return are very different types of sales
transactions, and different business rules must apply. For each different transaction a
different process occurs. A purchase depletes inventory and a return adds stock back
into inventory.
As a result, the data you keep in the warehouse is held purely for reporting
purposes, and these transactions are merged into data that is useful for that
purpose. The data does not, in the end, map strictly to sales or returns.
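One possible merge rule can be sketched in Python using the pizza transactions from the slide: each return cancels one sale of the same product and amount, so neither reaches the warehouse. The rule and the function name are assumptions; a real warehouse might instead keep returns as negative facts.

```python
from collections import Counter

transactions = [
    ("Sale", "12:00:01", "Ham Pizza", 10.00),
    ("Sale", "12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "12:00:03", "Anchovy Pizza", -12.00),
    ("Sale", "12:00:04", "Sausage Pizza", 11.00),
]

def merge_for_warehouse(rows):
    """Drop each return together with one matching sale."""
    # Count returns by (product, positive amount).
    returned = Counter(
        (product, -amount)
        for kind, _, product, amount in rows
        if kind == "Return"
    )
    merged = []
    for kind, time, product, amount in rows:
        if kind == "Return":
            continue
        if returned[(product, amount)] > 0:
            # This sale is cancelled by a return.
            returned[(product, amount)] -= 1
            continue
        merged.append((kind, time, product, amount))
    return merged

merged = merge_for_warehouse(transactions)
```

The result holds the Ham, Cheese, and Sausage Pizza sales, matching the merged rows on the slide.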
Adding a Date Stamp
• Enables time analysis
• Label loaded data with a date stamp
• Add time to fact and dimension data
Adding a Date Stamp
Product Table: Product_id, Time_key, Product_desc
Store Table: Store_id, District_id, Time_key
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Time Table: Week_id, Period_id, Year_id, Time_key
Item Table: Item_id, Dept_id, Time_key
Adding a Date Stamp
Time is important within the data warehouse. You have already looked at the time
dimension, which is always created in the warehouse in order to provide reporting by
time periods.
Extracted source data probably does not contain time information, because
operational systems do not typically time-stamp their data (unless, of course, they too
maintain history, or time is a critical component). More likely, the record in the
operational system has a date value associated with it, such as Order_date, Ship_date, or
Call_date.
Therefore it is important to consider how you are going to add a time element to your
warehouse data. This is particularly important for two areas of the warehouse:
• Fact tables that hold vast amounts of data used to analyze the business according to
time periods
• Dimension data containing criteria by which you perform the analysis
You need to consider how to manage time for both of these areas, in slightly different
ways.
Adding a Date Stamp
• Fact table
– Add triggers
– Recode applications
– Compare tables
• Dimension table
• Time representation
– Point in time
– Time span
Adding a Date Stamp (continued)
Fact Table Data Imagine that you need to add the next set of records from the
source systems to your fact table. You need to determine which records are to be
moved into the fact table. You have added data for March 1998. Now you need to add
data for April 1998. You need to find a mechanism to stamp records so that you pick
up only April 1998 records for the next refresh.
You might choose from a number of techniques:
• Code application or database triggers at the operational level to time-stamp data,
which can then be extracted using date selection criteria.
• Perform a comparison of tables, original and new, to identify differences.
• Maintain a table containing copies of changed records to be loaded.
You must decide which are the best techniques for you to use according to your current
system implementations. These are discussed in greater detail later in the course.
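The table comparison technique can be sketched in Python by comparing two snapshots keyed by record identifier. The snapshot contents are invented for illustration, and the function name is an assumption.

```python
def compare_snapshots(original, new):
    """Return (inserted, changed, deleted) rows between two snapshots."""
    inserted = {k: new[k] for k in new.keys() - original.keys()}
    deleted = {k: original[k] for k in original.keys() - new.keys()}
    changed = {
        k: new[k]
        for k in new.keys() & original.keys()
        if new[k] != original[k]
    }
    return inserted, changed, deleted

# Hypothetical snapshots taken at the March and April extracts.
march = {1: ("Ham Pizza", 10.00), 2: ("Cheese Pizza", 15.00)}
april = {
    1: ("Ham Pizza", 10.50),
    2: ("Cheese Pizza", 15.00),
    3: ("Sausage Pizza", 11.00),
}
inserted, changed, deleted = compare_snapshots(march, april)
```

Only the inserted and changed rows need to be loaded at the next refresh. The cost is that both snapshots must be held and compared in full, which is why triggers or time stamps are often preferred for large tables.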
Dimension Table Data Dimensions also change, and there are many different
techniques you can employ to trap changes. Some of these were identified earlier for
fact tables.
Time Representation The time may be represented as:
• A single point-in-time date
• A date range (start and end date)
The time element must either be available in the data before loading into the
warehouse, or added when loading the data.
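The two time representations, and the option of adding the time element during the load, can be sketched in Python as follows. The class names, the field names, and the convention that an open-ended range has no end date are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PointInTime:
    """A single point-in-time date."""
    as_of: date

@dataclass
class DateRange:
    """A start and end date; end=None means the row is still current."""
    start: date
    end: Optional[date] = None

    def covers(self, day):
        return self.start <= day and (self.end is None or day <= self.end)

def stamp(rows, load_date):
    """Add a load date to each row that arrives without a time element."""
    return [dict(row, load_date=load_date) for row in rows]

stamped = stamp([{"id": 1}], date(1998, 4, 30))
march = DateRange(date(1998, 3, 1), date(1998, 3, 31))
```

A point-in-time stamp suits fact rows, while a date range is a common way to record the period during which a dimension value was valid.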
.....................................................................................................................................................
Data Warehousing Fundamentals
11-37
Lesson 11: Transforming Data
.....................................................................................................................................................
Adding Keys to Data
#1    Sale    1/2/98  12:00:01  Ham Pizza        $10.00
#2    Sale    1/2/98  12:00:02  Cheese Pizza     $15.00
#3    Sale    1/2/98  12:00:02  Anchovy Pizza    $12.00
#4    Return  1/2/98  12:00:03  Anchovy Pizza  - $12.00
#5    Sale    1/2/98  12:00:04  Sausage Pizza    $11.00
Data values or artificial keys
#dw1  Sale    1/2/98  12:00:01  Ham Pizza        $10.00
#dw2  Sale    1/2/98  12:00:02  Cheese Pizza     $15.00
#dw3  Sale    1/2/98  12:00:04  Sausage Pizza    $11.00
Adding Keys to Data
You are moving the data from one structure, with its keys defining relationships, into
another that is totally different and must also have keys defining relationships.
The transformation of this data also includes adding keys (generalized or artificial) or
creating keys from existing data values.
Note: Creating keys is discussed in more detail later in the course.
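The slide example can be sketched in a few lines. This is one possible reading of the slide, not a definitive rule: it assumes that a return cancels the matching sale (same date, product, and amount) before artificial keys are assigned. The function names drop_returned_sales and add_warehouse_keys are hypothetical.

```python
from itertools import count

# Source records from the slide; the first field is the operational key.
source_rows = [
    ("#1", "Sale", "1/2/98", "12:00:01", "Ham Pizza", 10.00),
    ("#2", "Sale", "1/2/98", "12:00:02", "Cheese Pizza", 15.00),
    ("#3", "Sale", "1/2/98", "12:00:02", "Anchovy Pizza", 12.00),
    ("#4", "Return", "1/2/98", "12:00:03", "Anchovy Pizza", -12.00),
    ("#5", "Sale", "1/2/98", "12:00:04", "Sausage Pizza", 11.00),
]


def drop_returned_sales(rows):
    """Remove each Return together with one matching earlier Sale (assumed rule)."""
    kept = []
    for row in rows:
        _, kind, date, _, product, amount = row
        if kind != "Return":
            kept.append(row)
            continue
        for index, earlier in enumerate(kept):
            if (earlier[1], earlier[2], earlier[4], earlier[5]) == (
                "Sale", date, product, -amount
            ):
                del kept[index]
                break
        else:
            kept.append(row)  # no matching sale found: keep the return
    return kept


def add_warehouse_keys(rows, prefix="#dw"):
    """Replace the operational key with a sequential artificial key."""
    sequence = count(1)
    return [(f"{prefix}{next(sequence)}",) + row[1:] for row in rows]


warehouse_rows = add_warehouse_keys(drop_returned_sales(source_rows))
for row in warehouse_rows:
    print(row)
```

The artificial key carries no business meaning, so the warehouse rows stay stable even if the operational keys are later reused or restructured.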
Summarizing Data
• During extraction on staging area
• After loading onto the warehouse server
(Figure: operational databases, staging area, warehouse database)
Creating Summary Data
Creating summary data is essential for a data warehouse to perform well. Here it is
classified under transformation only because you are changing the way the data exists
in the source system into something else for the data warehouse.
In reality, the summary data is usually created on the warehouse server after
transformation.
Summarizing Data You can summarize the data:
• At the time of extraction in batch routines.
This reduces the amount of work performed by the data warehouse server, as all
the effort is concentrated on the source systems. However, summarizing at this
time increases:
– The complexity and time taken to perform the extract
– The number of files created
– The number of load routines
– The complexity of the scheduling process
• After the data is loaded into the warehouse database.
The process queries the fact data, summarizes it, and places it into the requisite
summary fact table. This method reduces the complexity and time taken for the
extract tasks. However, it places all the CPU and I/O intensive work on the
warehouse server, thus increasing the time that the warehouse is unavailable to the
users.
You should weigh the benefits of each method and determine your strategy according
to your requirements and resources.
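The second option, summarizing after the data is loaded, can be sketched as a query that reads the fact table and writes a summary fact table. The table and column names (sales_fact, sales_month_summary) are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("1998-04-01", "Ham Pizza", 10.0),
        ("1998-04-01", "Cheese Pizza", 15.0),
        ("1998-04-17", "Ham Pizza", 10.0),
        ("1998-05-02", "Ham Pizza", 10.0),
    ],
)

# Query the fact data, summarize it by month and product, and place the
# result in a summary fact table.
conn.execute(
    """
    CREATE TABLE sales_month_summary AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           product,
           COUNT(*)    AS sale_count,
           SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product
    """
)

summary_rows = conn.execute(
    "SELECT sale_month, product, sale_count, total_amount "
    "FROM sales_month_summary ORDER BY sale_month, product"
).fetchall()
print(summary_rows)
```

All of the aggregation work here runs on the warehouse side, which is the cost the text describes: the extract stays simple, but the server is busy while the summary is built.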
Maintaining Transformation Metadata
Contains transformation rules, algorithms, and routines
(Figure labels: Sources, Stage, Rules, Extract, Transform, Publish, Load, Query)
• Key restructuring
• Coding differences
• Multiple sources
• Exception rules
• Format differences
• Referential integrity fixes
• Aggregated data
Maintaining Transformation Metadata
As with the extraction process, metadata must be maintained for the transformation
process. This metadata includes:
• Information on how to perform key restructuring
• Logic to eliminate different coding methods and data values, parsing rules
• Logic to detect multiple source files
• Logic and exception rules to handle NULL, negative values, and default values
and to eliminate and consolidate duplicate values
• Element renaming conventions
• Granularity conversions, input or language formats, conversion algorithms, and
data standardization rules
• Referential integrity fixes
• Logic and program names used to create summary data
• Transformation frequency, program name, location, failure procedures, and
validation
• Temporary extraction storage location, name, and source contact
The metadata also contains information about the frequency of program execution.
Data repair usually involves using simple algorithms or more complex artificial
intelligence programs to correct data.
Note: There is a lesson dedicated to metadata later in the course.
Data Ownership and Responsibilities
• Operational and application development teams
• Data warehouse development team
• Business benefit gained with a one-team approach
Data Ownership and Responsibilities
Ownership The data extracted from the source systems is often under the control
and ownership of application development teams who have been working with the
operational data since its inception. The loading of the data into the warehouse is
usually under the control of the data warehousing development team.
This raises the question of who is responsible for the transformation of the data: the
process between extracting the data and loading it into the warehouse.
Working as One Team These two teams must work together: those responsible for
operational data and those responsible for warehouse data. Doing so brings all the
required knowledge together and produces the best solution. Working together enhances
understanding, knowledge, and teamwork, and levels roles within the groups.
• The operational team may be critical to ensuring the success of the data extraction
and providing the data warehouse team with extract files in requisite formats (for
example C, COBOL, PL/SQL).
• The data warehouse team can then take on the task of making sure the extracted
data is accurate and of sufficiently high quality for the warehouse.
If there is a need to reconsider how the operational data is entered (stored at the
database level), to improve the ease of creating extracts and the quality of extract data,
then teamwork and understanding of each other’s areas are critical.
Transformation Timing and Location
• Transformation is performed:
– Before load
– In parallel
• May be initiated at different points
(Figure: transformation points labeled Unlikely, Probable, and Possible)
Transformation Points
You need to consider carefully when and where you perform transformation. You must
perform transformation before the data is loaded into the warehouse, and in parallel:
on larger databases, there is not enough time to perform this process as a
single-threaded process.
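A minimal sketch of transforming in parallel rather than in a single thread follows. It assumes the key formats shown on the slide (12M65431, 12-m-65421, and the quoted form) and restructures each into three parts. The names restructure_key, transform_chunk, and transform_in_parallel are hypothetical. A thread pool is used only to keep the sketch short; for CPU-bound work in Python, a process pool would be the more usual choice.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Two digits, one letter, five digits, with optional hyphens between parts.
CODE_PATTERN = re.compile(r"^(\d{2})-?([A-Za-z])-?(\d{5})$")


def restructure_key(raw):
    """Split a source key into (prefix, letter, number) in one standard form."""
    cleaned = raw.strip().strip('"\u201c\u201d')  # drop plain or curly quotes
    match = CODE_PATTERN.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognized key format: {raw!r}")
    prefix, letter, number = match.groups()
    return prefix, letter.upper(), number


def transform_chunk(chunk):
    """Transform one independent slice of the extract."""
    return [restructure_key(raw) for raw in chunk]


def transform_in_parallel(chunks, max_workers=4):
    """Transform all chunks concurrently, keeping the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(transform_chunk, chunks)  # map preserves chunk order
        return [row for chunk_result in results for row in chunk_result]


chunks = [["12M65431", "12-m-65421"], ['"12m65421"', "12M65431"]]
transformed = transform_in_parallel(chunks)
print(transformed)
```

Splitting the extract into independent chunks is what makes the parallel run possible; rows that depend on one another would have to stay in the same chunk.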
Consider the different places and points in time where transformation may take place.
On the Operational Platform This approach transforms the data on the operational
platform, where the source data resides.
The negative impact of this approach is that the transformation operation conflicts
with the day-to-day working of the operational system.
If it is chosen, the process should be executed when the operational system is idle or
less utilized. The impact of this approach is so great that it is very unlikely to be
employed.
In a Separate Staging Area This approach transforms data on a separate computing
environment, the staging area, where summary data may also be created.
This is a common approach because it does not affect either the operational or
warehouse environment. Cleaning, merging, and removal of anomalies are handled in
the staging area, and summary creation may take place:
• On the staging server
• On the warehouse server
On the Warehouse Server You may consider performing transformations on the
warehouse server itself. However, this may affect the effectiveness of the server for
query access.
It is more likely that you transform away from the warehouse server.
Choosing a Transformation Point
• Workload
• Environment impact
• CPU use
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs
Monitoring and Tracking
Transforms should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions
(Figure: numbered steps 1 to 5 with the values 1,200; 1,400; 100; 6,001; and 20,890)
Choosing a Transformation Point
The approach you choose depends upon operational requirements. You must balance
many different factors in order to determine the best solution. Consider:
• The actual workload (time to complete) of the transformations needed to provide
the data for the warehouse
• The physical impact on each of the environments you might choose. (This is
particularly relevant if you choose to use the operational platform.)
• The available CPU and disk space (for temporary and intermediate data and file
store) on each environment
• The available network bandwidth between environments, which affects transfer
volumes
• Whether the environment is capable of working in a parallel manner
• The load window time constraints
• The information needs of the business user. (When do they need this data? How
often do refreshes occur?)
Monitoring and Tracking
The transformations should be self-documenting, should generate summary statistics,
and should be able to process exceptions.
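The three requirements can be sketched with a small wrapper that runs a transformation, counts what happened, and records rejected records instead of stopping. The names TransformStats, run_transform, and to_cents are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class TransformStats:
    """Summary statistics for one transformation run."""
    read: int = 0
    written: int = 0
    rejected: int = 0
    errors: list = field(default_factory=list)


def to_cents(amount_text):
    """Convert a text amount such as '$10.00' to integer cents."""
    return round(float(amount_text.replace("$", "").replace(",", "")) * 100)


def run_transform(records, transform):
    """Apply transform to every record, tracking counts and exceptions."""
    stats = TransformStats()
    output = []
    for position, record in enumerate(records, start=1):
        stats.read += 1
        try:
            output.append(transform(record))
            stats.written += 1
        except (ValueError, AttributeError) as exc:
            # Handle the exception: log the record and continue the run.
            stats.rejected += 1
            stats.errors.append((position, record, str(exc)))
    return output, stats


loaded, stats = run_transform(
    ["$10.00", "$15.00", "n/a", None, "$1,200.00"], to_cents
)
print(loaded)
print(stats.read, stats.written, stats.rejected)
```

The docstrings carry the self-documentation, the counters give the summary statistics, and the error list keeps enough detail (position, record, message) to repair and reprocess the rejected records later.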
Designing Transformation Processes
• Analysis:
– Sources and target mappings, business rules
– Key users, metadata, grain
• Design options: PL/SQL, replication, custom, third-party tools
• Design issues:
– Performance
– Size of the staging area
– Exception handling, integrity maintenance
Designing Transformation Processes
When designing your transformation processes, consider the analysis issues, the
design options available to you, and the design issues.
Analysis
• Source and target mappings
• Business rules
• Key users
• Metadata
• Granularity of the fact data and summaries
Design Options
• PL/SQL
• Replication
• Custom 3GL programs
• Third-party tools
Design Issues
• Performance and throughput
• Sizing the staging areas to hold the data to be loaded into the warehouse
• Exception handling
• Integrity maintenance
Transformation Tools
• Purchased
• SQL*Loader
• In-house developed
Transformation Tools
Many of the purchased transformation tools perform extraction as well. The choice of
transformation tool may already have been decided when you chose the extraction
tool. However, transformation can be performed by:
• Tools purchased from specialized vendors, both third-party and Oracle
• SQL*Loader. This is an Oracle product that is commonly used to transport large
volumes of data into the warehouse tables. It can also provide you with simple data
transformations, such as multiple records becoming a single record, or conversely
a single record at source becoming multiple records for the data warehouse.
• In-house developed programs and procedures using 3GL products such as C, C++,
COBOL, or 4GL products such as SQL and PL/SQL. The DECODE SQL function
can be used to test a value and change it to another value. For example, change
“M” and “F” to Male and Female.
DECODE is fast, because it is a SQL set processing function and takes advantage
of parallel processing. You should be aware that PL/SQL does not take advantage
of parallel processing capabilities and is slower than DECODE because it
processes row by row.
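The code translation described above can be sketched as a single set-based statement. In Oracle SQL the expression would be written with DECODE, as shown in the comment. The runnable version below uses SQLite from Python, which has no DECODE, so the equivalent CASE expression stands in for it. The customer_stage table and the 'Unknown' default are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stage (customer_id INTEGER, gender_code TEXT)")
conn.executemany(
    "INSERT INTO customer_stage VALUES (?, ?)",
    [(1, "M"), (2, "F"), (3, "M"), (4, None)],
)

# Oracle form (not run here):
#   SELECT customer_id,
#          DECODE(gender_code, 'M', 'Male', 'F', 'Female', 'Unknown')
#   FROM customer_stage;
# The CASE expression below is the portable equivalent for this data.
rows = conn.execute(
    """
    SELECT customer_id,
           CASE gender_code
               WHEN 'M' THEN 'Male'
               WHEN 'F' THEN 'Female'
               ELSE 'Unknown'
           END AS gender
    FROM customer_stage
    ORDER BY customer_id
    """
).fetchall()
print(rows)
```

Either form translates the whole column in one statement, which is the set-processing behavior the text contrasts with row-by-row procedural code.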
Data Management, Quality, and Auditing Tools
• Data management:
– Innovative Systems
– Postalsoft
– Vality Technology
• Data quality and auditing:
– Innovative Systems
– Vality Technology
Data Management, Quality and Auditing Tools
Management Tools
WTI Partner                 Product
Innovative Systems, Inc.    Innovative Warehouse
Postalsoft, Inc.            Address Correction and Encoding (ACE)
Vality Technology, Inc.     Integrity Data Re-engineering Tool
Quality and Auditing Tools
WTI Partner                 Product
Innovative Systems, Inc.    ISI Analyzer System
Vality Technology, Inc.     Integrity Data Re-engineering Tool
Summary
This lesson discussed the following topics:
• Importance of data quality
• Transformation process
• Data transformation issues
• Data anomalies
• Name and address management
• Tools
Summary
This lesson addressed the following topics:
• The importance of data quality in the warehouse
• The transformation process
• Transformation issues
• Anomalies that may exist in legacy systems
• Name and address management
• Tools available for extraction, transformation, and data quality
Practice 11-1 Overview
This practice covers the following topics:
• Answering a series of short questions
• Specifying true or false to a series of statements
Practice 11-1
1 Dirty data must be eliminated for the data warehouse. Name three alternative and
common terms used to describe the process of eliminating anomalies in data.
_____________________
_____________________
_____________________
2 Name at least five problems associated with source data that must be eliminated
for the data warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
3 Identify whether the following statements are true or false.
Question                                                        True    False
It is considered impractical to eliminate data anomalies
after the pilot run.                                            ____    ____
You need to consider adding time keys to warehouse data.        ____    ____
Transformation can be performed before or after data is
loaded into the warehouse.                                      ____    ____
Lesson 12: Transportation: Loading Warehouse Data
Overview
(Figure: course map with the blocks Defining DW Concepts & Terminology; Planning for a
Successful Warehouse; Planning Warehouse Storage; Choosing a Computing Architecture;
Meeting a Business Need; Modeling the Data Warehouse; ETT (Building the Warehouse);
Analyzing User Query Needs; Managing the Data Warehouse; Supporting End User Access;
Project Management (Methodology, Maintaining Metadata))
Objectives
After completing this lesson, you should be able to do the following:
• Explain key concepts in transporting data into the warehouse
• Outline how to build the transportation process for first-time load
• Identify transportation techniques
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and
scheduling processes
Overview
In the last two lessons, you examined extraction and transformation issues.
In this lesson, you examine how the extracted and transformed data is transported into
the warehouse as the first-time loading of data.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview
slide on the facing page.
Objectives
At the end of this lesson, you should be able to:
• Explain key concepts in transporting data into the warehouse
• Outline how to build the transportation process for the first-time load
• Identify transportation techniques
• Identify the tasks that take place after data is loaded
• Explain the issues involved in designing the transportation, loading, and
scheduling processes
Transporting Data into the Warehouse
• Loading moves the data into the warehouse
• Loading can be time-consuming:
– Consider the load window.
– Schedule the task; automate all processes.
• Initial load moves large volumes
• Subsequent refresh moves smaller volumes
• Business determines the cycle
(Figure labels: Operational System, Extract, Data Staging Area, Transform,
Transport (load), Warehouse)
Transporting Data into the Warehouse
Transportation Tasks
The transportation process moves data from source data stores or an intermediate
staging area and loads it into the target warehouse database in the target system server.
This process comprises a series of actions, such as moving the data and loading data
into tables. There may also be some processing of objects after the load, often referred
to as postload processing.
Moving and Loading Data
Moving and loading the data can be a time-consuming task, depending upon the volumes
of data, the hardware, the connectivity setup, and whether parallel operations are in
place. The time period within which the warehouse system can perform the load is
called the load window.
Loading should be scheduled and prioritized. You should also ensure that the loading
is automated as much as possible.
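A rough way to reason about the load window is to estimate the load time from the data volume and an assumed load rate, then check whether the load finishes before the window closes. The sketch below does only that arithmetic. The row counts, the rate of 500 rows per second, and the function names are all hypothetical figures chosen for illustration.

```python
from datetime import datetime, timedelta


def estimated_load_time(row_count, rows_per_second):
    """Estimate how long a load takes at a constant, assumed load rate."""
    return timedelta(seconds=row_count / rows_per_second)


def fits_load_window(row_count, rows_per_second, window_start, window_end):
    """Return True if the load is expected to finish before the window closes."""
    finish = window_start + estimated_load_time(row_count, rows_per_second)
    return finish <= window_end


# A six-hour overnight window (hypothetical).
window_start = datetime(1998, 5, 1, 22, 0)
window_end = datetime(1998, 5, 2, 4, 0)

# A refresh of 2 million rows takes 4,000 seconds at 500 rows per second.
refresh_ok = fits_load_window(2_000_000, 500, window_start, window_end)

# A first-time load of 120 million rows takes 240,000 seconds, almost three days.
first_load_ok = fits_load_window(120_000_000, 500, window_start, window_end)

print(refresh_ok, first_load_ok)
```

A real estimate would also allow for transfer time, index builds, and other postload processing, so treat a calculation like this as a first check only.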
Types of Data Load
There is a single first-time load that moves large volumes of data when the warehouse
is implemented. The first-time load is followed by regular refreshes of the warehouse
with smaller volumes of data, the grain and frequency of which are determined by the
business user requirements.
Extract Processing Environment
(Figure: operational databases with snapshots at T1, T2, and T3)
• After each time interval, build a new database
• Run queries
Warehouse Processing Environment
(Figure: operational databases with refreshes at T1, T2, and T3)
• Build a new database
• After each time interval, add changes to database
• Archive or purge oldest data
• Run queries
Data Refresh Models
First, to ensure that you understand how the warehouse data presentation differs from
nonwarehouse data presentation, consider how up-to-date data is presented to users in
two different decision support environments: a simple extract processing environment
and a data warehouse environment.
Extract Processing Environment A snapshot of operational data is taken at regular
time intervals: T1, T2, and T3. At each interval a new snapshot of the database is
created and presented to the user; the old snapshot is purged.
Warehouse Environment An initial snapshot is taken and the database is loaded
with data. At regular time intervals, T1, T2, and T3, a delta database or file is created
and the warehouse is refreshed. A delta contains only the changes made to operational
data that need to be reflected in the data warehouse.
• The warehouse fact data is refreshed according to the refresh cycle determined by
user requirements analysis.
• The warehouse dimension data is updated to reflect the current state of the
business, only when changes are detected in the source systems.
• The older snapshot of data is not removed, ensuring that the warehouse contains
the historical data needed for analysis.
• The oldest snapshots are archived or purged only when the data is not required any
longer.
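The difference between the two refresh models can be sketched with plain lists. In the extract processing model each new snapshot replaces the old one. In the warehouse model each delta is added to what is already there, and only the oldest periods are purged. The function names, the (period, amount) row layout, and the retain_periods parameter are hypothetical.

```python
def extract_processing_refresh(current_snapshot, new_snapshot):
    """Extract processing: the new snapshot replaces the old one entirely."""
    return list(new_snapshot)


def warehouse_refresh(warehouse_rows, delta_rows, retain_periods):
    """Warehouse: add the delta, then purge periods beyond the retention limit."""
    if retain_periods < 1:
        raise ValueError("retain_periods must be at least 1")
    combined = warehouse_rows + delta_rows
    periods = sorted({period for period, _ in combined})
    keep = set(periods[-retain_periods:])  # the most recent periods
    return [row for row in combined if row[0] in keep]


# Rows are (period, amount); the initial load holds two months of history.
warehouse = [("1998-01", 100), ("1998-02", 120)]

# T1: the delta adds March; nothing is old enough to purge yet.
warehouse = warehouse_refresh(warehouse, [("1998-03", 90)], retain_periods=3)

# T2: the delta adds April; January now falls outside the retained history.
warehouse = warehouse_refresh(warehouse, [("1998-04", 130)], retain_periods=3)

# The extract processing model keeps only the latest snapshot.
snapshot = extract_processing_refresh([("1998-03", 90)], [("1998-04", 130)])

print(warehouse)
print(snapshot)
```

In practice the purged periods would usually be archived rather than discarded, as the last bullet above notes.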
First-Time Load
• Single event that populates the database with historical data
• Involves large volume of data
• Employs distinct ETT tasks
• Involves large amounts of processing after load
(Figure: operational databases at T1, T2, and T3)
Refresh
• Performed according to a business cycle
• Simpler task
• Less data to load than first-time load
• Less-complex ETT
• Smaller amounts of postload processing
(Figure: operational databases at T1, T2, and T3)
First-Time Load and Refresh
First-Time Load The first-time load (sometimes called an initial load) is a single
event that occurs prior to implementation. It populates the data warehouse database
with as much data as needed or available. The first-time load moves data in the same
way as the regular refresh. However, the task is more complex because of:
• Data volumes that may be very large (Your company decides to load the last five
years of data, which may comprise millions of rows. The time taken to load the
data may be in days rather than hours.)
• Distinct extraction and transformation tasks that are applicable only to this older
data
• The task of populating all fact tables, all dimension tables, and any other ancillary
tables you may have created such as reference tables
• Postprocessing of loaded data, with tasks that must work on the large data
volumes, such as indexing and key generation
• Postload processing on large volumes of data, such as creating summary tables
With all the issues surrounding the first-time load, it is a task not to be considered
lightly. You must plan, prepare, and have recovery capabilities built in to your
processing routines to ensure success.
Refresh After the first-time load, the refresh is performed on a regular basis
according to a cycle determined by users. The cycle may be daily, weekly, monthly,
quarterly, or any other business period. The refresh is a simpler task than the
first-time load for these reasons:
• There is less fact data to load. You are moving a new snapshot of data but not all
fact data into the data warehouse.
• There is no dimension data to load (unless your model has changed, which would
be an exception). There may be some dimensional data changes to incorporate.
• Less-complex extraction and transformation processes may be needed.
Additionally, because these processes are executed regularly, they can be
monitored, tested, and improved for each refresh until they run as optimally as
possible.
• Postload processing time is reduced and there is less new data to work with.
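The difference between the two kinds of load can be sketched in a few lines of Python. This is only an illustration, assuming that each source row carries a change date; the changed_on field and the function names are hypothetical and are not part of the course material. The first-time load takes every row, whereas a refresh takes only the rows changed since the previous load.

```python
from datetime import date


def rows_for_first_time_load(source_rows):
    """First-time load: every available source row is moved."""
    return list(source_rows)


def rows_for_refresh(source_rows, last_load_date):
    """Refresh: only rows changed after the previous load are moved."""
    return [row for row in source_rows if row["changed_on"] > last_load_date]


# Hypothetical source rows; changed_on is an assumed change-tracking column.
source = [
    {"order_id": 1, "changed_on": date(1999, 3, 30)},
    {"order_id": 2, "changed_on": date(1999, 4, 2)},
    {"order_id": 3, "changed_on": date(1999, 4, 5)},
]

initial = rows_for_first_time_load(source)
refresh = rows_for_refresh(source, last_load_date=date(1999, 4, 1))
print(len(initial), len(refresh))
```

In practice the changed rows would be identified by whichever capture technique you chose during extraction; a date comparison is simply the easiest one to show.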
.....................................................................................................................................................
Data Warehousing Fundamentals
12-9
Lesson 12: Transportation: Loading Warehouse Data
.....................................................................................................................................................
Building the Transportation Process
Specification
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth
Building the Transportation Process (continued)
• Test the proposed technique
• Document proposed load
• Gain agreement on the process
• Monitor
• Review
• Revise
Building the Transportation Process
Specifying the Process
You need to identify early in the development process how you are going to move
the data from the source systems into the data warehouse. You must identify:
• The data movement techniques and tools available
• File transfer methods and transfer models available
• The time available to load the data into the warehouse (the load window)
• Whether the time window is sufficient for other tasks such as backup,
preventive maintenance, and recovery, given expected performance metrics
• The volumes of data involved in the first time load and subsequent refreshes
• The frequency of the refresh cycle and the grain of the data
• Connectivity bandwidth
Testing the Process
You should test the proposed technique to ensure that volumes can be physically
moved within the load window constraints and network capabilities.
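Before you test with real data, a rough calculation shows whether a volume can plausibly be moved within the load window. The following Python sketch is an illustration only: the row count, row size, and sustained throughput are assumed figures, and the function names are hypothetical.

```python
def estimated_load_hours(row_count, bytes_per_row, throughput_mb_per_second):
    """Estimate the hours needed to move a given volume at a sustained rate."""
    total_bytes = row_count * bytes_per_row
    seconds = total_bytes / (throughput_mb_per_second * 1_000_000)
    return seconds / 3600


def fits_load_window(row_count, bytes_per_row, throughput_mb_per_second, window_hours):
    """Return True if the estimated transfer time fits inside the load window."""
    hours = estimated_load_hours(row_count, bytes_per_row, throughput_mb_per_second)
    return hours <= window_hours


# Assumed refresh: 50 million rows of 200 bytes over a link sustaining 1 MB/s.
hours = estimated_load_hours(50_000_000, 200, 1.0)
print(f"{hours:.2f} hours")                                     # 2.78 hours
print(fits_load_window(50_000_000, 200, 1.0, window_hours=4))   # True
```

An estimate of this kind covers only the data transfer. Postprocessing, backup, and recovery time must still fit in the same window, which is why a physical test remains necessary.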
Documenting the Process
You must document the proposed load and communicate it to the operations
organization to ensure their agreement and commitment to this important process.
Monitoring, Reviewing, and Revising the Process
You should ensure that the load is constantly monitored and reviewed, and revise
metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and
data granularity need regular revision.
Granularity
• Important design and operational issue
• Space requirements
  – Storage
  – Backup
  – Recovery
  – Partitioning
  – Load
• Low-level grain
  – Expensive, high level of processing, more disk, detail
• High-level grain
  – Cheaper, less processing, less disk, little detail
Granularity
You have seen that the grain of the data is important in the warehouse environment.
The lower the level of granularity, the more data is loaded, and this affects the amount
of time taken to load the data into the warehouse.
Low-Level Grain: Low-level grain data can be expensive to build and maintain. It
requires a large amount of processing power to process the details and provide answers
to business queries. It takes up more disk space and could create response time
problems. However, the detail provides the information needed at a low level to
support sophisticated business analysis.
High-Level Grain: High-level grain data is easier to build and maintain than
low-level grain data. It requires less processing power and disk space, allows a higher
number of concurrent users to access data, and performs well. However, the lack of
detail and drill-down capability hinders definitive answers to business questions.
Note: The level of granularity affects not only the amount of direct access storage
device (DASD) space required for warehouse data, but also the amount of space
required for backup, recovery, and partitioning.
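The effect of grain on volume can be shown with simple arithmetic. The sketch below assumes a hypothetical sales fact table keyed by product, store, and time period, and it computes an upper bound in which every product sells in every store in every period. The figures and the function name are illustrative assumptions, not values from the course.

```python
def max_fact_rows(products, stores, periods):
    """Upper bound on fact rows: one row per product, store, and period."""
    return products * stores * periods


# One year of data for 1,000 products in 50 stores, at two grains.
daily_rows = max_fact_rows(products=1_000, stores=50, periods=365)
monthly_rows = max_fact_rows(products=1_000, stores=50, periods=12)

print(daily_rows)                              # 18250000
print(monthly_rows)                            # 600000
print(round(daily_rows / monthly_rows, 1))     # 30.4
```

Under these assumptions the daily grain holds about thirty times as many rows as the monthly grain, and the same factor carries through to load time, backup, and recovery.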
Transportation Techniques
• Tools
• Utilities and 3GL
• Gateways
• Customized copy programs
• Replication
• FTP
• Manual
Transporting the Data
Now that you have seen how to capture the data needed for the refresh, consider how
to physically move the data to the warehouse server.
Transportation Techniques
These common techniques are used to transport data into the warehouse:
• Purchased ETT tools
• Proprietary data movement utilities that use, for example, COBOL, C, or Oracle
SQL*Loader
The fastest way to load large amounts of data into the warehouse is to use utilities
such as SQL*Loader that can access the database directly, use networks efficiently,
and run in parallel environments.
• Gateways, which may be vendor-specific or programmable, such as the Oracle
Transparent Gateways
• Customized copy programs, which may employ COBOL, C, PL/SQL, and FTP
To a lesser degree, these are also solutions:
• Replication (database)
• File Transfer Protocol (FTP) alone
• Manual shipping of the load medium to the data warehouse site
Transportation Technique Considerations
• Tools are comprehensive but costly.
• Data-movement utilities are fast and powerful.
• Gateways are not always the fastest method:
  – Access other databases
  – Supply dependent data marts
  – Support a distributed environment
  – Provide real-time access if needed
Transportation Technique Considerations
Purchased ETT Tools: If your IT group has decided to use a purchased ETT tool,
then it becomes the means by which your data is transported, as well as extracted and
transformed. This is not the most common option, particularly for early
implementations. Often, because of the cost of such tools, copy utilities are the
logical alternative.
Data-Movement Utilities: Oracle implementations use SQL*Loader, which is
capable of executing in parallel environments, running in a mode where server
intervention is minimized, and performing limited transformations, such as merging
rows and changing data types. SQL*Loader is capable of loading very large volumes
of data in a relatively short time, and you can use it successfully for both the
first-time load and refreshes.
Gateways: A gateway is a middleware component that presents a unified view of
data coming from different data sources. Of note are Oracle Transparent Gateways (or
Procedural Gateways) and Open Database Connectivity (ODBC) tools, which present a
uniform view of a database other than an Oracle database, or of a file on specific file
systems. Oracle gateways are a mixture: some are read-only, while others are
read-write.
Access to Another Vendor’s Database: You should consider using gateway
technology in specific instances only, and not on a regular basis. For example,
gateway technology allows you to access a database that is not an Oracle
database directly, without executing the usual extract programs. If the access is a
simple SQL SELECT of data that is to be processed for the warehouse, this is
quicker than building a specific extract for the task.
Develop a Distributed Environment: Gateway technology also gives you the
ability to develop warehouses on distributed environments, employing technologies
(hardware and software) that are not Oracle-specific.
Real-Time Data Access: It is rare, but some data warehouse implementations are
updated in real time. In this situation gateway technology is useful because of the
ease of executing remote queries. Consider using gateway technology for this
purpose only if it is specifically requested and you can justify it.
Using SQL*Loader to Load Data
[Figure: SQL*Loader with its input files, control file, log files, bad files, and discard files]
• Fastest load mechanism
• Direct path
• Parallel and unrecoverable
• Direct-load INSERT (Oracle8)
• Direct-path load API (Oracle8i)
Using SQL*Loader to Load Data
The fastest way to load data is to use SQL*Loader with the direct path, parallel, and
unrecoverable options.
Direct Path Load: Direct path load is optimized for maximum data loading
capability. Instead of filling a bind array buffer and creating INSERT commands,
a direct path load creates data blocks in Oracle database block format. The blocks are
then written directly to the database. The load makes calls to Oracle, but they are
quick and are handled at the start and end of the load process. Only one direct path
load can occur on a table at any one time, unless the loads are run in parallel.
Direct-Path Load in Parallel: You can run direct path loads in parallel. Parallel
loading can load massive amounts of data in short time frames. Use the PARALLEL
parameter. Note that conventional path load can also perform parallel loads on
the same table, just like any other program or utility that uses SQL INSERT
statements.
Direct-Path Load in Parallel and Unrecoverable: To avoid bottlenecks on redo
logs, switch on the UNRECOVERABLE option of SQL*Loader. There is no need to
write changes to redo logs in this environment.
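The three options can be combined in one load job. The Python sketch below only builds the text of a control file and the matching sqlldr command line; it does not run the load. The table, file, column, and user names are hypothetical. The UNRECOVERABLE keyword in the control file and the direct and parallel command-line parameters are standard SQL*Loader options, but you should confirm the exact syntax against the Utilities guide for your release.

```python
def build_control_file(table, data_file, columns):
    """Return the text of a control file for an unrecoverable load."""
    column_list = ", ".join(columns)
    return (
        "UNRECOVERABLE\n"             # do not write the loaded data to the redo log
        "LOAD DATA\n"
        f"INFILE '{data_file}'\n"
        "APPEND\n"                    # add rows to the data already in the table
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ','\n"
        f"({column_list})\n"
    )


def build_command(user, control_file):
    """Return the sqlldr command line for a direct path load run in parallel."""
    return [
        "sqlldr",
        f"userid={user}",             # sqlldr prompts for the password
        f"control={control_file}",
        "direct=true",                # direct path instead of conventional path
        "parallel=true",              # allow concurrent loads into the same table
    ]


# Hypothetical names, for illustration only.
control_text = build_control_file("sales", "sales_1.dat", ["sale_id", "customer_id", "amount"])
command = build_command("dw_loader", "sales_1.ctl")
print(control_text)
print(" ".join(command))
```

In a parallel load you would typically start several such sessions at the same time, each with its own data file and control file.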
Direct-Load INSERT: In Oracle8, direct-load INSERT enhances performance
during insert operations by formatting and writing data directly into Oracle data files
without using the buffer cache. It has these benefits over direct path load:
• A single failure in one parallel load stream does not stop the whole process.
• The data is already in Oracle format, so the load does not have to convert data.
• It does not log redo information and can work in parallel.
Direct-Path Load API: Oracle8i provides an application programming interface
(API) to the direct-path load mechanism in the Oracle Server. This API is described
next.
Direct-Path Load API in Oracle8i
• Allows ETT and other tools to load Oracle databases efficiently
• Permits load behavior to be customized
• Gives direct-path load performance
• Provides complete access to all direct-load functionality using OCI
[Figure: Load utility]
Using Direct-Path Load API in Oracle8i
Oracle8i provides an application programming interface to the direct path load
mechanism in the Oracle server. This provides a way for independent software
vendors and system management tool partners to create easy-to-use and
high-performance customized data-loading tools. Access to all load functionality is
available through the API. Performance of any third-party data loading tool can
therefore be comparable to SQL*Loader.
More Transportation Technique Considerations
• Use customized programs as a last resort
• Replication is limited by data-transfer rates
More Transportation Technique Considerations
Customized Programs: If you are employing Oracle for your warehousing
environment, SQL*Loader is recommended. Use customized programs only as a last
resort.
Replication: Replication is rarely used in a data warehouse environment because of
the limitations of data-transfer rates. It is normal to use SQL*Loader or
in-house-developed loading techniques. If replication is used, it is more likely to be
used to feed data marts from a larger warehouse.
Note: Replication is not recommended for moving large volumes of data.
Postprocessing of Loaded Data
[Figure: Extract, transform, and transport, followed by postprocessing of loaded data: create indexes, generate keys, summarize, and filter]
Postprocessing of Loaded Data
You have now seen how to extract data to an intermediate file store or staging area,
where it is:
• Transformed into acceptable warehouse data
• Transported to the warehouse server
You have also seen how the ETT process is slightly different for:
• First-time load, which requires all data to be loaded once
• Refreshing, which requires only changed data to be loaded
You now need to consider the different tasks that might take place once the data is
loaded. Various terms are used for these tasks; this course uses the term
postprocessing.
The list of postprocessing tasks is not definitive; you may or may not have to perform
them, depending upon the volumes of data moved, the complexity of transformations,
and the transportation mechanism. For example, it is possible to load data using
SQL*Loader in a manner that excludes database trigger processing. However, at the
warehouse server you do want to ensure that the triggers are executed so that the
integrity and validity of data are retained. Work of this kind, performed after the
load, is referred to as postprocessing.
Four categories of postprocessing are explored on the following pages:
• Creating indexes
• Creating keys
• Creating summary tables
• Filtering
Indexing Data
• Before load: fast index reenablement
• During load: adds time to load window
• After load: adds time to load window
[Figure: Operational databases, staging file, warehouse database, and index]
Unique Indexes
• Disable constraints to load
• Enable constraints to create index
[Figure: Disable constraints, load data, enable constraints, create index, catch errors, reprocess]
Indexing Data
Before: Indexing of data may occur before the load. You can index the data values for
the warehouse after data cleansing and before transportation and load. You can retrieve
the data from a presorted list of values much more rapidly by reading the index, rather
than performing a full-table scan. This makes it easier to reenable indexes at the server
level. However, this is not done very often.
During: It is possible to create the indexes at the same time as loading the data, using
the usual techniques employed by the server. However, this is a row-by-row
approach to index creation, which lengthens the time to load data. In most cases the
time taken is too long, and for this reason the next option is preferable.
After: It is common to index after the data has been loaded into the warehouse. This
adds time to the load window, but it is much faster than row-by-row processing, and
in a parallel environment you can speed up index creation by indexing in parallel.
Unique Indexes
If the index you are creating enforces unique values in key columns through
database constraints, then it is usual to load the data with the constraints disabled
and then enable them. Enabling the constraints builds the index, which may find
duplicate values and fail. Ensure that the action catches the errors so that you can
correct the data and reindex.
Using SQL, you can employ the EXCEPTIONS INTO clause to catch errors.
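The load, index, catch errors, and reprocess sequence can be sketched with Python and its built-in sqlite3 module. This is an analogy, not Oracle syntax: SQLite cannot disable and enable constraints and has no EXCEPTIONS INTO clause, so the sketch loads into a table that has no unique index, attempts to build the index afterward, and moves duplicate rows into a hypothetical customer_exceptions table when the build fails. All table and column names are assumptions.

```python
import sqlite3


def load_then_index(rows):
    """Load rows with no unique index in place, then build the index.

    Returns (connection, duplicate_keys). Rows whose key is duplicated are
    moved to an exceptions table so that the unique index can be created.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE customer_exceptions (customer_id INTEGER, name TEXT)")

    # Load first: nothing checks uniqueness while the rows go in.
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)

    try:
        # Building the unique index is where duplicates are detected.
        conn.execute("CREATE UNIQUE INDEX customer_uk ON customer (customer_id)")
        return conn, []
    except sqlite3.DatabaseError:
        pass  # the index build failed; find and set aside the offending rows

    duplicate_keys = [key for (key,) in conn.execute(
        "SELECT customer_id FROM customer "
        "GROUP BY customer_id HAVING COUNT(*) > 1 ORDER BY customer_id")]
    for key in duplicate_keys:
        conn.execute(
            "INSERT INTO customer_exceptions "
            "SELECT customer_id, name FROM customer WHERE customer_id = ?", (key,))
        conn.execute("DELETE FROM customer WHERE customer_id = ?", (key,))

    # Reindex now that only unique keys remain in the target table.
    conn.execute("CREATE UNIQUE INDEX customer_uk ON customer (customer_id)")
    return conn, duplicate_keys


conn, duplicates = load_then_index([
    (109908, "Acme Inc."),
    (109909, "Beta Ltd."),
    (109908, "Acme Incorporated"),
])
print(duplicates)   # [109908]
```

The rows held in the exceptions table would then be corrected and reprocessed in a later step.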
Creating Artificial Keys
• Use generalized or derived keys
• Maintain the uniqueness of a row
• Use an administrative process to assign the key
• Concatenate operational key with number:
  – Easy to maintain
  – Cumbersome keys
  – No clean value for retrieval
  [Example: 109908 becomes 10990801]
Creating Unique Keys for Records
• Assign a number from a list:
  – No semantic meaning
  – Extract operations must reference table to assign numbers
  [Example: 109908 becomes 1]
• Update metadata
• Verdict
Creating Artificial Keys
An artificial (generalized or derived) key may be used to guarantee that every row in
the table is unique. The warehouse data is likely to be a combination of many
transformed records, for which there are no natural data keys to use as unique
identifiers.
Concatenate Operational Key with a Number
Your postprocessing program executes the create index commands and allocates the
key values, which may be a concatenation of the primary key and a version digit or
characters.
For example, if a customer record key value contains six digits, such as 109908, the
derived key may be 10990801. The last two digits are a sequential number that is
generated automatically.
Advantage: This method is relatively easy to maintain, and the programs needed to
manage number allocation are relatively easy to set up.
Disadvantage: The disadvantages of this method are that:
• The keys may become long and cumbersome.
• There is no clean key value for retrieval of a record, unless you have another copy
of the key. For example, if the operational Customer_Id is 109908 but the
warehouse key is now 10990801, then extracting information about that customer
from the warehouse using 109908 is impossible unless the old value has been
retained in another field, such as:
Customer_key   Customer_id   Customer_Name
10990801       109908        Acme Inc.
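The concatenation described above can be sketched in Python. The function names and the fixed width of two digits for the sequence number are assumptions made for illustration; the values are the ones used in the example.

```python
def derived_key(operational_key, sequence, width=2):
    """Append a zero-padded sequence number to the operational key."""
    if not 0 <= sequence < 10 ** width:
        raise ValueError("sequence does not fit in the reserved digits")
    return int(f"{operational_key}{sequence:0{width}d}")


def operational_key_of(key, width=2):
    """Recover the operational key by dropping the sequence digits."""
    return key // 10 ** width


key = derived_key(109908, 1)
print(key)                       # 10990801
print(operational_key_of(key))   # 109908
```

The second function works only while the width of the sequence number is fixed and known, which is one reason for retaining the operational key in its own column.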
Assign a Number from a List
You can also assign the key sequentially from a simple list of numbers. A disadvantage
of this method is that the keys have no semantic or intuitive meaning.
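Assigning keys from a list implies a lookup table that maps each operational key to its warehouse key, which the extract operations must consult. A minimal Python sketch follows; the class name is hypothetical, and an in-memory dictionary stands in for what would be a database table in practice.

```python
class KeyAllocator:
    """Hand out sequential warehouse keys and remember each assignment."""

    def __init__(self, first_key=1):
        self._next_key = first_key
        self._assigned = {}   # operational key -> warehouse key

    def key_for(self, operational_key):
        """Return the existing warehouse key, or assign the next number."""
        if operational_key not in self._assigned:
            self._assigned[operational_key] = self._next_key
            self._next_key += 1
        return self._assigned[operational_key]


allocator = KeyAllocator()
print(allocator.key_for(109908))   # 1
print(allocator.key_for(110455))   # 2
print(allocator.key_for(109908))   # 1, the earlier assignment is reused
```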
Metadata
You must ensure the metadata is updated to register the latest key allocations.
Verdict
The option you choose depends upon the extract methods, the tools available, and the
hardware and network capability and availability.
Creating Summary Tables
• CTAS
• pCTAS
[Figure: Summary data]
Filtering Data
From warehouse to data marts
• CTAS
• pCTAS
[Figure: Warehouse, summary data, and data marts]
Creating Summary Tables
This course has already discussed why summary tables are useful to the data
warehouse.
• They provide immediate answers to queries, which improves query performance.
• They save disk space. You can create summary data for old history for which
detailed analysis is not required.
After you perform initial user requirements analysis, you determine the summaries
needed by the user. However, you must constantly monitor access, from which you
may be able to determine new summaries that should be created and summaries no
longer needed.
You can create summaries by using:
• CREATE TABLE AS SELECT (CTAS), or
• CREATE TABLE AS SELECT... PARALLEL (pCTAS)
Filtering Data
You may filter out specific information to supply subject-specific data for dependent
data marts. The filtering uses simple SQL to create new objects using existing objects.
The new objects are then moved into the data mart, similar to the way data is moved
into the warehouse.
You can perform this filtering task using:
• CREATE TABLE AS SELECT (CTAS), or
• CREATE TABLE AS SELECT... PARALLEL (pCTAS)
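Both uses of CTAS, summarizing and filtering, can be tried on a small scale with Python's built-in sqlite3 module, which also supports CREATE TABLE ... AS SELECT. The table and column names below are hypothetical, and the Oracle PARALLEL clause is not shown because SQLite has no equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_month TEXT, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("1999-01", "widget", 100),
    ("1999-01", "gadget", 250),
    ("1999-02", "widget", 75),
])

# Summary table: one row per month, built from the detail rows.
conn.execute("""
    CREATE TABLE sales_by_month AS
    SELECT sale_month, COUNT(*) AS sale_count, SUM(amount) AS total_amount
    FROM sales
    GROUP BY sale_month
""")

# Filtered table: only the rows needed by one subject-specific data mart.
conn.execute("""
    CREATE TABLE widget_sales AS
    SELECT sale_month, product, amount
    FROM sales
    WHERE product = 'widget'
""")

summary = conn.execute(
    "SELECT sale_month, sale_count, total_amount "
    "FROM sales_by_month ORDER BY sale_month").fetchall()
filtered = conn.execute("SELECT COUNT(*) FROM widget_sales").fetchone()[0]
print(summary)    # [('1999-01', 2, 350), ('1999-02', 1, 75)]
print(filtered)   # 2
```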
Verifying Data Integrity
• Load data into intermediate file
• Compare target flash totals with totals before load
[Figure: Counts and amounts from the intermediate file are compared with flash totals from the warehouse. If they are equal, load; if they are not equal, preserve, inspect, fix, then load.]
Verifying Data Integrity
It is important at all stages of ETT that errors be detected, flagged, and resolved. How
you verify data integrity depends upon whether you have a customized approach to
ETT or whether you employ an ETT tool, which will probably deal with these issues
automatically and allow you to see the data only when it is available in the
warehouse.
It is important to ensure that each load, whether first-time or refresh, executes
successfully. You need to create jobs that track:
• The status of the warehouse load: whether it has started, is in progress, or is complete
• When the process completes
• Statistics to show load start and complete time, and records processed in order to
monitor and ensure continuing efficiency
• Comparison of load control counts and amounts:
– You must be aware of the amounts of data that are to be loaded, so that you can
perform an accurate validation of completeness.
– You can load the detail and summary records into intermediate files, to
compare counts and amounts created before loading with counts and amounts
(flash totals) derived on the target data warehouse.
• Data reconciliation issues
• Referential integrity violations
• Any failures that require reprocessing
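The comparison of load control counts and amounts can be sketched as follows. The record layout, a key and an amount, and the function names are assumptions for illustration; Decimal is used so that monetary amounts compare exactly.

```python
from decimal import Decimal


def control_totals(records):
    """Return (row count, total amount) for an iterable of (key, amount) pairs."""
    count = 0
    total = Decimal("0")
    for _key, amount in records:
        count += 1
        total += Decimal(str(amount))
    return count, total


def reconcile(before_load, after_load):
    """Compare totals taken before the load with flash totals from the target."""
    expected = control_totals(before_load)
    actual = control_totals(after_load)
    return {"expected": expected, "actual": actual, "match": expected == actual}


extracted = [(1, "10.50"), (2, "4.25"), (3, "100.00")]
loaded_ok = [(1, "10.50"), (2, "4.25"), (3, "100.00")]
loaded_short = [(1, "10.50"), (2, "4.25")]

print(reconcile(extracted, loaded_ok)["match"])      # True
print(reconcile(extracted, loaded_short)["match"])   # False
```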
Steps for Verifying Data Integrity
[Figure: Steps 1 through 7, showing the source files, the control and extract process, SQL*Loader, and the .log and .bad files]
Steps for Verifying Data Integrity
You may find it useful to load the detail and summary records into intermediate files,
so that you can compare record counts and sample totals before loading on the target
data warehouse. If the counts and totals do not match, you must preserve and inspect
the intermediate files without loading and compromising data warehouse data
integrity.
Example: In the diagram, the source data comes in from a number of files.
1. The control and extract process queries and downloads the data, and appends a row
(either a row count value or a phony row of unique data).
2. The process generates a report indicating the data extraction information, such as
the number of rows downloaded, the number of bytes in the file, and the query
statement.
3. The process puts the extracted data into a flat file.
4. SQL*Loader loads the data into a database table.
5. The conversion and loading process generates a loader log to track the same type
of information as the extract report: the number of rows downloaded, the number
of bytes contained in the file, and conversion details.
6. At the end of the load process, the SQL*Loader script removes the last record of
the flat file and puts it into a filename.bad file, which then contains the row count
or phony record of data that was added by the extraction process.
7. A UNIX script compares the mainframe extraction report and the loader log to see
if they contain the same information. The script may also look at the .bad file to
determine if the correct last row of data was removed from the loading process. If
the reports match and the data in the .bad file is correct, then the loading process
is deemed successful.
If you are writing a custom mechanism, embed a set of rows into the data so that
verification is easier. You can query for the embedded data to see that all rows are
loaded. Your routine may also display messages, which are embedded in the load
routine, or send an e-mail.
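The appended row used in steps 1, 6, and 7 can be sketched in Python. The TRAILER tag, the comma-separated layout, and the function names are assumptions for illustration. The extract side appends a row-count record, and the verification side checks that record against the number of data rows in the file and the number of rows reported as loaded.

```python
def write_extract(rows):
    """Return extract file lines: one line per row plus a row-count trailer."""
    lines = [",".join(str(value) for value in row) for row in rows]
    lines.append(f"TRAILER,{len(rows)}")
    return lines


def verify_load(extract_lines, loaded_row_count):
    """Check the trailer count against the file contents and the load result."""
    *data_lines, trailer = extract_lines
    tag, _, count_text = trailer.partition(",")
    if tag != "TRAILER" or not count_text.isdigit():
        return False
    return int(count_text) == len(data_lines) == loaded_row_count


lines = write_extract([(109908, "Acme Inc."), (110455, "Beta Ltd.")])
print(lines[-1])               # TRAILER,2
print(verify_load(lines, 2))   # True
print(verify_load(lines, 1))   # False
```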
Standard Quality Assurance Checks
• Load status
• Completion of the process
• Completeness of the data
• Data reconciliation
• Violations
• Reprocessing
• Comparison of counts and amounts
Standard Quality Assurance Checks
The following tasks are standard quality assurance checks for the data loaded into the
warehouse:
• Status of the warehouse load
• Completion of the load process
• Completeness of the data
• Data reconciliation
• Referential integrity violations and reprocessing
• Comparison of load control counts and amounts
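Two of these checks, referential integrity violations and control counts and amounts, can be illustrated with a small query. The sketch below uses Python's built-in `sqlite3` module rather than an Oracle database, and the table names, column names, and data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (product_key INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales_fact VALUES (1, 10.0), (2, 5.5), (3, 7.25);
""")

# Fact rows whose key has no matching dimension row violate referential integrity.
orphans = conn.execute("""
    SELECT f.product_key, f.amount
    FROM sales_fact AS f
    LEFT JOIN product_dim AS d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL
""").fetchall()

# Control totals to compare against the figures supplied with the load.
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales_fact"
).fetchone()

print(orphans, row_count, total_amount)
```

Rows returned in `orphans` are candidates for reprocessing once the missing dimension data has been supplied.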
Summary
This lesson discussed the following topics:
• Tasks involved with first-time loading of data into the warehouse
• Techniques for transporting data
• Tasks involved in the postload processing stage
Practice 12-1 Overview
This practice covers the following topics:
• Identifying a series of statements as true or false
• Answering a series of questions
Practice 12-1
1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will
determine the load window where you work. Consider user requirements,
operational constraints, and staffing issues.
2 Identify whether the following statements are true or false.
Question (mark each statement True or False)
• Transportation of data involves moving the data into the data warehouse database.
• An example of high level grain data is summarized data.
• SQL*Loader is the fastest way to move data into the data warehouse database.
• Gateways are useful for moving large amounts of data into the warehouse.
• Data for the data warehouse is always indexed after it is loaded.
• The quickest way to create unique indexes on warehouse data is to leave database
constraints enabled on load.
• Summary tables are created on the warehouse server.
• Filtering removes unwanted records from staging files.
3 Name the two different types of data loading.
_____________________
_____________________
4 Name four methods of moving data to the warehouse server.
_____________________
_____________________
_____________________
_____________________
5 What SQL command is used to create summary tables on the data warehouse
server?
________________________________________________________________
6 What server technique can be used to prevent and allow access to data in the
warehouse after refresh?
________________________________________________________________
Lesson 13: Transportation: Refreshing Warehouse Data
[Overview slide: course roadmap with the blocks Defining DW Concepts & Terminology;
Planning for a Successful Warehouse; Choosing a Computing Architecture; Meeting a
Business Need; Modeling the Data Warehouse; Analyzing User Query Needs; Planning
Warehouse Storage; ETT (Building the Warehouse); Managing the Data Warehouse;
Supporting End User Access; Project Management (Methodology, Maintaining Metadata)]
Overview
In the last lesson, you examined the first-time load of the warehouse. In this lesson,
you examine methods for updating the warehouse with changed data after the
first-time load.
Note that the “ETT (Building the Warehouse)” block is highlighted in the overview
slide on the facing page.
Objectives
After completing this lesson, you should be able to do the following:
• Describe methods for capturing changed data
• Explain techniques for applying the changes
• Discuss techniques for purging and archiving data
• Outline final tasks, such as publishing the data, controlling access, and automating
processes
• List tools for transporting data into the warehouse
Developing a Refresh Strategy for Capturing Changed Data
• Consider load window
• Identify data volumes
• Identify cycle
• Know the technical infrastructure
• Plan a staging area
• Determine how to detect changes
[Diagram: operational databases; T1, T2, T3]
User Requirements and Assistance
• Users define the refresh cycle
• IT balances requirements against technical issues
• Document all tasks and processes
• Employ user skills
[Diagram: operational databases; T1, T2, T3]
Capturing Changed Data
You must have a strategy for maintaining changes to the data warehouse, including
changes to facts, dimension data, and summary data.
There are no concrete rules about when the data warehouse should be refreshed, but
there are several factors to consider:
• Total load window available
• The volume of data to be transferred
• How often does the warehouse data need to be updated? When are you going to
move the data? Will you refresh monthly, weekly, or at another time interval? Will
you use continuous refresh for nearly real-time data?
• The connectivity gear available for moving the data into the data warehouse. How
are you going to move the data? Will you move data in batch mode, which is
feasible for less time-critical applications?
• Are you going to move data from operational systems to an intermediate area? Is
this area an operational data store? Is it a flat file? Is it an Oracle database? Or is it
something completely unique to your implementation?
• How are changes in data to be detected? Are you going to push the changes
through when detected? Are you going to pull the changes in? Where are you
going to store the changes? Could you use triggers to force changes into an
alternative store?
User Requirements and Assistance
The strategy is primarily defined by user requirements, but they must be balanced
against the available technology and windows for loads. All must be documented and
understood by everyone involved in the project. The users can also provide expertise
for load verification, validation, run-to-run, and load controls.
Load Window
• Time available for entire ETT process
• Plan
• Test
• Prove
• Monitor
[Diagram: 24-hour timeline (0, 3 am, 6, 9, 12 pm, 3, 6, 9, 12) marking the load window
and the user access period]
Load Window
• Plan and build processes according to a strategy.
• Consider volumes of data.
• Identify technical infrastructure.
• Ensure currency of data.
• Consider user access requirements first.
• High availability requirements may mean a small load window.
[Diagram: 24-hour timeline (0, 3 am, 6, 9, 12 pm, 3, 6, 9, 12) marking the user access
period]
Load Window
The load window is simply the amount of time you have available to extract,
transform, load, postload process data, and make the data warehouse available to the
user. The load performs many sequential tasks that take time to execute.
You must ensure that every event that occurs during the load window is planned,
tested, proven, and constantly monitored. The effect of poor load performance is to
extend the load time and prevent users from accessing the data when it is needed.
Careful planning, defining, testing, and scheduling is critical.
Load Window Strategy
The load time depends on a number of factors, such as data volumes, network capacity,
and load utility capabilities. Remember that the aim is to ensure the currency of data
for the users, who require access to the data for analysis. To work out an effective
load window strategy, consider the user requirements first, and then work out the load
schedule backward from that point.
Determining the Load Window
It is usual to define the user access requirements first and work the load schedule
backward from that point. Once the user access time is defined, you can establish the
load cycles. Some of the processes overlap to enable all processes to run within the
window.
More realistically, almost twenty-four-hour access is required. This means the load
window is significantly smaller than the example shown here. In that event, you need
to consider how to process the load while still presenting users with current,
realistic data. This is where you can use partitioning strategies.
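Working the schedule backward from the user access time is simple arithmetic. The sketch below is illustrative only: the task names and durations are invented, and it assumes the tasks run strictly one after another, whereas in practice some processes overlap.

```python
from datetime import datetime, timedelta


def latest_start_times(access_time, tasks):
    """Work backward from the user access time.

    tasks is an ordered list of (name, duration) pairs run one after another.
    Returns (name, latest_start) pairs in execution order.
    """
    schedule = []
    deadline = access_time
    for name, duration in reversed(tasks):
        deadline -= duration          # this task must start no later than this
        schedule.append((name, deadline))
    schedule.reverse()
    return schedule


# Hypothetical figures: users need the data at 9 a.m.
access = datetime(1999, 5, 3, 9, 0)
tasks = [
    ("load", timedelta(hours=3)),
    ("index", timedelta(hours=1)),
    ("summarize", timedelta(hours=1, minutes=30)),
    ("backup", timedelta(minutes=30)),
]
for name, start in latest_start_times(access, tasks):
    print(f"{name:10s} must start by {start:%H:%M}")
```

With these durations the load has to begin by 03:00; any contingency for reapplying rejected rows pushes that start time earlier.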
Scheduling the Load Window
[Diagram: steps 1 to 4 on a timeline from 0 to 3 am: (1) requirements, (2) load cycle,
(3) control file, (4) control process. Files 1 and 2 are received by FTP; the control
process opens and reads the files to verify and analyze.]
Control file contents shown on the slide:
• File names
• File types
• Number of files
• Number of loads
• First-time load or refresh
• Date of file
• Date range
• Records in file (counts)
• Totals (amounts)
Scheduling the Load Window
From the example you can see that the transportation of data (that is, moving the data
to the server and loading into the warehouse tables) is a complex task involving many
steps.
To work out an effective load window strategy, consider the user requirements first,
and then work out the load schedule backward from that point.
Example of Scheduling the Load Window
1 Determine when the users require the data. If the working hours are between 9 a.m.
and 5 p.m., you allow them access during that period.
2 Once the user data-access time is defined, you can establish the load cycle. The
load cycle may need to access different extract files, or a different number of
extract files, each time the load is performed. You may need to split the cycle into
a series of loads using one file at a time.
3 You create a control file to manage every load, or series of loads. Remember that
the first-time load is different from refreshes, and that for each refresh the files and
number of files may differ.
The control file contains information such as the:
– File name and type
– Date of the file
– Number of records in the file
– Date range for the data in the file
– Counts of records and totals so that the data load can be verified
4 The control process is an active process that waits for the files named in the control
file to be received. The number and names of these files vary among loads. Files
are usually transferred using File Transfer Protocol (FTP) techniques. The control
process does not pass to any other process until all files are received and it has
opened and read count and amount data to be used for load verification and
analysis.
Note: The time 0 identified on the slides denotes 00:00 Zulu, which is midnight.
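Steps 3 and 4 can be sketched as a small data structure plus two checks. Everything named here (`ControlEntry`, `missing_files`, `verify_file`, the file names, and the figures) is hypothetical; a real control process would also poll for arriving files and read them from disk.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ControlEntry:
    """One line of a control file (hypothetical layout)."""
    file_name: str
    file_type: str
    file_date: date
    record_count: int
    total_amount: float


def missing_files(control, received_names):
    """Names listed in the control file that have not arrived yet."""
    return [e.file_name for e in control if e.file_name not in received_names]


def verify_file(entry, rows):
    """Compare the rows read from a received file with its control entry."""
    amounts = [amount for _, amount in rows]
    return (len(amounts) == entry.record_count
            and math.isclose(sum(amounts), entry.total_amount))


control = [
    ControlEntry("sales_1.dat", "fact", date(1999, 5, 2), 2, 12.5),
    ControlEntry("sales_2.dat", "fact", date(1999, 5, 2), 1, 4.0),
]

# The control process does not hand over until this list is empty.
print(missing_files(control, {"sales_1.dat"}))
print(verify_file(control[0], [("A", 10.0), ("B", 2.5)]))
```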
Scheduling the Load Window
[Diagram: steps 5 to 9 on a timeline from 3 am to 9 am: (5) load into warehouse, with
files 1 and 2 loaded in parallel, (6) verify, analyze, reapply, (7) index data,
(8) create summaries, (9) update metadata.]
Scheduling the Load Window
[Diagram: steps 10 to 13 on a timeline from 6 am to 9 am: (10) back up warehouse,
(11) create views for specialized tools, (12) users access summary data, (13) publish;
user access follows.]
Example of Scheduling the Load Window (continued)
5 The data is then loaded into the warehouse.
6 Each load requires verification and analysis (and maybe reanalysis once any load
exceptions are reapplied). You need to ensure that the data is successfully loaded
by performing checks against the row counts and amounts available in the control
files.
Any loading errors yielding potentially bad data need to be reapplied. This adds
time to the load, and contingency should be built into the cycle to cope with this. If
you are using SQL*Loader to move the data, the bad data resides in a file called
<filename>.bad.
7 Indexes are constructed.
8 Summarization takes place.
9 Metadata is updated to ensure it contains information about the current load.
10 The warehouse is backed up. With many database servers today, there are typically
two mechanisms for backup: hot, with users online, and cold, with users offline.
You should consider cold backups before user access. The backup should include:
– All warehouse data
– Summary tables
– Database schema
– Metadata
Note: If the information is supplied to the warehouse on tape, a full cold backup
may not be necessary. The summaries created at the target server may be all that
you need to back up.
11 Create the views required by specialized user tools, such as Oracle Express
RAM/RAA.
12 Give users access to the summary data.
13 Publish information to the users, specifying the changes to the data warehouse and
allowing them access.
Note: These steps identify one solution and assume that summarization and indexing
occur after load, and that the job is executed from a batch file.
Capturing Changed Data for Refresh
• Capture new fact data
• Capture changed dimension data
• Determine method for capture of each
• Methods:
– Wholesale data replacement
– Comparison of database instances
– Time stamping
– Database triggers
– Database log
• Hybrid techniques
Capturing Changed Data for Refresh
There are two major categories of changed data:
• New fact data
• Changed dimension data
For each, a different capture mechanism will be discussed.
In addition, consider how you will process the load. The fact data might easily be
loaded by adding another partition of data, a relatively straightforward process (for a
database administrator).
Changes to dimension data need more selective update. You need to evaluate whether
the change is to replace or add to an existing record, or whether you want to maintain
history (keeping old and new records).
• For example, the description of a product may change over its lifetime, even if its
primary (and unique) part number remains the same. It is important to see that
change reflected.
• Another common example is sales districts in a sales organization that reorganizes.
Methods
There are a number of ways to capture changes to data. Consider which is the most
efficient for your individual circumstances:
• Wholesale data replacement
• Comparison of database instances
• Time and date stamping
• Database triggers
• Database log
Note: All methods identified here are possible with Oracle server and associated
facilities and utilities.
Wholesale Data Replacement
[Diagram: operational databases; T1, T2, T3]
• Expensive
• Limited historical data, if any
• Data mart implementations
• Time period replacement
Comparison of Database Instances
[Diagram: yesterday’s and today’s operational databases feed a database comparison; a
delta file holds the changed data]
• Simple to perform, but expensive in time and processing
• Delta file:
– Changes to operational data since last refresh
– Used by various techniques
Wholesale Data Replacement
This method refreshes the entire warehouse in every business cycle. This method is
understandably very expensive. Every refresh needs to extract, transform, and
transport the entire warehouse. In fact, this method is similar to using a first-time load
on a regular basis.
Some data mart and online analytical processing server implementations use this
method because they hold less data (a subset of the data warehouse), and wholesale
replacement is less complex and less expensive than programming mirroring and
update procedures.
Issues
The time window required for wholesale replacement can often exceed the
time that the data is contracted to be offline (and unavailable to the users). However,
with mirroring strategies users can be directed to an image copy of the data warehouse
while maintenance is being performed. The changes that occur during the maintenance
cycle must be applied to the current online image (production version). The production
version should then be backed up or mirrored.
Historical data analysis is limited, because you are restricted by the sheer volume of
data loaded each time.
Comparison of Database Instances
In this method, you capture the differences between two instances of the same
database, to find out the changes that have occurred since the last time the data
warehouse was refreshed. The changes are held in an intermediate (or delta) file and
are used to update the warehouse.
Issues
Comparing instances is a simple but expensive way to determine changes. As with
wholesale replacement, it works most efficiently and effectively when the volumes of
data are small.
Delta File or Database
The delta database (or file) contains only the changes that
have been made to the operational system since the last refresh. An operational
application may need to be modified to create the delta file structure and contain the
new logic that captures changes and adds the rows to the delta file.
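A comparison of two database instances amounts to a keyed difference between two snapshots. The sketch below holds each snapshot as an in-memory dictionary keyed by primary key, which is an assumption made for brevity; the function name and sample rows are hypothetical.

```python
def build_delta(yesterday, today):
    """Compare two snapshots keyed by primary key and return the delta rows."""
    delta = []
    for key, row in today.items():
        if key not in yesterday:
            delta.append(("insert", key, row))
        elif yesterday[key] != row:
            delta.append(("update", key, row))
    for key, row in yesterday.items():
        if key not in today:
            delta.append(("delete", key, row))
    return delta


# Hypothetical customer rows: (name, marital status).
yesterday = {101: ("John Doe", "Single"), 102: ("Ann Lee", "Married")}
today = {101: ("John Doe", "Married"), 103: ("Raj Patel", "Single")}

for change in build_delta(yesterday, today):
    print(change)
```

Every row in both snapshots is read on every run, which is why the cost grows with the size of the database rather than with the number of changes.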
Time and Date Stamping
[Diagram: operational data is scanned; a delta file holds the changed data]
• Fast scanning for records changed since last extraction
• Date Updated field
• No detection of deleted data
Database Triggers
[Diagram: triggers on the operational server (DBMS) capture changes to the operational
data; a delta file holds the changed data]
• Changed data intercepted at the server level
• Extra I/O required
• Maintenance overhead
Time and Date Stamping
A time and date stamp on changed data quickly shows you the data that has been
changed since the last refresh cycle. The time and date stamp is normally part of a key
value, making it an efficient way to search and find changed data.
The advantage of this approach is that the process that creates the delta database only
needs to look at the time key and identify the records with the required time and date
stamp. Depending upon the frequency of refresh and the mechanism chosen for time
and date stamping, the search for the time value may be a specific date, for example,
Time_Key = ‘01-JAN-97’, a date range such as Time_Key BETWEEN ‘01-JAN-97’ AND
‘07-JAN-97’, or a pattern such as Time_Key LIKE ‘%JAN-97’.
Issues
You can use this method only if the database contains a Date Updated field,
which may not be the case in many operational systems. This is one issue that may be
resolved by reengineering source system applications or database server code. You
might add database triggers to perform the updates.
Note: Time and date stamps do not catch deleted data.
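A time-stamp extraction reduces to a single range predicate on the Date Updated field. The sketch below uses Python's built-in `sqlite3` module instead of an Oracle database, and the table, column names, and dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        status TEXT,
        date_updated TEXT  -- ISO 8601 text, so string order equals date order
    );
    INSERT INTO customer VALUES
        (1, 'Single',  '1996-12-28'),
        (2, 'Married', '1997-01-03'),
        (3, 'Single',  '1997-01-06');
""")

last_refresh = "1997-01-01"

# Only rows stamped on or after the last refresh go into the delta.
changed = conn.execute(
    "SELECT customer_id, status FROM customer "
    "WHERE date_updated >= ? ORDER BY customer_id",
    (last_refresh,),
).fetchall()

print(changed)
```

A row deleted from `customer` since the last refresh leaves nothing behind for this query to find, which is the limitation stated in the note above.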
Database Triggers
Procedural code in database triggers captures and identifies changed data at the
database level. Extra I/O is required while the system is online to track changes as they
occur and maintain a delta file if needed.
Issues
You must modify the database to add server (DBMS) triggers that capture
before and after images of the records.
The triggers and associated code (PL/SQL, if using Oracle) write the changes to a
delta database or file. Of course, to use this method, the server must support database
triggers.
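The idea of a trigger writing before and after images to a delta table can be shown with SQLite, driven from Python's built-in `sqlite3` module. This is SQLite trigger syntax, not Oracle PL/SQL, and the table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, status TEXT);

    -- Delta table holding the before and after images of each change.
    CREATE TABLE customer_delta (
        customer_id INTEGER, old_status TEXT, new_status TEXT
    );

    CREATE TRIGGER customer_after_update
    AFTER UPDATE OF status ON customer
    BEGIN
        INSERT INTO customer_delta
        VALUES (OLD.customer_id, OLD.status, NEW.status);
    END;

    INSERT INTO customer VALUES (1, 'Single');
    UPDATE customer SET status = 'Married' WHERE customer_id = 1;
""")

delta_rows = conn.execute("SELECT * FROM customer_delta").fetchall()
print(delta_rows)
```

Each update now costs one extra insert on the operational server, which is the additional I/O mentioned above.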
Using a Database Log
[Diagram: the operational server (DBMS) writes operational data and a log; log analysis
and data extraction produce a delta file that holds the changed data]
• Contains before and after images
• Requires system checkpoint
• Common technique
Verdict
• Consider each method on merit.
• Consider current technical, existing operational, and current application issues.
• Consider a hybrid approach if one approach is not suitable.
Using a Database Log
A log file contains information from which you can extract changed data; it logs
“before” and “after” images of the data. You may analyze the log file in batch mode to
identify the differences that become the delta file.
Issues
• The format of the log file may be difficult to interpret and use.
• The log tape is not really intended for use by the warehouse, and often contains a
lot of data not required by the warehouse.
• The system must wait for a checkpoint in order to get a stable log.
This is a process that many ETT tools use, but it can be done only on databases that
provide a log, such as Oracle and DB2.
Note: Oracle snapshot and replication facilities log changes into another table.
Verdict
Each of these mechanisms has its good and bad points. In reality, your data
warehousing environment might actually use a combination of these mechanisms. For
example, you might:
• Time-stamp changed dimension data
• Extract the data that exists within a database partition for the new fact data
• Use wholesale replacement to supply your dependent data marts with updated data
The choice you make is based on the many factors identified earlier in this lesson.
Applying the Changes to Data
You have a choice of techniques:
• Overwrite a record
• Add a record
• Add a field
• Maintain history
• Add version numbers
Overwriting a Record
[Diagram: the record (Customer Id, John Doe, Single) is replaced by (Customer Id,
John Doe, Married)]
• Easy to implement
• Loses all history
• Not recommended
Applying the Changes to Data
There are a number of methods for managing changes to existing data in dimension
tables:
• Overwrite a record
• Add a new record
• Add a current field
• Maintain history records
• Versioning of records
Overwriting a Record
This method is easy to implement, but it is useful only if you are not interested in
maintaining the history of data. If the data you are changing is critical to the context of
information and analysis of the business, then overwriting a record is to be avoided at
all costs.
For example, by overwriting dimension data, you lose all track of history—you can
never see that John Doe was single if the value “Single” is overwritten with the value
“Married” from the operational system. The Customer_Id for John Doe remains
constant throughout the life of the warehouse, because only one record for John Doe is
stored.
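The loss of history is easy to see in a small example. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `customer_dim` table; after the update, nothing in the table records that the earlier value ever existed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, marital_status TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1, 'John Doe', 'Single')")

# Overwrite in place: the key stays the same and the old value is gone.
conn.execute(
    "UPDATE customer_dim SET marital_status = 'Married' WHERE customer_id = 1"
)

rows = conn.execute("SELECT * FROM customer_dim").fetchall()
print(rows)
```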
.....................................................................................................................................................
Data Warehousing Fundamentals
13-21
Lesson 13: Transportation: Refreshing Warehouse Data
.....................................................................................................................................................
Adding a New Record
Before:
1    Customer Id   John Doe   Single
After:
1    Customer Id   John Doe   Single
1A   Customer Id   John Doe   Married
• History is preserved; dimensions grow.
• Time constraints are not required.
• Generalized key is created.
• Metadata tracks usage of keys.
Adding a Current Field
Before:
Customer Id   John Doe   Single
After:
Customer Id   John Doe   Single   Married   01-JAN-96
• Maintains some history
• Loses intermediate values
• Is enhanced by adding an Effective Date field
Adding a New Record
With this method, you add another dimension record for John Doe. One record shows
that he was “single” until December 31, 1995; the other shows that he has been
“married” since January 1, 1996. History is accurately preserved, but the dimension
tables grow.
• A generalized (or artificial) key is created for the second John Doe record.
• The generalized key is a derived value that ensures that a record remains unique.
However, you now have more than one key to manage.
• You also need to ensure that the record keeps track of when the change occurred.
The Customer_Id for John Doe does not remain constant throughout the life of the
warehouse, because each record added for John Doe contains a unique key. The key
value is usually a combination of the operational system identifier with characters or
digits appended to it.
Consider using real data keys. The example here shows a method that is commonly
identified in warehouse reference material.
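The add-a-new-record method might look like the sketch below. It assumes Python's built-in sqlite3 module rather than Oracle, and the customer_dim table, its columns, and the add_new_record helper are hypothetical. Keeping the operational identifier in its own column next to the generalized key is also an assumption made here so that the two records can still be related.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " generalized_key TEXT PRIMARY KEY,"  # unique per record version
    " operational_id TEXT,"               # identifier from the source system
    " name TEXT,"
    " marital_status TEXT,"
    " effective_from TEXT,"
    " effective_to TEXT)"                 # NULL marks the current record
)
conn.execute(
    "INSERT INTO customer_dim VALUES ('1', '1', 'John Doe', 'Single', NULL, NULL)"
)


def add_new_record(conn, operational_id, new_key, new_status, start, prior_end):
    """Close the current record and add a new one under a generalized key."""
    name = conn.execute(
        "SELECT name FROM customer_dim"
        " WHERE operational_id = ? AND effective_to IS NULL",
        (operational_id,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE customer_dim SET effective_to = ?"
        " WHERE operational_id = ? AND effective_to IS NULL",
        (prior_end, operational_id),
    )
    # Unchanged attributes (here, the name) are copied into the new record.
    conn.execute(
        "INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?, NULL)",
        (new_key, operational_id, name, new_status, start),
    )


add_new_record(conn, "1", "1A", "Married", "1996-01-01", "1995-12-31")

history = conn.execute(
    "SELECT generalized_key, marital_status, effective_from, effective_to"
    " FROM customer_dim ORDER BY generalized_key"
).fetchall()
print(history)
```

Both records survive, so a query can report the status that applied on any given date.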
Adding a Current Field
In this method, you add a new field to the dimension table to hold the current value of
the attribute. Using this method, you can keep some track of history. You know that
John Doe was “single” before he was “married”. Each time John’s marital status
changes, the two status attributes are updated and a new Effective Date is entered.
However, this method does not show the changes that took place between the two
values you are storing for John Doe; intermediate values are lost.
• Consider using an Effective Date attribute to show when the status changed.
• Partitioning of data can then be performed by effective date.
The method you choose is again determined by the business requirements. If you want
to maintain history, this method is a logical choice that can be enhanced by using a
generalized key.
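A minimal sketch of the current-field method follows, again assuming Python's built-in sqlite3 module in place of Oracle. The column names, the change_status helper, and the second status change (“Divorced” in 1999) are hypothetical and are included only to show how an intermediate value is lost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id TEXT PRIMARY KEY,"
    " name TEXT,"
    " previous_status TEXT,"
    " current_status TEXT,"
    " effective_date TEXT)"
)
conn.execute("INSERT INTO customer VALUES ('1', 'John Doe', NULL, 'Single', NULL)")


def change_status(conn, customer_id, new_status, effective_date):
    """Shift the current value into the previous field, then store the new one."""
    # The right-hand side of SET reads the row as it was before the update.
    conn.execute(
        "UPDATE customer SET previous_status = current_status,"
        " current_status = ?, effective_date = ? WHERE customer_id = ?",
        (new_status, effective_date, customer_id),
    )


change_status(conn, "1", "Married", "1996-01-01")
change_status(conn, "1", "Divorced", "1999-03-01")  # hypothetical later change

row = conn.execute(
    "SELECT previous_status, current_status, effective_date FROM customer"
).fetchone()
print(row)  # 'Single' no longer appears anywhere in the row
```

Only the two most recent values are kept, which is the limitation the text describes.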
Limitations of Methods for Applying Changes
• Complete history impossible
• Dimensions may grow large
• Maintenance overhead

1234     Comer   1 Main Street   555-6789
1234     Comer   200 First Ave   222-3211

1234     Comer   1 Main Street   555-6789
123401   Comer   200 First Ave   222-3211

                                            Effective Date
1234     Comer   1 Main Street   555-6789   01-Apr-93
123401   Comer   200 First Ave   222-3211   01-Jun-97
Limitations of Methods for Applying Changes
Assume a customer record as follows:
Custid   Name    Address         Phone
1234     Comer   1 Main Street   555-6789
If you overwrite the record, history is lost, and there is no record of this company ever
existing at 1 Main Street.
Custid   Name    Address         Phone
1234     Comer   200 First Ave   222-3211
You may add a record and create a generalized key to identify the row uniquely.
However, this method may make the dimension large and unmanageable and you have
lost that customer’s unique identifier.
Custid   Name    Address         Phone
1234     Comer   1 Main Street   555-6789
123401   Comer   200 First Ave   222-3211
You also have to duplicate the fields for this customer that have not changed into the
record with the new generated key, which adds to the maintenance burden.
You may add a current field and create a generalized key to uniquely identify the row:
Custid   Name    Address         Phone      Effective Date
123401   Comer   200 First Ave   555-6789   01-Jun-97
In this situation, you know that 200 First Ave. is the current address, but you have no
way of knowing the previous address details.
Maintaining History
[Diagram: Sales fact with Time, Product, and CUSTOMER dimensions; CUSTOMER has a one-to-many relationship to HIST_CUST]
• Always retain current record
• Consistently able to refer to record history
Maintaining History
Another alternative is to use history tables, which involve normalizing the dimensions
to hold current and historical data. Oracle consultants engaged in data warehouse
implementations have found this method to be a more comprehensive, effective, and
easily managed solution.
One-to-Many Relationship
Using this method, you keep one current record of the customer and many history
records in the customer history table (a one to many relationship between the tables),
thus maintaining history in a more normalized data model. The table below shows you
how the data might appear.
In the CUSTOMER table the customer operational unique identifier is retained in the
CUSTOMER.Id column. In the HIST_CUST table, the operational key is maintained
in the HIST_CUST.Id column and the generalized key in the HIST_CUST.G_id
column. This enables you to keep all the keys needed and multiple records for the
customer.
CUSTOMER.Id   HIST_CUST.Id   HIST_CUST.G_id
1234          1234           1234
                             1234A
                             1234B
4567          4567           4567
5678          5678           5678
                             5678A
                             5678B
The CUSTOMER table may contain full details for each customer; however, it could
contain only the key values, leaving the full details (including text descriptions) in the
HIST_CUST table.
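The one-to-many relationship between the two tables can be sketched as follows. The example assumes Python's built-in sqlite3 module in place of Oracle and stores only the key columns named in the text; the lower-case table and column names are an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        id TEXT PRIMARY KEY               -- operational identifier, one row per customer
    );
    CREATE TABLE hist_cust (
        g_id TEXT PRIMARY KEY,            -- generalized key, one row per version
        id TEXT REFERENCES customer(id)   -- operational key, repeated per version
    );
    """
)
conn.executemany("INSERT INTO customer VALUES (?)", [("1234",), ("4567",), ("5678",)])
conn.executemany(
    "INSERT INTO hist_cust VALUES (?, ?)",
    [
        ("1234", "1234"), ("1234A", "1234"), ("1234B", "1234"),
        ("4567", "4567"),
        ("5678", "5678"), ("5678A", "5678"), ("5678B", "5678"),
    ],
)

# One current record per customer, many history records per customer.
versions = conn.execute(
    "SELECT c.id, COUNT(h.g_id) FROM customer c"
    " JOIN hist_cust h ON h.id = c.id"
    " GROUP BY c.id ORDER BY c.id"
).fetchall()
print(versions)
```

Each customer keeps a single operational identifier while any number of history rows can hang off it.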
History Preserved
• History enables realistic analysis.
• History retains context of data.
• History provides for realistic historical analysis.
• Model must be able to:
  – Reflect business changes
  – Maintain context between fact and dimension data
  – Retain sufficient data to relate old to new
History Preserved
This method completely preserves history and is therefore very effective for
performing analysis over time where data has changed substantially. The context of
information is still preserved. A good example of where this applies is in a sales
organization.
Assume that you have a model containing a sales fact and dimensions such as
Customer, Sales Region, and Product.
Your warehouse contains sales figures for sales region Europe for the years 1992 and
1993. In 1994, the European region reorganizes and splits into East Europe and West
Europe. The warehouse is now maintaining data for each region from 1994 onward.
In 1997, users are asked to put together some projections based on the last five years’
sales in Europe. The 1992 and 1993 data is not split into East and West Europe. That is
not an issue, because you can still roll up the East and West regions into a total for
Europe and perform analysis over the five-year period.
If you reverse the scenario, so that two regions become one, the solution is the same.
The issue with retaining history and context is building a model that is able to:
• Reflect changes as the business changes
• Keep the context of information accurate between dimension and fact data
• Retain sufficient data to be able to relate old and new records where needed
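One way to retain enough data to relate old and new regions is to store a roll-up target with each region. The sketch below assumes Python's built-in sqlite3 module in place of Oracle. The region and sales tables, the parent_region column, and all sales amounts are hypothetical; only the region names and years come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE region (
        region_id TEXT PRIMARY KEY,
        parent_region TEXT          -- roll-up target kept across the reorganization
    );
    CREATE TABLE sales (
        year INTEGER,
        region_id TEXT REFERENCES region(region_id),
        amount INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO region VALUES (?, ?)",
    [("Europe", "Europe"), ("East Europe", "Europe"), ("West Europe", "Europe")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1992, "Europe", 100), (1993, "Europe", 110),  # before the split
        (1994, "East Europe", 40), (1994, "West Europe", 80),
        (1995, "East Europe", 45), (1995, "West Europe", 85),
        (1996, "East Europe", 50), (1996, "West Europe", 90),
    ],
)

# Roll every region up to its parent to get one Europe total per year.
europe_by_year = conn.execute(
    "SELECT s.year, SUM(s.amount) FROM sales s"
    " JOIN region r ON r.region_id = s.region_id"
    " WHERE r.parent_region = 'Europe'"
    " GROUP BY s.year ORDER BY s.year"
).fetchall()
print(europe_by_year)
```

Because the old and new regions share a parent, the five-year series is continuous even though the detail level changed in 1994.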
Version Numbering
• Avoid double counting
• Facts hold version number
[Diagram: Sales fact with Customer, Time, and Product dimensions]
Customer.CustId   Version   Customer Name
1234              1         Comer
1234              2         Comer
Sales.CustId   Sales Facts   Version
1234           11,000        1
1234           12,000        2
Version Numbering
You can also maintain a version number for the customer in the Customer dimension:
Custid   Name    Address         Version
1234     Comer   1 Main Street   1
1234     Comer   200 First Ave   2
You must ensure that the measures in the fact table, such as sales figures, also contain
the customer version number to avoid any double counting:
Custid   Version   Sales $
1234     1         11,000
1234     2         12,000
1234     1          5,000
1234     1         10,000
1234     2         45,000
1234     2         30,000
1234     1         10,000
For Comer Version 1, the sales total is $36,000.
For Comer Version 2, the sales total is $87,000.
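The totals above, and the double counting that the version number prevents, can be reproduced with a short sketch. It assumes Python's built-in sqlite3 module in place of Oracle; the table and column names are hypothetical, and the rows are the ones shown in the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (custid TEXT, version INTEGER, name TEXT, address TEXT)")
conn.execute("CREATE TABLE sales (custid TEXT, version INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [("1234", 1, "Comer", "1 Main Street"), ("1234", 2, "Comer", "200 First Ave")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("1234", 1, 11000), ("1234", 2, 12000), ("1234", 1, 5000),
        ("1234", 1, 10000), ("1234", 2, 45000), ("1234", 2, 30000),
        ("1234", 1, 10000),
    ],
)

# Joining on Custid alone matches every fact row to both customer versions.
double_counted = conn.execute(
    "SELECT SUM(s.amount) FROM sales s JOIN customer c ON c.custid = s.custid"
).fetchone()[0]

# Joining on Custid and Version matches each fact row exactly once.
correct_total = conn.execute(
    "SELECT SUM(s.amount) FROM sales s"
    " JOIN customer c ON c.custid = s.custid AND c.version = s.version"
).fetchone()[0]

by_version = dict(
    conn.execute("SELECT version, SUM(amount) FROM sales GROUP BY version")
)
print(double_counted, correct_total, by_version)
```

The join without the version number reports twice the true sales figure, which is why the fact rows must carry the version.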
Purging and Archiving Data
• As data ages, its value depreciates.
• Remove old data from the warehouse:
  – Archive for later use
  – Purge without copy
Purging and Archiving Data
Data may reside in the warehouse for many more years than it would in an operational
system; however, it does not remain forever. The value of data to the business
diminishes over time.
During analysis, the analysts determine the useful life span of the data. In addition, old
data may simply be summarized; the detail is not needed.
What Is Purge?
If there is no chance of ever needing the data again, even for summaries, then you can
purge it. This removes the data entirely; no copy is retained.
What Is Archive?
If you feel you may need the data in the future—to build summaries, for example—
then archive the data to low-cost storage devices that are not associated with the data
warehouse.
Your Role
You need to ensure that you have the strategies in place that meet determined business
requirements for purge and archive.
Techniques for Purging Data
• TRUNCATE: Retains no rollback
• DELETE: Retains redo and rollback
• ALTER TABLE: Removes a partition
• PL/SQL: Uses database triggers
Techniques for Purging Data
TRUNCATE Command: The SQL TRUNCATE command is the quickest way to
purge data. It retains no rollback information, so the operation cannot be rolled back.
It is also useful for emptying a temporary table that is used repeatedly as part of a
regular load or summary process. Indexes on the table are also truncated.
DELETE Command: The SQL DELETE command is used if the data has not been
partitioned. DELETE retains redo and rollback information, so you need to size the
rollback segments carefully. NOLOGGING does not apply to DELETE or UPDATE.
DELETE can run in parallel only on partitioned tables. Oracle8 syntax enables you to
delete rows from a partition.
When you delete rows from a table, the corresponding entries in every index on the
table must also be deleted. This has a performance impact.
ALTER TABLE Command: Given that your warehouse data is commonly
partitioned by time, you can simply remove a partition containing old data.
PL/SQL Triggers: Where there are special requirements and low volumes of data,
you can use PL/SQL and the ON DELETE database trigger. This is, however, an
expensive option.
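The DELETE technique, including the fact that the deleted rows can be restored until the transaction is committed, can be illustrated with the sketch below. It assumes Python's built-in sqlite3 module in place of Oracle, so it does not show TRUNCATE or partition removal, which SQLite does not provide. The sales table, its index, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_year INTEGER, amount INTEGER)")
conn.execute("CREATE INDEX sales_year_idx ON sales (sale_year)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1992, 100), (1993, 110), (1998, 150)],
)
conn.commit()


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


# DELETE runs inside a transaction; index entries are removed along with the rows.
conn.execute("DELETE FROM sales WHERE sale_year < 1994")
after_delete = row_count(conn)

# Until the transaction is committed, the purge can still be undone.
conn.rollback()
after_rollback = row_count(conn)

# Repeat the purge and make it permanent.
conn.execute("DELETE FROM sales WHERE sale_year < 1994")
conn.commit()
after_commit = row_count(conn)
print(after_delete, after_rollback, after_commit)
```

The ability to undo the delete is what costs rollback space on large purges.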
Techniques for Archiving Data
• Export to dump file from tables
• Import to tables from dump file
• ALTER TABLE EXCHANGE partitions
[Diagram: Database, EXP, .dmp file, IMP]
Verdict
• Defined by business requirements
• Must be managed
Techniques for Archiving Data
Import and Export Utilities: The export utility enables you to move data from
tables to a dump file (called filename.dmp). The import utility can then read that
dump file and load the data back into the schema of the same or another user.
You can export in two ways:
• A conventional path export uses a conventional SELECT statement to extract table
data, which is held for a short time in an evaluation buffer. Once evaluated, it is
transferred to the dump file.
• A direct path export does not use the evaluation buffer.
ALTER TABLE: You can also exchange a partition of data with an empty table, drop
the now-empty partition, and export the table. Archive the exported table when you
have time.
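The export-then-import cycle can be sketched in outline as follows. This is not the Oracle Export or Import utility: it assumes Python's built-in sqlite3, csv, and tempfile modules, and a CSV file stands in for the .dmp file. The sales table, the helper functions, and the file name are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def export_rows(conn, cutoff_year, path):
    """Write rows older than cutoff_year to a flat file; return the row count."""
    rows = conn.execute(
        "SELECT sale_year, amount FROM sales WHERE sale_year < ? ORDER BY sale_year",
        (cutoff_year,),
    ).fetchall()
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


def import_rows(conn, path):
    """Load rows from the flat file into a sales table; return the row count."""
    with open(path, newline="") as f:
        rows = [(int(year), int(amount)) for year, amount in csv.reader(f)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return len(rows)


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_year INTEGER, amount INTEGER)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1992, 100), (1993, 110), (1998, 150)]
)

with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "sales_archive.csv")
    exported = export_rows(warehouse, 1994, dump_path)

    # Remove the rows from the warehouse only after the archive copy exists.
    warehouse.execute("DELETE FROM sales WHERE sale_year < 1994")
    remaining = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

    # Later, the archive can be loaded into another database for summaries.
    restore = sqlite3.connect(":memory:")
    restore.execute("CREATE TABLE sales (sale_year INTEGER, amount INTEGER)")
    imported = import_rows(restore, dump_path)
    restored = restore.execute(
        "SELECT sale_year, amount FROM sales ORDER BY sale_year"
    ).fetchall()

print(exported, remaining, imported, restored)
```

The ordering matters: the copy is written before the delete, so a failed export never loses data.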
Verdict
The method you employ depends upon your individual business requirement, although
the history model is a popular choice in the current warehousing environment. You
must ensure that someone in the data warehouse administration is responsible for
managing and tracking these changes.
Final Tasks
• Update metadata
  – ETT
  – User
• Publish data
  – Availability
  – Changes
  – Subject area basis
• Use database roles to prevent and allow access
[Diagram: Sources, Stage, Rules, Extract, Transform, Load, Publish, Query]
Final Tasks
Update the Metadata
Once your data has been loaded successfully, ensure that the metadata is updated. You
need to consider many aspects, including information about the processes themselves.
The most important aspect at this time is to ensure that the metadata reflects the new
information available. Users must be made aware of changes such as the validity and
date of the data, any new data available, revised or removed summaries, new
algorithms, and new meanings of values.
Publish Data
So that users are presented with a consistent view of the data, ensure that user access is
denied while the ETT processes are executing. Allow access only when all tasks are
complete, validation has occurred, and the metadata has been updated.
You may choose to do this on a subject area basis, user basis, or for the entire
warehouse. Again, like many other tasks, this is dependent upon your individual data
warehouse or data mart implementation.
Accessing the Refreshed Warehouse: With Oracle, using roles and granting and
revoking privileges is the simplest method of preventing and allowing access.
You may notify the users that the warehouse is available by internal e-mail
mechanisms. Alternatively, if you have strict service level agreements (SLAs) that
state users must have access from, say, 8:00 a.m. every working day, then notification
may not be needed. You could notify users only if the warehouse is unavailable
because of an unforeseen problem.
Publishing Data
• Control access using database roles
• 24-hour operation may be requested
• Compromise between load and access
• Consider
  – Staggering updates
  – Using temporary tables
  – Using separate tables
Publishing Data
The term “publishing data” is used to describe when the data is loaded and made
available to the users.
As a rule you prevent access to the data while the load process is active, to ensure that
the users are presented with an accurate view of data and summaries.
If service level agreements state that users require access virtually 24 hours a day, then
revoking and granting access as discussed is not appropriate. You need to consider
how you can perform the load action while still allowing access, and ensuring that the
data is as consistent as possible.
There are different techniques depending upon the availability needs of the users.
• Stagger the updates to the different subject areas. Update on different nights of the
week (say Tuesday and Wednesday) even though the revised source data might be
made available days earlier.
• Use temporary tables (that the users cannot access) for loading, filtering, and
summarizing. Make the database unavailable only for the short time it takes to
instantiate these as permanent objects.
• Load the data into a separate table and perform all the processing required. These
actions are invisible to the user. Then, when all tasks are complete, swap the
contents of the separate table into a database partition. The same technique is
employed for the indexes.
Note: With Oracle7, the partition is a view. In Oracle8, this is a partitioned table.
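The separate-table technique can be sketched as follows. The example assumes Python's built-in sqlite3 module in place of Oracle. SQLite has no partitions, so a table rename inside one transaction stands in for swapping the loaded table into a partition. All table names and values are hypothetical.

```python
import sqlite3

# isolation_level=None lets the script issue BEGIN and COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)

# The published table that users query.
conn.execute("CREATE TABLE sales_summary (region TEXT, total INTEGER)")
conn.execute("INSERT INTO sales_summary VALUES ('Europe', 100)")

# Load and process in a separate table that users do not query.
conn.execute("CREATE TABLE sales_summary_load (region TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO sales_summary_load VALUES (?, ?)",
    [("East Europe", 40), ("West Europe", 80)],
)

# Publish: swap the loaded table into place in one short transaction.
conn.execute("BEGIN")
conn.execute("ALTER TABLE sales_summary RENAME TO sales_summary_old")
conn.execute("ALTER TABLE sales_summary_load RENAME TO sales_summary")
conn.execute("DROP TABLE sales_summary_old")
conn.execute("COMMIT")

published = conn.execute(
    "SELECT region, total FROM sales_summary ORDER BY region"
).fetchall()
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(published, tables)
```

Users see either the old contents or the new contents, never a partly loaded table, and the unavailable window is limited to the swap itself.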
ETT Tool Selection Criteria
• Overlap with existing tools
• Availability of meta model
• Supported data sources
• Ease of modification and maintenance
• Required fine tuning of code
• Ease of change control
• Power of transformation logic
• Level of modularization
• Power of error, exception, resubmission features
• Intuitive documentation
• Performance of code
Selecting ETT Tools
Consultants in the field suggest that the selection criteria for ETT tools include the
following considerations:
• The overlap with existing tools used in the warehouse development, such as Oracle
Designer or other modeling tools
• The availability of the metamodel to other tools or the use of the metamodel from
other tools
• The breadth of data sources supported and target data coverage, such as flat files,
character formats, and database types
• The mechanism for, and ease of, defining and altering rules when a mixed set of
users, such as analysts and end users, may be managing ETT
• The requirement to maintain generated code manually
Some vendors advise there is no need to modify the generated code; however, you
may need to fine-tune it. Do you have the in-house expertise to modify the
generated code, for example, C or COBOL?
• The control of changes to transformation rule definitions and the ability to handle
development and production versions of transformation rules
• The depth, power, and ease of use for the transformation logic; for example,
conditional logic, data value filters, row and set-oriented processing, local
variables, and input parameters
• The reuse and modularization of existing transforms and filters
• Error reporting, rejected records, and resubmission capabilities
You need to be able to trap and correct bad data before it is loaded into the
warehouse and report corrections to the source system afterward.
• The self-documenting ability
If the tool is text-based, and not intuitive to navigate, you are going to find it
difficult to get the entire picture of the processing performed within the warehouse.
A graphical design tool is desirable.
• The performance of generated code
ETT Tool Selection Criteria
• Activity scheduling and sophistication
• Metadata generation
• Learning curve
• Flexibility
• Supported operating systems
• Cost
Selecting ETT Tools (continued)
• Activity scheduling
Can the tool schedule actions to happen and retry if the source or target is not
available? Can it report what it has done?
• Scheduling sophistication
Can it schedule based on time of day, time since last try, time since last success,
and time period regardless of last attempt?
• Metadata generation by the transformation tool
Generated metadata should be intuitive and easily understood by the business user.
• The learning curve of the tool
• The flexibility of the tool
• The operating system under which the tool runs
Is it supported on all the platforms that you will use for the ETT process?
• Cost
Transportation Tools
• Informatica: OpenBridge
• Oracle: SQL*Loader, Gateways, PL/SQL, Precompilers
• Platinum Technology: InfoPump, Platinum Info Transport
Replication Server Utilities
• Oracle: Symmetric and Heterogeneous Replication
Transportation Tools
WTI Partner                 Product
Informatica Corp.           OpenBridge
Oracle                      SQL*Loader: Direct Path, Direct Path in Parallel
                            Transparent and Procedural Gateways
                            PL/SQL
                            Precompilers
Platinum Technology, Inc.   InfoPump
                            Platinum Info Transport
Replication Server Utilities
WTI Partner                 Product
Oracle                      Symmetric and Asymmetric Replication
                            Heterogeneous Replication
Gateways and Middleware
• Brio Technology: DataPrism
• Informatica Corporation: OpenBridge
• Information Builders: EDA/SQL
• Oracle: Gateways
• Platinum Technology: InfoHub
• Prism: Prism Manager
• Software AG: Entire Transaction Propagator
Gateways and Middleware
WTI Partner                     Product
Brio Technology                 DataPrism
Informatica Corp.               OpenBridge
Information Builders, Inc.      EDA/SQL
Oracle                          Oracle Open Gateways
                                Procedural Gateways
                                SQL*Loader
Platinum Technology, Inc.       InfoHub
Prism                           Prism Manager
Software AG of North America    Entire Transaction Propagator
Summary
This lesson discussed the following topics:
• Capturing changed data
• Applying the changes
• Purging and archiving data
• Publishing the data, controlling access, and automating processes
• Identifying tools for transporting data into the warehouse
Practice 13-1 Overview
This practice covers the following topics:
• Identifying a series of statements as true or false
• Answering a series of questions
Practice 13-1
1 Identify whether the following statements are true or false.
Statement                                                                    True   False
The data refresh cycle is determined primarily by information
technology staff input.
The load window is the time that the IT group has dictated the data
warehouse is available to the users for access.
Fact data frequently changes.
Dimension data infrequently changes.
2 Name four different techniques for capturing the changes to operational data that is
to be loaded into the warehouse.
_____________________
_____________________
_____________________
_____________________
3 Answer the following questions about updating dimension data.
a What method of updating dimension data would you employ if you wanted to
keep old and new records?
b What relationship would that map to in an entity relationship model?
4 What server technique can be used to prevent and allow access to data in the
warehouse after refresh?
Lesson 14: Leaving a Metadata Trail
Overview
[Slide: course roadmap with the blocks Defining DW Concepts & Terminology;
Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing
Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building
the Warehouse); Analyzing User Query Needs; Managing the Data Warehouse; Supporting
End User Access; Project Management (Methodology, Maintaining Metadata)]
Objectives
After completing this lesson, you should be able to do the following:
• Define warehouse metadata, its types, and its roles in a warehouse environment
• Develop a metadata strategy
• List tools for managing metadata
• Describe in detail each type of warehouse metadata
• Describe the Oracle Common Warehouse Metadata architecture (CWM)
Overview
Metadata has already been referenced a number of times in this course. It is critical to
every phase of warehouse design and development. This lesson examines the role of
warehouse metadata in greater detail.
Note that the “Project Management (Methodology, Maintaining Metadata)” block is
highlighted in the overview slide on the facing page.
Objectives
After completing this lesson, you should be able to do the following:
• Define metadata, its types, and the main roles of metadata in a warehouse
environment
• Describe the challenges of managing warehouse metadata
• List tools for managing metadata
• Describe the Oracle Common Warehouse Metadata architecture (CWM)
Defining Warehouse Metadata
• Data about warehouse data and processing
• Vital to the warehouse
• Used by everyone
Defining Warehouse Metadata
Data About Data
Metadata is “data about data.” Warehouse metadata is descriptive data about
warehouse data and the processes used in creating the warehouse.
Warehouse metadata contains detailed descriptions of the location, structure, and
meaning of data. It describes keys and indexes of the data. It contains mapping
information, and it documents the algorithms and business rules used to transform and
summarize data. Metadata is used throughout the warehouse, from the extraction stage
through the access stage.
Vital to the Warehouse
A warehouse with poor metadata is analogous to a filing cabinet filled with folders
stored in no particular order. It is very difficult to find your information in the cabinet.
Used by Everyone
Warehouse metadata is used directly or indirectly by everyone involved in creating,
maintaining, or using the warehouse: database administrators, analysts, designers, and
users. Warehouse metadata answers the following types of questions:
• What information is available, by subject area, and when did we start collecting
that data?
• How was this summarization created?
• What queries are available to access the data?
• What business assumptions have been made?
• How do I find the data I need?
• How old is the data?
• What does that value mean?
The Key to Understanding Warehouse Information
• Specifies data location
• Manages data
• Aids use of information
• Describes the data
• Documents the development process
• Provides a record of changes
• Records enhancements over time
Key to Understanding Warehouse Information
Metadata is the component that holds all the information about the data in the
warehouse, and presents it as information to the user.
Data becomes and provides information if, and only if, you:
• Have the data
• Know you have it
• Know where it is
• Can access the data
• Can trust the data
Metadata is the key to understanding the warehouse. Metadata helps you locate,
manage, and use warehouse information by:
• Specifying the location of data
• Managing data
• Aiding the use of information
• Describing the data
• Documenting the development process
• Providing a record of changes
• Recording enhancements over time
Metadata Users
[Slide diagram labels: IT developers; End users; Metadata repository; ETT; End user;
Operational; Warehouse]
Types of Metadata
• End user:
   – Key to a good warehouse
   – Navigation aid
   – Information provider
• ETT:
   – Maps structure
   – Source and target information
   – Transformations
   – Context
• Operational:
   – Load, management, scheduling processes
   – Performance
Metadata Users
In the warehouse, metadata is employed directly or indirectly by all warehouse users
for many different tasks.
End Users The decision support analyst (or user) uses metadata directly. The user
does not have the high degree of knowledge that the IT professional has, and metadata
is the map to the warehouse information. One measure of a successful warehouse is
the strength and ease of use of end-user metadata.
Developers For the developer, metadata contains information on the location,
structure, and meaning of data, information on mappings, and a guide to the
algorithms used for summarization between detail and summary data.
Types of Metadata
End User Metadata End-user metadata describes the location and structure of data
for user access. It describes data volumes and algorithms. Essentially, this is the floor
plan that the knowledge worker uses to navigate through and around the data.
ETT Metadata Extraction, transformation, and transportation metadata (sometimes
called warehouse metadata or ETT metadata) maps the structure of source systems and
how the data is to be transformed into its new format for the warehouse. It contains all
the rules for extracting, scrubbing, summarizing, and transporting data. This is often
the most difficult metadata model to construct.
Operational Metadata Operational metadata is used by the load, management, and
access processes for scheduling data loads or end-user access. It contains information
about housekeeping activities, statistics of table usage, and information about every
aspect of performance.
Note: The Oracle Method has a specific process for metadata management. End-user
metadata is referred to as business metadata.
Developing a Metadata Strategy
• Define a strategy to ensure high-quality metadata useful to users and developers.
• Primary strategy considerations:
   – Define goals and intended use.
   – Identify target users.
   – Choose tools and techniques.
   – Choose the metadata location.
   – Manage the metadata.
   – Manage access to the metadata.
   – Integrate metadata from multiple tools.
   – Manage change.
Defining Metadata Goals and Intended Usage
• Define clear goals.
• Identify requirements.
• Identify intended usage.
Developing a Metadata Strategy
Like every other aspect of the data warehouse implementation, metadata should be the
subject of a well-considered, well-planned strategy. You must ensure that the metadata
is of a high quality, provides the right information to users and developers, and is able
to take into account the various tools that employ metalayers. Integrating these layers
is critical.
Primary Considerations
Among many other considerations, you need to resolve these key issues for the
strategy:
• Define the goals and intended use of the warehouse metadata.
• Identify the target users of warehouse metadata.
• Choose tools and techniques for creating and managing metadata.
• Choose the metadata location.
• Manage the metadata.
• Manage access to the metadata.
• Integrate multiple sets of metadata from different tools.
• Manage changes to metadata.
Defining Metadata Goals and Intended Usage
Identify the intention of the metadata you develop. Outline main requirements such as
maintaining history, context, and algorithms.
Identifying Target Metadata Users
• Who are the metadata users?
   – Developers
   – End users
• What information do they need?
• How will they access the metadata?
Choosing Metadata Tools and Techniques
• Tools
   – Data modeling
   – ETT
   – End-user query and analysis
• Database schema definitions
• COBOL copybooks
• Middleware tools
Identifying Target Metadata Users
Consider who, among both developers and end users, is to access metadata. What
information do they need? Determine how they will access the metadata.
Choosing Tools and Techniques
Data Modeling Tools These tools are also known as computer-aided software
engineering (CASE) tools. Oracle’s data modeling tool is Designer.
Some of these tools are better than others at physically modeling metadata. Consider
using a tool that either is specifically designed to model warehouse features or is
extensible. For example, can the tool model a star or a snowflake?
ETT Tools Tools for extracting, transforming, and transporting data into a
warehouse also generate metadata. These tools are expensive purchases, and may not
be employed for the first iteration during development. However, these tools have the
advantage of being able to create and maintain a metadata layer. The tool itself must
have all the information to take source data to the warehouse, so it is logical that the
tool itself contains this layer.
End User Tools Some tools for query and analysis allow the administrator to create
a metadata layer, which describes the structure and content of the data warehouse.
Tool metadata raises a maintenance issue for the administrator: each query tool
requires its own metadata layer.
Database Schema Definitions Database schema definitions in a relational database
management system offer another potential source of metadata. In an Oracle
environment this is the Data Dictionary, which can be extended and enhanced.
Most dictionaries of database contents, including the Oracle Data Dictionary, are
limited in their immediate value as a metadata tool. Check the extending and
enhancing capabilities of these dictionaries.
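As a minimal sketch of treating schema definitions as a metadata source, the following Python example reads column definitions from a database catalog. It uses SQLite's built-in catalog through the standard sqlite3 module as a stand-in so that it runs anywhere; in an Oracle environment, comparable information comes from Data Dictionary views such as USER_TAB_COLUMNS. The product table and its columns are hypothetical.

```python
import sqlite3

def describe_table(conn, table_name):
    """Return (column name, declared type, nullable) for each column."""
    # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return [(name, col_type, not notnull)
            for _cid, name, col_type, notnull, _dflt, _pk in rows]

# Hypothetical warehouse table, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (prodid INTEGER NOT NULL, ware_loc INTEGER, weight REAL)"
)

columns = describe_table(conn, "product")
for name, col_type, nullable in columns:
    print(name, col_type, "NULL allowed" if nullable else "NOT NULL")
```

The catalog records structure only. Business meaning, ownership, and derivation rules still have to be added on top of it, which is why the extensibility of the dictionary matters.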
Other Techniques Less-common sources of metadata include:
• File definitions stored in COBOL copybooks
• Middleware tools
Choosing the Metadata Location
• Usually the warehouse server
• Possibly on operational platforms
• Desktop tool with metalayer
[Slide diagram labels: External sources; Operational data sources; Metadata
repository; Warehouse]
Managing the Metadata
• Managed by the metadata manager
• Maintained by the metadata architect
• Standards produced by the metadata architect
[Slide diagram labels: External sources; Operational data sources; Metadata
repository; Warehouse]
Choosing the Metadata Location
For every process and product employed in the data warehouse environment, metadata
exists. Where it is stored is product-specific. The decision about where to place the
metadata is often determined by the tool you use to create it.
If you are using a relational database management system, then by default the
metadata resides in the database and usually on the warehouse server. This is the
preferred method. You may also locate the metadata in a separate database on another
machine.
Some ETT and query tools have their own metalayer. Where this is the case you need
to ensure that each metalayer can communicate with the others.
Managing the Metadata
Management Given the critical importance of metadata within the warehouse
environment, it must be subject to strict control and management. Metadata is such a
vital component in your warehouse implementation that someone should be
responsible for managing and maintaining it.
It is also important to ensure that creation of or changes to metadata are agreed upon
with formal sign-off.
Maintenance A metadata architect is usually responsible for defining the strategy
and implementing metadata. This person is primarily responsible for ensuring that
metadata remains up-to-date and consistently reflects any changes within the business
infrastructure.
If there are different metalayers, the architect must control integration of the metadata
among products and tools.
Standards As with any development project, standards are critical. Determine
standards for every aspect of metadata from simple naming conventions, to versioning
requirements, to documenting complex algorithms.
Standards for metadata are emerging within the industry. It is worth monitoring the
changes that vendors are considering, as well as the collaborative efforts among
large computing companies that are seeking to define standards.
Integrating Multiple Sets of Metadata
• Multiple tools may generate their own metadata.
• There are many metalayer integration issues.
• Metadata exchangeability is desirable.
Managing Changes to Metadata
• Different types of metadata have different rates of change.
• Consider metadata changes resulting from refresh cycles.
Integrating Multiple Sets of Metadata
Each of the tools you use in your warehouse environment might generate its own set of
metadata. One of the biggest problems with metadata is integrating all of the different
layers.
Some vendors provide tools that can exchange metadata. For example, you can take
metadata from Oracle Designer, populate it using Prism Directory Manager, and use it
directly in Oracle Discoverer.
Later in this lesson, we examine how Oracle Common Warehouse Metadata (CWM)
addresses the sharing of metadata among Oracle tools.
Managing Changes to Metadata
Metadata changes at different rates according to the type of data stored. For example,
models of operational and warehouse databases might remain static for a substantial
period of time; however, metadata that maintains information about the warehouse
data changes frequently.
Each refresh cycle brings in more data. With it, summaries may change, dimensions
may change, and so on.
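One way to picture metadata that changes with every refresh is a simple refresh log. The following Python sketch is illustrative only; the RefreshLog class, the table name, and the row counts are hypothetical. It records one entry per refresh cycle so that questions such as "How old is the data?" can be answered from metadata.

```python
from datetime import date

class RefreshLog:
    """Keeps one metadata record per refresh cycle for each table."""

    def __init__(self):
        # table name -> list of (refresh date, rows loaded)
        self._entries = {}

    def record(self, table, refresh_date, rows_loaded):
        self._entries.setdefault(table, []).append((refresh_date, rows_loaded))

    def last_refresh(self, table):
        """Date of the most recent refresh recorded for the table."""
        return max(refresh_date for refresh_date, _ in self._entries[table])

    def total_rows(self, table):
        """Total rows loaded into the table across all recorded refreshes."""
        return sum(rows for _, rows in self._entries[table])

log = RefreshLog()
log.record("sales_fact", date(1999, 5, 1), 12000)
log.record("sales_fact", date(1999, 5, 2), 11500)
print(log.last_refresh("sales_fact"), log.total_rows("sales_fact"))
```

Structural metadata, such as the table definition itself, would not need a new entry on each cycle; only the content-related entries grow.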
Examining Types of Metadata
• ETT metadata
• End user metadata
[Slide diagram labels: Metadata repository; End user; ETT; Warehouse]
ETT Metadata
• Business rules
• Source tables, fields, and key values
• Ownership
• Field conversions
• Encoding and reference table
• Name changes
• Key value changes
• Default values
• Logic to handle multiple sources
• Algorithms
• Time stamp
[Slide diagram labels: External sources; Operational data sources; Extraction;
Staging file]
Examining Types of Metadata
Now we will examine more closely the different types of warehouse metadata. This
includes ETT metadata generated during warehouse development, as well as end-user
metadata.
ETT Metadata
ETT metadata defines how data from the physical level in the source system maps to
the physical level in the data warehouse. ETT metadata also holds:
• The business rules that are applied to the warehouse data
• Names of the source tables, source fields, and source key values
• Information about the owner of the source data
• The rules that are applied to field conversions on a field-by-field basis
• Encoding and reference table conversions
• Field name and key value changes
• Default values assigned to NULL fields
• Logic to extract records from multiple source systems and create records (or a
single record) for the load process
• Algorithms that create derived data, for example:
Total_Sales / Unit_Sold = Selling_Price
• Time-stamp details
You have seen how complex the ETT process is, and you can now appreciate the
importance of keeping a record of exactly what is happening, to which data and when,
what the grain is, what is derived, how data is summarized, where it is sourced, and
what its target is.
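The field-level record keeping described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the format used by any Oracle tool: the FieldMapping structure, the ORDERS source table, the field names, and the code table are all hypothetical. Each entry records a source field, its new name in the warehouse, the default assigned when the source value is NULL, and any encoding conversion.

```python
from dataclasses import dataclass, field

@dataclass
class FieldMapping:
    """One ETT metadata entry: how a source field becomes a warehouse field."""
    source_table: str
    source_field: str
    target_field: str
    default: object = None                          # assigned when the source is NULL
    code_table: dict = field(default_factory=dict)  # encoding conversions

    def apply(self, source_row):
        value = source_row.get(self.source_field)
        if value is None:
            value = self.default
        # Convert the code if a conversion is recorded; otherwise keep the value.
        return self.code_table.get(value, value)

# Hypothetical mappings for a single source table.
mappings = [
    FieldMapping("ORDERS", "cust_no", "customer_id"),
    FieldMapping("ORDERS", "sex", "gender", default="U",
                 code_table={"1": "M", "2": "F"}),
]

def transform(source_row):
    """Build a warehouse row by applying every mapping to a source row."""
    return {m.target_field: m.apply(source_row) for m in mappings}

row = transform({"cust_no": 1042, "sex": None})
print(row)
```

Because the rules live in data rather than in program logic, the same entries can drive the transformation and document it.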
Extraction Metadata
• Space and storage requirements
• Source location information
• Diverse source data
• Access information
• Security
• Contacts
• Program names
• Frequency details
• Failure procedures
• Validity checking information
[Slide diagram labels: External sources; Operational data sources; Extraction]
Extraction Metadata
Extraction metadata contains:
• Space requirement information
• Storage frequency and duration details
• Source location information such as hardware platform information, gateway
information, operating system, file system, database, origin and destination
information, and loading rules
• Diverse system information with details of the source type, such as whether the
data is production, internal, external, or archive, and structure information such as
file type, name, field type, and data granularity
• Access information such as alias names, versions, relationships, and data volatility
• Security information such as table owners, data owners, authorization levels, and
audit trail information
• Source data contact and owner details; for example, their names, telephone
numbers, and e-mail identifiers
• Extraction program names
• Temporary storage details, such as the name of the storage file and the procedure
for removing storage files
• Extraction frequency details
• Extraction failure procedures, with contingency plans and mechanisms for handling
a failed extract
• Extraction validity check information, including the procedures to implement,
expected results, procedures to follow should the validity check fail, and names of
the people to contact if the check fails
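A few of these items can be sketched as a single record with a validity check. The Python example below is a hedged illustration; the ExtractMetadata fields, the row-count thresholds, and the program name are assumptions made for the example rather than a prescribed layout.

```python
from dataclasses import dataclass

@dataclass
class ExtractMetadata:
    """Extraction metadata for one extract program (all values hypothetical)."""
    program_name: str
    source_system: str
    frequency: str
    min_rows: int   # validity check: smallest acceptable extract
    max_rows: int   # validity check: largest acceptable extract
    contact: str    # person to notify if the check fails

def check_extract(meta, row_count):
    """Return (ok, message) for one extraction run."""
    if meta.min_rows <= row_count <= meta.max_rows:
        return True, f"{meta.program_name}: {row_count} rows, within expected range"
    return False, (f"{meta.program_name}: {row_count} rows outside expected range; "
                   f"notify {meta.contact}")

meta = ExtractMetadata("extract_orders", "order_entry", "daily",
                       1000, 50000, "source data owner")
ok, message = check_extract(meta, 12)
print(ok, message)
```

A real validity check would usually compare more than row counts, but the principle is the same: the expected result and the follow-up procedure are stored with the extract definition.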
Transformation Metadata
• Duplication routines
• Exception handling
• Key restructuring
• Grain conversions
• Program names
• Frequency
• Summarization
[Slide diagram labels: External sources; Operational data sources; Transform;
Staging file]
Transportation Metadata
• Method of transfer
• Frequency
• Validation procedures
• Failure procedures
• Deployment rules
• Contact information
[Slide diagram labels: External sources; Operational data sources; Transform;
Staging file; Transport; ETT; Metadata repository; Warehouse]
Transformation Metadata
Transformation metadata contains:
• Duplication routines for elimination, consolidation, ordering, and summarization
of data
• Exception handling and validation procedures
• Key restructuring rules
• Granularity conversions
• Transformation program names and locations
• Frequency of the transformation
• Summarization procedures
Transportation Metadata
Transportation metadata contains:
• Data-transfer methods
• Frequency of transportation
• Validation procedures
• Failure procedures
• Rules for deployment
• Contact information, in case of any issue with the data or the movement of data
End-User Metadata
739516 1816 666 15 17.62
• Need to know the context of the table queried
• Associate the metadata description
• Analogous to Oracle Data Dictionary views
[Slide diagram labels: Metadata repository; End user; Warehouse]
Example of End User Metadata
Table Name   Column Name   Data     Meaning
Product      Prodid        739516   Unique identifier for the product
Product      Valid_date    01/97    Last refresh date
Product      Ware_loc      1816     Warehouse location number
Product      Ware_bin      666      Warehouse bin number
Product      Code          15       The color of the product; please
                                    refer to table COL_REF for details
Product      Weight        17.62    Packed shipping weight in kilograms
End-User Metadata
If the following data is warehouse data, how much can you deduce?
739516 0197 1816 666 15 17.62
You can deduce nothing tangible in this data other than a series of numbers. It could
represent product codes, map coordinates, or employee salaries. The only way to
deduce information from this data is to know the context of the table you are querying.
For example, if you are querying the PRODUCT table and the PRODUCT CODE
column, metadata may show the information as follows:
Table Name   Column Name   Data     Meaning
Product      Prodid        739516   Unique identifier for the product
Product      Valid_date    01/97    Last refresh date
Product      Ware_loc      1816     Warehouse location number
Product      Ware_bin      666      Warehouse bin number
Product      Color_code    15       The color of the product; please refer to
                                    table COL_REF for details
Product      Weight        17.62    Packed shipping weight in kilograms
When you associate the data with its metadata description, the data becomes
information.
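The association between data and its description can be sketched in a few lines of Python. The column names and meanings come from the example above; the describe function and the list layout are illustrative assumptions, not part of any tool.

```python
# End-user metadata for the PRODUCT table, taken from the example above.
PRODUCT_METADATA = [
    ("Prodid", "Unique identifier for the product"),
    ("Valid_date", "Last refresh date"),
    ("Ware_loc", "Warehouse location number"),
    ("Ware_bin", "Warehouse bin number"),
    ("Color_code", "The color of the product; refer to table COL_REF for details"),
    ("Weight", "Packed shipping weight in kilograms"),
]

def describe(raw_values, metadata):
    """Pair each raw value with its column name and meaning."""
    if len(raw_values) != len(metadata):
        raise ValueError("row does not match the metadata description")
    return [(column, value, meaning)
            for value, (column, meaning) in zip(raw_values, metadata)]

raw_row = "739516 0197 1816 666 15 17.62".split()
described = describe(raw_row, PRODUCT_METADATA)
for column, value, meaning in described:
    print(f"{column:<12}{value:<8}{meaning}")
```

Without the metadata list, the same six values are only a series of numbers.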
More End-User Metadata Information
• Location of fact and dimensions
• Availability
• Description of contents
• Algorithms for derived and summary data
• Owners of data and telephone number
[Slide diagram labels: Metadata repository; End user; Warehouse]
More End-User Metadata Information
The user never accesses end-user metadata directly. This metadata is viewed from the
end user’s tool and is used to navigate around the data.
Using this metadata, users can see the data available in the warehouse environment
and establish the meaning of elements within the warehouse.
User metadata describes:
• The physical location of fact and dimension data.
• The availability of the data. Not all data components of the warehouse are
available to every user. Some facts may be sensitive to specific user groups.
• The exact description of the contents and business algorithms used to create
summary data. Users should never be in a position where they are guessing how a
summary has been calculated.
• How derived data has been created, the source data, and any algorithms used.
• Data ownership details, so that if there are any problems with the data content, the
user can ask the appropriate person about the data and identify or rectify the
problems found. These details must include a telephone number, fax number,
or e-mail address.
Data ownership details are possibly the most important aspect of end-user metadata. If
there is an issue with the data, it must be resolved quickly and appropriately.
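The items listed above can be pictured as one record per data element. The following is a minimal sketch only, assuming a simple in-memory structure; the class names, field names, and sample values are hypothetical and do not describe how any particular repository stores end-user metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Owner:
    """Person to contact about a data element (hypothetical structure)."""
    name: str
    telephone: str = ""
    fax: str = ""
    email: str = ""

    def has_contact(self) -> bool:
        # At least one way to reach the owner must be recorded.
        return bool(self.telephone or self.fax or self.email)


@dataclass
class EndUserMetadata:
    """One end-user metadata entry for a fact, dimension, or summary."""
    element: str        # business name shown in the query tool
    location: str       # physical location of the data
    description: str    # exact description of the contents
    algorithm: str      # how derived or summary data is calculated
    source: str         # source data for derived values
    owner: Owner
    allowed_groups: set = field(default_factory=set)  # availability

    def available_to(self, group: str) -> bool:
        return group in self.allowed_groups


entry = EndUserMetadata(
    element="Monthly sales total",
    location="warehouse.sales_month_summary",   # illustrative name only
    description="Total sales amount per product per calendar month",
    algorithm="SUM(sale_amount) grouped by product and month",
    source="warehouse.sales_fact",              # illustrative name only
    owner=Owner(name="Sales data steward", email="steward@example.com"),
    allowed_groups={"sales_analysts", "finance"},
)
```

With a record like this, a query tool could check `entry.available_to("finance")` before showing the element, and display `entry.algorithm` and the owner's contact details next to the summary value.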
Historic Context of Data
• Supports change history
• Maintains the context of information
[Slide diagram: Operational, Warehouse, Metadata repository; Structure, Content; years 94–98]
Types of Context
• Simple:
– Data structures
– Naming conventions
– Metrics
• Complex:
– Product definitions
– Markets
– Pricing
• External:
– Economic
– Political
[Slide diagram: Warehouse; years 94–96]
Historic Context of Data
Historic data often has business rules and algorithms applied that are different from
those applied to current data.
In the operational environment, there is only one definition of the database structure at
any time. In the warehouse environment, data definitions change over a period of time.
It is important to record the date of each change to the data, along with the names, key
values, default values, and algorithms affected, so that knowledge workers can analyze
the data in the correct context. This ensures that you can identify and understand the
differences in the context of the data in historical files.
For example, you may store data for 1994–1996 offline. Suppose you want to store
1997 data online. The default value for an amount field changed from a series of 9s to
0s in 1995. You can run a query to identify amounts between 1994 and 1997, but if you
do not understand when and how default amounts were recorded, you may not be able
to explain or understand why both 9s and 0s are stored, or realize the impact that the
change has on calculations or reports.
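The default-value example above can be sketched in code. This is only an illustration, assuming the change history is kept as a list of (effective year, default value) pairs; the variable names, the five-digit width of the 9s, and the sample rows are assumptions, not part of the course material.

```python
# Hypothetical change history for one amount field: (effective_year, default).
# The 1995 change from 9s to 0 follows the example in the text; the exact
# values are assumptions.
AMOUNT_DEFAULT_HISTORY = [
    (1994, 99999),   # before the change, a series of 9s meant "no amount"
    (1995, 0),       # from 1995 onward, 0 means "no amount"
]


def default_for_year(year, history=AMOUNT_DEFAULT_HISTORY):
    """Return the default value that was in effect for the given year."""
    current = None
    for effective_year, value in sorted(history):
        if effective_year <= year:
            current = value
    if current is None:
        raise ValueError(f"no default recorded for {year}")
    return current


def total_real_amounts(rows):
    """Sum amounts, skipping values that are only the default for their year."""
    return sum(amount for year, amount in rows
               if amount != default_for_year(year))


# Sample (year, amount) rows spanning the change.
rows = [(1994, 250), (1994, 99999), (1995, 0), (1996, 400), (1997, 0)]
naive_total = sum(amount for _, amount in rows)
correct_total = total_real_amounts(rows)
```

Here the naive total is inflated by the 1994 placeholder value, while the total computed with the recorded change history counts only the real amounts. Without the metadata, a user has no way to tell which of the two figures is right.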
Another example arises with products such as personal computers that had very few
components when they were first available. Consider the changes they have gone
through and the many components they contain today. There is a rapid and voluminous
history of change.
Types of Context
The context of data in the warehouse may be:
• Simple contextual information such as data structures, data coding, naming
conventions, and data metrics
• Complex contextual information such as product definitions, market territories,
pricing, packaging, and rule changes
• External contextual information such as economic forecasts, political information,
and competitive information
Additional Metadata Content and Considerations
• Summarization algorithms
• Relationships
• Stewardship
• Permissions
• Pattern analysis
• Reference tables
Additional Metadata Content and Considerations
Some of these points may have already been mentioned.
Summarization Algorithms You have seen that the warehouse contains fully
detailed fact records and summary records that are created according to predefined
algorithms. The meaning of the summaries is maintained in the metadata.
Relationships Relationships show how tables are related, their constraints and rules,
and the cardinality of data. This relationship information is maintained in the
metadata. This information is documented along with ownership information and text
descriptions of tables and keys.
Stewardship Metadata must identify the originator of data. Bear in mind that the
data in the warehouse has come from many different source systems, with different
suppliers, different owners, and different transformation issues.
Permissions Metadata should maintain, for each record, information about who can
access the record and who is authorized to grant permissions on it.
Access Pattern Analysis Metadata should be able to record frequently accessed
data, in order to tune and optimize performance accordingly. In turn, this may identify
data accessed infrequently or not at all. You should remove data and summaries that
are not accessed.
Reference Tables The contents of these tables must be monitored and maintained
with information that relates to their effective date.
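As an illustration of the access pattern analysis described above, the following minimal sketch counts how often each table or summary is read and lists those never read at all. The access log format, the table names, and the catalog are hypothetical; a real warehouse would take this information from its own query logs.

```python
from collections import Counter

# Hypothetical access log: one entry per query, naming the table it read.
access_log = [
    "sales_fact", "sales_month_summary", "sales_fact",
    "sales_month_summary", "sales_month_summary", "product_dim",
]

# Hypothetical catalog of every table and summary held in the warehouse.
catalog = {
    "sales_fact", "sales_month_summary", "sales_week_summary", "product_dim",
}

access_counts = Counter(access_log)

# Most frequently accessed objects: candidates for tuning.
most_accessed = access_counts.most_common(2)

# Objects never accessed in the logged period: candidates for removal.
never_accessed = sorted(catalog - set(access_counts))
```

In this sample, the monthly summary is the most heavily used object and the weekly summary is never read, so the weekly summary would be the one to review for removal.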
Metadata Management Tools
• Carleton
• Evolutionary Technologies
• Hewlett Packard
• Informatica
• Information Advantage
• Oracle Designer
• Platinum Technology
• Prism Solutions
• Sagent
Metadata Management Tools
WTI Partner                  Product
Carleton Corp.               Carleton Passport-Metadata
Evolutionary Technologies    ETI Repository (ObjectStore)
Hewlett Packard              IW Guide
Informatica                  Informatica Repository
Information Advantage        Meta Agent
Oracle                       Designer, Data Mart Suite, OADW/Warehouse Builder
Platinum Technology, Inc.    Data Dictionary/Solution Repository (DD/S), Data Shopper, DB Excel, Platinum Repository
Prism Solutions              Prism Directory Manager
Sagent
There are two categories of metadata management tools:
• Generic repository tools, for managing enterprisewide metadata, such as:
– Data Shopper from Platinum Technology
– Data Dictionary from Brownstone/Platinum
– Manager Link from Manager Software Products
• Tools specifically for data warehouses and data marts, such as:
– Prism Directory Manager from Prism Solutions
– Meta Agent from Information Advantage
– Passport from Carleton Corporation
– SmartData Warehouse from Intersolv
Common Warehouse Metadata
[Slide diagram labels: Operational data, ERP data, External data; Data integration; Warehouse; Marts; Information delivery; Report, Query, Analyze, Mine; Analytic applications; Metadata; Design and Administration]
Common Warehouse Metadata Future
[Slide diagram labels: Warehouse Builder, Discoverer, Oracle8i Server, Express Server; Common metadata]
Common Warehouse Metadata
Common Warehouse Metadata (CWM) is Oracle’s open standard for data warehousing
metadata. CWM incorporates both technical and business metadata and covers all
aspects of warehousing. CWM will enable tighter integration of metadata among
Oracle’s products as well as across industry-leading tools from Oracle partners,
resulting in reduced implementation complexity and greater user productivity.
To enable truly open data warehouse functionality, Oracle submitted a Request for
Proposal for a Common Warehouse Metadata Interchange standard to the Object
Management Group (OMG). The Common Warehouse Metadata Interchange (CWMI)
standard will enable the interchange of warehouse metadata among data management
and analysis tools, and among warehouse metadata repositories.
One Meaning
Oracle acquired One Meaning, a company specializing in metadata. One Meaning’s
metadata technology provides the means for metadata interoperability and transfer,
reduces the cost of managing information resources, and enhances the value of stored
proprietary information. Oracle’s metadata strategy will provide essential integration
and continuity, and add ongoing value to data warehousing implementations.
Summary
This lesson discussed the following topics:
• Definitions
• Integration
• Contents
• Storage
• Creation
• Selection
• Tools
Summary
This lesson discussed the following topics:
• The definitions of the two main types of metadata
• The problems associated with metadata in the warehouse
• Metadata contents
• How metadata might be created
• Where metadata may be stored in a warehouse environment
• Selection criteria for metadata management tools
• A list of metadata management tools available from WTI partners and Oracle
Practice 14-1 Overview
This practice covers the following topic:
Answering a series of short questions
Practice 14-1
1 Why is metadata important to the following people?
a Users who are accessing the data warehouse
________________________________________________________
________________________________________________________
b IT staff developing ETT routines
________________________________________________________
________________________________________________________
2 Name two techniques you might employ to create metadata.
________________________________________________________
________________________________________________________
3 Name two roles within the data warehouse development team who have
responsibility for metadata.
________________________________________________________
________________________________________________________
4 What is the issue with integration and metadata?
________________________________________________________
________________________________________________________
________________________________________________________
5 What is important about the context of data?
________________________________________________________
________________________________________________________
6 Name the Oracle tools you may use to develop metadata.
________________________________________________________
Lesson 15: Supporting End-User Access
Overview
[Course road map: Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building the Warehouse); Analyzing User Query Needs; Supporting End-User Access (highlighted); Managing the Data Warehouse; Project Management (Methodology, Maintaining Metadata)]
Objectives
After completing this lesson, you should be able to do the following:
• Describe the importance of business intelligence
• Identify multidimensional query techniques
• Identify where data mining might be employed in a warehouse environment
• Identify data mining tools
Overview
The previous lesson covered leaving a metadata trail. This lesson discusses supporting
end-user access. Note that the “Supporting End User Access” block is highlighted in
the course road map on the facing page.
Specifically, this lesson introduces the concept of business intelligence. The lesson
discusses the discovery model used by mining tools, and the reasons enterprises are
looking at data mining solutions for discovery of information.
Objectives
After completing this lesson, you should be able to do the following:
• Describe the importance of business intelligence
• Identify multidimensional query techniques
• Identify where data mining might be employed in a warehouse environment
• Identify data mining tools
What Is Business Intelligence?
“Business Intelligence is the process of transforming
data into information and through discovery
transforming that information into knowledge.”
Gartner Group
Business Intelligence
The purpose of business intelligence is to convert the
volume of data into value for the end users.
[Slide diagram: four stages (Data, Information, Knowledge, Decision) plotted against Volume and Value]
Business Intelligence
Companies require business intelligence to direct business process improvement and
monitor time, cost, quality, and control.
Definition
Howard Dresner, an analyst with the Gartner Group, defines business intelligence as a
process of turning data into information and, through iterative discoveries, turning that
information into business intelligence. The key is that business intelligence is a
process: cross-functional, in line with current management thinking, and not
presented in IT terms.
Purpose of Business Intelligence
The purpose of business intelligence is to convert large volumes of data into information,
linking bits of information together within a decision context that turns it into
knowledge that can be used to aid decision making.
This can be accomplished through the use of data access tools and techniques that use
organized collections of data, systems, and applications by which organizations gather
and interpret relevant information about the business and turn it into highly
quantifiable plans, policies, procedures, and metrics.
The value chain begins with the data resource. Data is defined as facts and figures.
Information is data processed and interpreted into a meaningful framework. It is a set
of data in context that is relevant to one or more people at a point in time or for a
period of time.
Knowledge refers to the meaning and understanding that result when users process
information. For knowledge to be useful in the decision-making
process, there must be a high-quality integrated resource, high-quality information
preparation and sharing, and a high-quality human resource to discover and
accumulate knowledge to achieve successful business intelligence.
Multidimensional Query Techniques
[Slide diagrams: data cube with Product, Time, and Geography dimensions; Slicing, Dicing, Drilling down, Drilling up, Drilling across, Pivoting; What? and Why? questions]
Multidimensional Query Techniques
These techniques are standard in modern query tools that present data in a
multidimensional manner. The following defines some of the common
multidimensional query techniques.
Slicing
Slicing means limiting the view of the data to a selection along one dimension, such as a
particular consultant, region, or cost center. An example of a slice of data is a view of the
data for a regional manager across all products and time periods.
Dicing
Dicing is slicing in multiple directions: you make the selection along more than
one dimension. In dicing, you can refine the selection by adding or removing
parts of the data cube.
Drilling
Drilling is being able to open up a subset of data that corresponds with a particular
value of a dimension. It is a term used to describe the action of moving down to further
levels of data detail or up to higher levels of summary data.
Drilling Down Is a mechanism that enables the user to examine the detail for a
summary value. The user may examine where rackets were sold, to what companies,
and how many items any individual purchased.
Drilling Up Is the ability to query detail records and navigate up to higher level
summary records.
Drilling Across Is the ability to query from one fact table to another in a single
report.
Pivoting
Pivoting data is changing the axes along which you orient your data. It also refers to
the ability to change the organization of rows and columns in a tabular report. This
enables the user to view the data along different dimensions without requerying the
database itself.
OLAP has other associated query techniques, some of which are vendor dependent.
For example, top/bottom analysis selects the top or bottom ranges of data based on
criteria to perform exception reporting.
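The techniques defined above can be tried on a very small data set. The following sketch is for illustration only: it holds a tiny "cube" as a list of fact rows in memory, whereas a query tool would run the equivalent operations against the warehouse. All names and figures are made up.

```python
from collections import defaultdict

# A tiny "cube" held as fact rows: (product, region, quarter, month, sales).
# All values are invented for illustration.
FACTS = [
    ("Racket", "East", "Q1", "Jan", 100),
    ("Racket", "East", "Q1", "Feb", 120),
    ("Racket", "West", "Q1", "Jan", 80),
    ("Ball",   "East", "Q1", "Jan", 30),
    ("Ball",   "West", "Q2", "Apr", 50),
]
COLUMNS = ("product", "region", "quarter", "month", "sales")
ROWS = [dict(zip(COLUMNS, fact)) for fact in FACTS]


def select(rows, **criteria):
    """Slice (one criterion) or dice (several criteria) the rows."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def summarize(rows, *dims):
    """Total sales grouped by the given dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)


def pivot(summary):
    """Swap the two axes of a two-dimensional summary without requerying."""
    return {(b, a): value for (a, b), value in summary.items()}


east_slice = select(ROWS, region="East")                      # slicing
east_racket = select(ROWS, region="East", product="Racket")   # dicing
by_quarter = summarize(ROWS, "quarter")                       # drilled up
by_month = summarize(select(ROWS, quarter="Q1"), "month")     # drill down in Q1
by_product_region = summarize(ROWS, "product", "region")
by_region_product = pivot(by_product_region)                  # pivoting
# Top/bottom analysis: the product with the highest total sales.
top_product = max(summarize(ROWS, "product").items(), key=lambda kv: kv[1])
```

Note that `pivot` works on a summary that has already been retrieved, which mirrors the point made above: the user sees the data along different axes without the database being queried again.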
Categories of Business Intelligence Tools
• Reporting tools
• Query tools (data access)
• Online analytical processing (OLAP) tools
• Analytical suites
• Data mining tools
• Analytical applications
Evolution of Reporting
Mainframe:
• Batch oriented
• IS controlled
• 3GL-based
• Not user-specific
• Inflexible
• IS intensive
Client-Server:
• End user empowered
• Reduced IS manageability
• Expensive
• Localized
Multitier enterprise reporting:
• Easy to use
• Manageable
• Scalable
• Accessible
Categories of Business Intelligence Tools
According to Wayne Eckerson from the Data Warehousing Institute (Criteria for
Evaluating Business Intelligence Tools, Journal of Data Warehousing, Volume 4,
Number 1, Spring 1999), the categories of business intelligence tools are:
• Reporting tools
• Query tools (data access)
• Online analytical processing (OLAP) tools
• Analytical suites
• Data mining tools
• Analytical applications
Reporting Tools
These tools allow users to produce canned, graphic-intensive, sophisticated reports
based on the warehouse data. The evolution of reporting is shown below.
Mainframe In the mainframe era, batch reporting generated large, cumbersome
reports. These reports were constructed in time-consuming, difficult-to-use 3GL
programming environments.
Client/Server The advent of the PC brought rich graphical user interfaces, leading
to the introduction of much more productive 4GL reporting tools. This, combined with
the advent of client-server computing, began to deliver much more user-friendly and
flexible reports.
Enterprise Reporting We are now in the enterprise reporting era. This new
reporting architecture delivers the combined benefits of mainframe and client-server
environments.
Oracle Reports is an enterprise reporting tool for developers to build and disseminate
sophisticated, high-quality reports. Users view reports dynamically generated by the
application-server-reporting engine. Users can access reports from anywhere in the
enterprise using a web browser.
Oracle Reports takes advantage of the scalability of the internet computing model. The
powerful reports server helps you to easily deploy your applications in a multi-tier
environment that uses an advanced caching technology to provide dynamic load
balancing.
Oracle Discoverer 3.1
[Slide diagram labels: User Edition, Viewer Edition, Administration Edition, End User Layer, Transaction Database or Data Warehouse/Mart]
Discoverer for the Web
• View workbooks using a Web browser
• Business intelligence tool that provides information anywhere and at any time
• Cost-effective
Query Tools
These tools enable users to explore a data source using intuitive ad hoc queries. The
tools provide the means for pulling the desired information from a database. They are
typically SQL-based tools and allow a user to define data in end-user language.
Oracle Discoverer is Oracle’s award-winning ad hoc query, reporting and analysis tool
designed by end users for the end users. Oracle Discoverer for the Web makes it easy
for any user to leverage information in data warehouses, data marts, and relational
databases using a web browser. It features industry-leading ease of use and
performance features such as query prediction and automatic summary management
which provide time and cost savings for the enterprise. The components of Oracle
Discoverer 3.1 are shown below.
Discoverer User Edition As an end user, you use this component to perform ad hoc
queries, generate reports, and publish information stored in the online dictionary.
Discoverer Administration Edition Business and information technology (IT) data
administrators use this component to create, maintain, and administer data and the
users’ interaction with that data.
End-User Layer This component, a server-based meta layer, hides the complexity
of the underlying relational database so that you can interact with the online dictionary
using ordinary business terms.
Discoverer Viewer Edition As an end user, you use this component to view your
data using a Web browser. Using the Discoverer Viewer, you can view the workbooks
that you have created in the User Edition, through the Internet. You can use Internet
Explorer 4.0 or Netscape 4.05 or higher browsers to access Discoverer Viewer, and it
takes advantage of the existing Discoverer installations, thus providing easy access at
any time to the workbooks stored in the database. Because of the consistent user
interface between the User Edition and the Viewer Edition, users can easily work with
their stored workbooks in Discoverer Viewer without any additional training.
The following features are available in Discoverer Viewer:
• View workbooks stored in the database
• Use drilling
• Refresh data
• Print reports
• Provide parameters to view specific data
• Customize the execution of queries
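The idea behind an end-user layer, described above as a meta layer that lets users work in ordinary business terms, can be sketched as a mapping from business terms to physical columns. This is a generic illustration under stated assumptions and not how Oracle Discoverer is implemented; every table name, column name, and function name here is hypothetical.

```python
# Hypothetical end-user layer: business terms mapped to physical columns.
# Table and column names are invented for illustration.
BUSINESS_TERMS = {
    "Product Name": "prod_dim.prod_nm",
    "Sales Region": "geo_dim.rgn_cd",
    "Sales Amount": "SUM(sales_fact.sale_amt)",
}
FROM_CLAUSE = (
    "sales_fact "
    "JOIN prod_dim ON sales_fact.prod_id = prod_dim.prod_id "
    "JOIN geo_dim ON sales_fact.geo_id = geo_dim.geo_id"
)


def build_query(items):
    """Translate a list of business terms into a SQL SELECT statement."""
    unknown = [item for item in items if item not in BUSINESS_TERMS]
    if unknown:
        raise KeyError(f"not defined in the end-user layer: {unknown}")
    select_list = [f'{BUSINESS_TERMS[i]} AS "{i}"' for i in items]
    # Non-aggregated columns must appear in GROUP BY when a SUM is selected.
    group_cols = [BUSINESS_TERMS[i] for i in items
                  if not BUSINESS_TERMS[i].startswith("SUM(")]
    sql = f"SELECT {', '.join(select_list)} FROM {FROM_CLAUSE}"
    if group_cols and len(group_cols) < len(items):
        sql += f" GROUP BY {', '.join(group_cols)}"
    return sql


query = build_query(["Sales Region", "Sales Amount"])
```

The user picks "Sales Region" and "Sales Amount"; the cryptic column names and the joins stay hidden in the layer, which is the point of defining data in end-user language.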
Online Analytical Processing (OLAP)
[Slide diagram labels: Prod, Market, Time, Sales; Product mgr. view, Regional mgr. view, Financial mgr. view, Ad hoc view]
Advanced Analytical Tasks
• Comparative and relative analysis
• Exception and trend analysis
• Time series analysis
• Forecasting
• What-if analysis
• Modeling
• Simultaneous equations
Online Analytical Processing (OLAP) Tools
OLAP tools provide a multidimensional view of data, allowing users to easily navigate
through multiple dimensions (such as customer, organization and time) and hierarchies
within dimensions (such as year, quarter, and month).
The different types of tools in this category are multidimensional OLAP (MOLAP),
relational OLAP (ROLAP), and hybrid OLAP (HOLAP). They have been discussed in
Lesson 6.
Oracle Express Oracle Express provides sophisticated online analytical processing
(OLAP) analysis through its advanced calculation engine and multidimensional data
cache. The Express multidimensional data model is optimized for the query and
analysis of corporate data, such as sales, marketing, financial, manufacturing, or
human resource data.
Oracle Express provides a native multidimensional data model for optimal OLAP
power and performance. The multidimensional model:
• Is specifically designed for analysis
• Inherently reflects the way users think about their businesses
• Ensures that end users can efficiently analyze data in a structured or ad hoc
fashion, without requesting special programs from IS personnel
Through its built-in analytic functions, Oracle Express provides the answers to a range
of complex analytic questions.
Oracle Express enables users to perform advanced analytical tasks, such as:
• Comparative and relative analysis
• Exception and trend analysis
• Modeling
• Forecasting
• Time-series analysis
• What-if analysis
It delivers powerful analytical capabilities to any Web browser, enabling sophisticated
analysis over corporate intranets and the Internet.
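To make a few of the analytical tasks named above concrete, the following sketch computes them by hand for four months of sales. It is written in plain Python for illustration and does not use or represent Oracle Express functions; the figures and the 10 percent what-if assumption are invented.

```python
# Monthly sales for one product; the figures are invented for illustration.
monthly_sales = {"Jan": 100, "Feb": 120, "Mar": 90, "Apr": 150}
months = list(monthly_sales)
values = [monthly_sales[m] for m in months]

# Comparative analysis: percentage change from the previous month.
pct_change = {
    months[i]: round(100 * (values[i] - values[i - 1]) / values[i - 1], 1)
    for i in range(1, len(values))
}

# Relative analysis: each month's share of the total, in percent.
total = sum(values)
share = {m: round(100 * v / total, 1) for m, v in monthly_sales.items()}

# Time-series analysis: three-month moving average.
moving_avg = {
    months[i]: round(sum(values[i - 2:i + 1]) / 3, 1)
    for i in range(2, len(values))
}

# What-if analysis: total sales if every month had been 10 percent higher.
what_if_total = round(total * 1.10, 1)
```

An OLAP tool provides calculations of this kind as built-in functions across all dimensions, so that end users do not have to request special programs for them.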
Analytical Suites
• Enterprise business intelligence (EBI) toolsets:
– Web-enabled query, reporting, and analysis tool that runs on a robust application server
– EBI toolset tightly integrates query, reporting, and analysis capabilities within a single tool
– Shares a common look and feel
• Business portals:
– EBI toolset with a Yahoo!-like user interface
– Flexible repository handles structured and unstructured data objects
Data Warehousing Institute
Data Mining Tools
• Identify patterns and relationships in data that are often useful for building models that aid decision making or predict behavior
• Data mining uses technologies such as neural networks, rule induction, and clustering to discover relationships in data and make predictions that are hidden, not apparent, or too complex to be extracted using statistical techniques.
Analytical Suites
According to Wayne Eckerson from the Data Warehousing Institute, the tools in the
analytical suites are as follows.
Enterprise Business Intelligence (EBI) Toolsets An EBI toolset is a Web-enabled
query, reporting, and analysis tool that runs on a robust application server instead of a
desktop machine. An EBI toolset tightly integrates query, reporting, and analysis
capabilities within the context of a single tool as opposed to a suite of tools. Each
analytical “modality” shares a common look and feel and passes data seamlessly to
each of the other modalities, as required. Web and client-server versions offer
equivalent functionality.
Business Portals A business portal is an EBI toolset with a Yahoo!-like user
interface. This tool has a flexible repository that handles structured and unstructured
data objects, and a publish-and-subscribe engine that delivers reports to users on a
customizable basis.
(Criteria for Evaluating Business Intelligence Tools, Journal of Data
Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)
Data Mining Tools
Data mining tools identify patterns and relationships in the data that are often useful
for building models that aid decision making or predict behavior. Data mining uses
technologies such as neural networks, rule induction, and clustering to discover
relationships in data and make predictions that are hidden, not apparent, or too complex
to be extracted using statistical techniques.
Note: Data mining will be covered in the next section.
Analytical Applications
• A packaged analytic application has predefined:
– Extraction feeds and transformation routines for a specific data source
– Data model, application-specific report templates, and a custom end-user interface
• Custom analytic applications are workbenches that enable developers to quickly create analytic applications from coarse-grained components, including user interface widgets, data access and analysis components, and report layouts.
Data Warehousing Institute
Analytical Applications
According to Wayne Eckerson from the Data Warehousing Institute,
“Analytical applications incorporate business intelligence tools and a data
warehouse or data mart to deliver analytical capabilities within a well-defined
business process. An analytical application uses a custom interface to step users
through a set of data collection and analysis tasks that lead up to a decision. The
analytical application also provides the context for users to act on their business
decisions, whether it involves emailing a document, updating a database, or
initiating a workflow.”
(Criteria for Evaluating Business Intelligence Tools, Journal of Data
Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)
The tools in the analytical applications are described below.
Packaged Analytic Application Packaged analytic applications come with
predefined extraction feeds and transformation routines for a specific data source, a
predefined data model, application-specific report templates, and a custom end-user
interface.
Custom Analytic Application The custom analytic applications are workbenches
that enable developers to quickly create analytic applications from coarse-grained
components, including user interface widgets, data access and analysis components,
and report layouts.
(Wayne Eckerson, Criteria for Evaluating Business Intelligence Tools, Journal
of Data Warehousing, pg. 29, Volume 4, Number 1, Spring 1999)
Definition of Data Mining
“Data mining is the exploration and analysis of large
quantities of data in order to discover meaningful
patterns, trends, relationships, and rules.”
Data mining is also known as:
• Knowledge discovery
• Data surfing
• Data harvesting
Uses of Data Mining
• Customer profiling
• Market segmentation
• Buying pattern affinities
• Database marketing
• Credit scoring and risk analysis
Data Mining in a Warehouse Environment
Definition of Data Mining
Data mining is the exploration and analysis of large quantities of data in order to
discover meaningful patterns, trends, relationships, and rules. The purpose of data
mining is to enable proactive business decisions. Data mining tools empower the user
to search for patterns of information in data. Data mining is far less user-directed than
decision support queries and relies upon specialized algorithms, such as fuzzy logic,
neural networks, genetic algorithms, and induction, that correlate information from the
data warehouse and assist in trend analysis. Data mining is a process rather than a
single technology; the goal of that process is to explore large amounts of data to
discover new trends, relationships, and categories in that data. Data mining is also
referred to as knowledge discovery, data surfing, or data harvesting.
Uses of Data Mining
Data mining has many applications:
• Store owners can use it to determine and market products according to user
classification.
– Affinities
– Purchasing patterns
– Goods purchased (basket analysis)
• Business analysts can use it to determine patterns of product purchases.
– Detect fraud
– Profile buying patterns
– Determine high- and low-risk customers
• Credit card suppliers can use it to target an audience for a new card service.
• Financial institutions can use it for credit scoring and risk analysis.
Data mining techniques can be used by anyone who needs to:
• Develop strategies for marketing
• Target mail lists
• Adjust inventory levels
• Minimize operational and financial risks to the business
• Keep costs to a minimum
• Find out something new and never before considered
Functions of Data Mining
• Discovers facts and data relationships
• Finds patterns
• Determines rules
• Retains and reuses rules
• Presents information to users
• May take many hours
• Requires knowledgeable people to analyze the results
Functions of Data Mining
Discovery Data mining queries discover facts and data relationships using
techniques such as association, frequency of occurrence, and sequential patterns.
Rule Retention Data mining techniques learn patterns and create rules to describe
the patterns; the rules are retained for reuse against larger sets of data for further
analysis.
Self-Motivating Some data mining queries require little human intervention, but do
need guidance. Certain data mining models, such as cluster analysis, do not require
any guidance at all. On the whole, data mining tasks are a guided discovery of data,
that is, you have a notion of what it is you are trying to find out—information about
debtors or selling patterns, for example.
Expert Analysis The results of a query, once presented, need knowledgeable people
to analyze and use them correctly.
Comparing DSS and Data Mining Queries
• DSS queries:
– Based on prior knowledge and assumptions
– User-driven
• Data mining queries:
– Require domain-specific knowledge to interpret data
– User-guided
Comparing DSS and Data Mining Queries
Decision support queries are driven by a user who knows how to pose a question in
order to achieve specific results. The user knows what the question is and requires the
DSS application only to supply the answer. Therefore, the user applies known
parameters to the query prior to execution, in order to achieve a result based on those
known parameters.
Data mining queries differ in that the user provides only some initial guidance rather
than a fully specified question. Interpreting the results requires users to have
domain-specific knowledge. Data mining can find answers to problems and information
that you have not considered before.
Artificial Neural Networks
• Predictive model that learns
• Developed from understanding of the human brain
• Multiple regression and other statistical techniques
[Figure: neural network with numbered nodes arranged as inputs, a hidden layer, and outputs]
Decision Trees
• Represent decisions
• Generate rules
• Classify
[Figure: decision tree with the labels Annual salary (100,000), Annual outgoing (<10,000), and Annual credit (> 50,000), and the outcomes Good and Bad]
Data Mining Techniques
Artificial Neural Networks Neural networks are nonlinear predictive models that
learn through training. They look like biological neural networks in structure.
A neural network is a network of processors, each of which contains an amount of
local memory. The units are connected by communication channels carrying numeric
data, encoded by various means. The processors operate only on their local data and
on the inputs they receive through the communication channels. The field of neural
networks arose from the development of artificial intelligence systems (among other
technologies) capable of sophisticated computations similar to those performed by the
human brain.
Many of the improvements in neural network technology have followed from a
much improved understanding of how the human brain functions.
Most neural networks have a training rule whereby the weights of communications are
adjusted based on the data; that is, they learn from examples. Neural networks are
employed by statisticians, engineers, scientists, and neurophysiologists to explore
brain function.
Neural networks can be used for classification, clustering, modeling, determining
sequences, and multiple regression and other statistical techniques.
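The training rule described above can be illustrated with a minimal sketch. It assumes the simplest possible case, a single artificial neuron (a perceptron) that learns the logical AND function; the function names and the integer learning rate are illustrative choices, not part of the course material, and real neural networks use many units and hidden layers.

```python
# Minimal sketch: one artificial neuron learns logical AND from examples.
# Integer weights and a learning rate of 1 keep the arithmetic exact.

def step(x):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if x >= 0 else 0

def predict(weights, bias, inputs):
    """Combine the inputs with the current connection weights."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(examples, epochs=20, rate=1):
    """Return (weights, bias) learned from (inputs, target) pairs."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Training rule: adjust each weight in proportion to the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)
```

The weights start at zero and change only when the neuron's output disagrees with the example, which is the sense in which the network "learns from examples."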
Decision Trees These are tree-shaped structures that show a route taken by a certain
decision, or a series of decisions. Each decision generates a rule to classify the data
that it returns. A bank may use a decision tree to determine the worthiness of a
customer requesting a loan (is the customer a good or a bad risk?). This is
classification.
Some tools that support decision tree technology (rather than data mining technology)
can display decision tree results graphically.
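As a rough sketch of the loan example, a decision tree can be written as nested tests, where each path from the root to a leaf is one classification rule. The attributes, thresholds, and the function name `classify_applicant` below are illustrative assumptions only; a data mining tool would derive the splits from historical data rather than have them coded by hand.

```python
# Minimal sketch of a hand-built decision tree for a loan applicant.
# All thresholds are hypothetical and chosen only for illustration.

def classify_applicant(annual_salary, annual_outgoing, annual_credit):
    """Walk the tree and return (risk label, rule that produced it)."""
    if annual_salary >= 100_000:
        if annual_outgoing < 10_000:
            return "good", "salary >= 100,000 AND outgoing < 10,000"
        return "bad", "salary >= 100,000 AND outgoing >= 10,000"
    if annual_credit > 50_000:
        return "bad", "salary < 100,000 AND credit > 50,000"
    return "good", "salary < 100,000 AND credit <= 50,000"

label, rule = classify_applicant(
    annual_salary=120_000, annual_outgoing=8_000, annual_credit=20_000
)
print(label, "because", rule)
```

Returning the rule alongside the label shows how each decision in the tree generates a rule that classifies the data.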
Other Techniques
• Genetic algorithms based on evolution theory
• Statistics such as averages and totals
• Nearest neighbor to find associations
• Rule induction applying IF-THEN logic
• Experiment with different techniques
Data Mining Techniques (continued)
Genetic Algorithms These are essentially optimization techniques using processes
such as natural selection and genetic combination. The design is based on the concepts
of evolution (Darwin’s theory of the survival of the fittest) and mutation theories.
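The following is a minimal sketch of those ideas, assuming a toy optimization problem (maximize the number of 1 bits in a string). The population size, mutation rate, and function names are illustrative assumptions, not values from the course.

```python
import random

def fitness(bits):
    """Toy objective: the more 1 bits, the fitter the individual."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    """Return (best fitness in the first generation, best final individual)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Natural selection: only the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]          # genetic combination
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]                     # mutation
            children.append(child)
        population = survivors + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.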
Statistics and Quantitative Analysis Data mining uses statistical techniques, many
based on linear models and some quite complex, such as averages, distributions,
ranking, regression, and clustering. There is an overlap between the
fields of neural networks and statistics.
Nearest Neighbor This technique is used for finding associations or clusters of
records. It classifies each record in a selected set of data, based on a combination of the
classes of the K records most similar to it, where K is greater than or equal to one.
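A minimal sketch of the nearest neighbor technique follows. It assumes numeric attributes compared by Euclidean distance and a simple majority vote among the K most similar records; the customer records (age, annual spend in thousands) and the labels are made-up sample data.

```python
import math
from collections import Counter

def knn_classify(records, query, k=3):
    """Classify query by majority vote of its k most similar records.

    records is a list of (attributes, label) pairs.
    """
    nearest = sorted(records, key=lambda r: math.dist(r[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, annual spend in thousands) -> status
CUSTOMERS = [
    ((25, 2), "lost"), ((30, 3), "lost"), ((28, 1), "lost"),
    ((45, 9), "loyal"), ((50, 8), "loyal"), ((48, 10), "loyal"),
]

print(knn_classify(CUSTOMERS, (47, 9)))
print(knn_classify(CUSTOMERS, (27, 2)))
```

In practice the attributes would usually be scaled first so that no single attribute dominates the distance.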
Rule Induction Data mining can extract useful IF-THEN rules based on the
statistical significance of the data. Rule induction allows you to find data associations
and sequences, and employs decision tree techniques for prediction and analysis.
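As a simplified sketch of rule induction, the code below derives one IF-THEN rule per attribute value and keeps only the rules whose confidence (the share of matching records that agree with the rule) reaches a threshold. The confidence threshold stands in for a proper test of statistical significance, and the data set, attribute names, and `induce_rules` function are illustrative assumptions.

```python
from collections import Counter, defaultdict

def induce_rules(rows, attribute, target, min_confidence=0.7):
    """Return (value, predicted class, confidence) for each retained rule.

    Each rule reads: IF attribute = value THEN target = predicted class.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    rules = []
    for value, counts in by_value.items():
        predicted, hits = counts.most_common(1)[0]
        confidence = hits / sum(counts.values())
        if confidence >= min_confidence:
            rules.append((value, predicted, confidence))
    return rules

# Hypothetical payment histories and observed risk
ROWS = (
    [{"history": "late", "risk": "high"}] * 3
    + [{"history": "late", "risk": "low"}]
    + [{"history": "on time", "risk": "low"}] * 4
    + [{"history": "none", "risk": "high"}, {"history": "none", "risk": "low"}]
)

rules = induce_rules(ROWS, "history", "risk")
for value, predicted, confidence in rules:
    print(f"IF history = {value} THEN risk = {predicted} ({confidence:.0%})")
```

The "none" history produces no rule because neither outcome is frequent enough to clear the threshold.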
No single mining technique can be recommended in isolation. The data to be analyzed
varies between businesses; the hypotheses tested are diverse. You should consider
employing as many techniques as the tool allows; you must experiment.
Note: There are many other techniques used in data mining. This is just a sample
selection.
Associations
Which items are purchased in a retail store at the
same time?
Sequential Patterns
What is the likelihood that a customer will buy a
product next month, if he buys a related item today?
Typical Data Mining Results
Associations Data mining can discover associations between items, that is, how
items relate to each other. It answers questions such as, “Which items are purchased in
a retail store at the same time?” For example, shirts and ties, eyeliner and mascara, or
cameras and televisions. However, this result does not determine the rationale behind
the association.
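One simple way to surface such associations is to count how often each pair of items appears in the same basket. The sketch below assumes a handful of made-up baskets and reports the support of each pair, that is, the fraction of baskets containing both items; the function name `pair_support` is an illustrative choice.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both}."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Hypothetical retail baskets
BASKETS = [
    ["shirt", "tie"],
    ["shirt", "tie", "belt"],
    ["shirt", "socks"],
    ["eyeliner", "mascara"],
    ["eyeliner", "mascara", "shirt"],
]

support = pair_support(BASKETS)
for pair, share in sorted(support.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, share)
```

As the text notes, a high support value shows that two items sell together but says nothing about why.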
Sequential Patterns Data mining can describe associations over some period of
time. It can answer questions such as, “What is the likelihood that a customer will buy
a product in the future, if he buys a related item today?” For example, personal
computer today, printer next month; or a set of tools today and the toolbox to put them
in tomorrow.
Patterns involving time emerge. For example, if a customer buys a set of tools today,
there may be a pattern that shows the percentage likelihood of the toolbox being
purchased tomorrow, within one week, or within two weeks. This is a good way for a
retail store to determine a marketing campaign. Classification results enable the store
to target the correct customer at the same time.
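The percentage likelihoods mentioned above can be estimated directly from purchase histories. The sketch below assumes a small, made-up set of dated purchases and measures the share of tool-set buyers who bought a toolbox within a given number of days afterward; the function and data names are hypothetical.

```python
from datetime import date

def follow_on_likelihood(purchases, first_item, next_item, within_days):
    """Share of first_item buyers who bought next_item within the window.

    purchases maps a customer ID to a list of (purchase date, item) pairs.
    """
    buyers = 0
    followers = 0
    for history in purchases.values():
        first_dates = [d for d, item in history if item == first_item]
        if not first_dates:
            continue
        buyers += 1
        start = min(first_dates)
        # Same-day purchases are excluded; they are associations, not sequences.
        if any(item == next_item and 0 < (d - start).days <= within_days
               for d, item in history):
            followers += 1
    return followers / buyers if buyers else 0.0

# Hypothetical purchase histories
PURCHASES = {
    "c1": [(date(1999, 3, 1), "tool set"), (date(1999, 3, 2), "toolbox")],
    "c2": [(date(1999, 3, 1), "tool set"), (date(1999, 3, 6), "toolbox")],
    "c3": [(date(1999, 3, 3), "tool set"), (date(1999, 3, 15), "toolbox")],
    "c4": [(date(1999, 3, 4), "tool set")],
    "c5": [(date(1999, 3, 5), "printer")],
}

for days in (1, 7, 14):
    print(days, follow_on_likelihood(PURCHASES, "tool set", "toolbox", days))
```

Comparing the likelihood across windows of one day, one week, and two weeks indicates when a follow-up promotion is most likely to be relevant.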
Classifications
Determine customers’ buying patterns, and then find
other customers with similar attributes that may be
targeted for a marketing campaign.
Modeling
Use factors, such as location, number of bedrooms,
and square footage, to determine the market value of
a property
Typical Data Mining Results (continued)
Classification Data mining can divide items into groups.
For example, determine customers’ buying patterns, and then find other customers with
similar attributes who may be targeted for a marketing campaign, such as credit card
users with balances within 10% of their maximum credit limit, or people employed in
the construction industry.
Modeling Data mining can map a set of input values to a single output value.
For example, you may use factors such as location, number of bedrooms, and square
footage to determine the market value of a property.
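One common way to build such a model is a least-squares linear fit. The sketch below assumes two numeric inputs (bedrooms and square footage), leaves out location for brevity, and uses made-up sale prices; the helper names and the pure-Python solver are illustrative, and a production model would normally rely on a statistics library.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear_model(rows, targets):
    """Least-squares weights [w0, w1, ...] via the normal equations."""
    xs = [[1.0] + list(row) for row in rows]   # leading 1.0 is the intercept
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, row):
    """Map a set of input values to a single output value."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

# Hypothetical sales: (bedrooms, square feet) -> sale price
HOMES = [(2, 900), (3, 1400), (3, 1600), (4, 2000), (5, 2600), (2, 1100)]
PRICES = [160_000, 220_000, 240_000, 290_000, 360_000, 180_000]

weights = fit_linear_model(HOMES, PRICES)
print(round(predict(weights, (3, 1500))))
```

The fitted weights play the role of the model: once learned from past sales, they can estimate the value of a property that has not yet been sold.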
Oracle Data Mining Partners
• Angoss International, Ltd.
• DataMind Corp.
• Datasage, Inc.
• Information Discovery, Inc.
• SPSS Inc.
• SRA International, Inc.
• Thinking Machines Corp.
Oracle Data Mining Partners
WTI Partner and Product
• Angoss International, Ltd.: KnowledgeSeeker IV is a data mining software tool that uses a unique cross-referencing process to enable businesses to analyze varied and disparate databases.
• DataMind Corp.: DataMind DataCruncher provides fast, accurate data mining capabilities for making sense of corporate data.
• Datasage, Inc.: DataSage Mining Manager provides a robust infrastructure to develop, deploy, and manage enterprise data mining applications, ensuring a complete solution that will increase corporate profitability and reduce the time to ROI for data mining projects.
• Information Discovery, Inc.:
– Data Mining Suite is an integrated set of products providing powerful, complete, and comprehensive solutions for large-scale enterprisewide decision support and data mining.
– Rapid Pilot Data Mining is designed for Fortune 2000 companies wanting to accelerate the data-mining introduction process and quickly gain notable results.
– Knowledge Access Suite has delivered the first and only set of products ever to provide business users with a gateway to knowledge predistilled from raw data and stored in a pattern base.
• SPSS Inc.: SPSS is an open, best-of-breed data mining solution that delivers each of the four A’s of data mining: access, analysis, action, and automation.
• SRA International, Inc.: KDD Explorer is an easy-to-use data mining toolset that assists business analysts in the discovery and analysis of novel patterns in terabyte-sized databases.
• Thinking Machines Corp.: LoyaltyStream is a complete solution that includes specific applications, software, user training, and expert consulting services for understanding customer behavior, building mining marts, building predictive models, and deploying models throughout an enterprise.
Summary
This lesson covered the following topics:
• Describing the importance of business intelligence
• Identifying where data mining might be employed in a warehouse environment
• Identifying data mining tools
Summary
This lesson covered the following topics:
• Describing the importance of business intelligence
• Identifying where data mining might be employed in a warehouse environment
• Identifying data mining tools
Practice 15-1 Overview
This practice covers the following topics:
• Identifying the type of analysis based on a description of a scenario
• Matching the category of information with a list of descriptions
• Identifying data mining techniques
Practice 15-1
1 In the following scenarios, choose the type of analysis that most accurately defines
the scenario. The types of analysis from which you may choose are:
– Query and reporting
– Multidimensional/OLAP
– Data mining
– Drill-down and pivot
– Calculations and derived data
– Spreadsheet
– Modeling, time-series and financial
– What if
Scenario | Type of Analysis
a. Show start date and salary grade for all employees reporting to Clare Maury. | ________
b. Highlight all orders above $30,000.00; drill from product totals to individual orders; look at a copy of the invoice. | ________
c. Show product sales in each region as a percentage of the total sales in that region. | ________
d. Did the $2 million promotion increase sales? | ________
e. How many people to hire, when to hire them, and where to locate them. | ________
f. If we lowered prices, would our overall revenue increase? | ________
g. Find me the relationship between X and Y. | ________
h. Show me all the products that are currently back-ordered. | ________
i. What is the 13-week moving average of sales? | ________
j. Projecting costs and allocating overhead based on head count, sales forecasts, and consumer price index (CPI). | ________
2 For the following phrases and sentences, determine which category each of them
belongs to. You may choose from the following list.
• Data
• Information
• Knowledge
• Decision
Description | Category
Mary lives in Belmont Shores, California. | ________
Point of sale (POS) | ________
AppleTree juice is bought 45% of the time that Crystal Geyser juice is bought. | ________
Let us promote Crystal Geyser juice on the East Coast of the United States in stores. | ________
Demographic | ________
Customers of the upper middle class will use 10% of their annual income during the Christmas holiday season. | ________
3 The diagram below illustrates an example of data mining. The technique that it
uses is called _________________.
[Diagram with the labels Age, Region, Loyal, Call Rate, Lost, and Service]
4 The description below describes a data mining technique. What is the technique
used?
1. If the vehicle has a 2-door frame AND
2. If the vehicle has at least six cylinders AND
3. If the buyer is less than 40 years old AND
4. If the cost of the vehicle is > $35,000 AND
5. If the vehicle color is red, THEN
6. The buyer is likely to be male.
Lesson 16: Web-Enabling the Warehouse
Overview
[Course road map]
• Defining DW Concepts & Terminology
• Planning for a Successful Warehouse
• Meeting a Business Need
• Choosing a Computing Architecture
• Planning Warehouse Storage
• Modeling the Data Warehouse
• ETT (Building the Warehouse)
• Analyzing User Query Needs
• Supporting End User Access (highlighted)
• Managing the Data Warehouse
• Project Management (Methodology, Maintaining Metadata)
Objectives
After completing this lesson, you should be able to do the following:
• Explain how the Web can expand data warehouse usage
• Describe the issues involved in putting a data warehouse on the Web
• Outline the requirements for evaluating Web-based query and analysis tools
Overview
The previous lesson covered supporting end-user access. This lesson discusses
Web-enabling the warehouse, which is another aspect of supporting end-user access to
the warehouse. Note that the “Supporting End User Access” block is highlighted in the
course road map.
Specifically, this lesson discusses how to take advantage of the Web to deploy data
warehouse information. It addresses internal and external access, as well as the
advantages of Web-enabling a data warehouse. The lesson outlines the steps involved
in deploying a Web-enabled data warehouse. Challenges in deploying a Web-enabled
data warehouse are also discussed.
Objectives
After completing this lesson, you should be able to do the following:
• Explain how the Web can expand data warehouse usage
• Describe the issues involved in putting a data warehouse on the Web
• Outline the requirements for evaluating Web-based query and analysis tools
Benefits of Web-Enabling a Data Warehouse
• Better-informed decision making
• Lower costs of deployment and management
• Lower training costs
• Remote access
• Greater collaboration among users
• Enhanced customer service and improved image as a technology leader
.....................................................................................................................................................
16-4
Data Warehousing Fundamentals
Accessing the Warehouse Over the Web
.....................................................................................................................................................
Accessing the Warehouse Over the Web
A Web-enabled data warehouse provides access to, and query capability against, your
data warehouse through a standard Web browser. It allows your users to perform ad hoc
queries against the database using the Web browser of their choice.
The primary purpose of Web-enabling a data warehouse is to give remote offices and
mobile professionals the information they need to make tactical business decisions.
Companies are increasingly aware that the Internet can help them reach out to new
markets and increase their value to customers, particularly by offering individualized,
one-to-one marketing.
Benefits of Web-Enabling a Data Warehouse
Deploying data warehouse applications on the Web is becoming increasingly popular.
The benefits of a Web-enabled data warehouse are:
• Better-informed decision making: Users with access to more comprehensive
information and analyses can make better decisions, with the results directly
affecting the organization’s bottom line.
• Lower costs of deployment and management: A Web browser serves many clients
from a single location, reducing the number of installations and upgrades needed,
and reducing the cost of support.
• Lower training costs: After a user is trained in the use of a Web browser, the user is
equipped to access and use most of the resources on the corporate intranet.
• Improved return on investment (ROI): Increasing the use of the data warehouse
spreads its value among more users and shortens the time to data warehouse ROI.
• Remote access: The ability to put information to use out of the office is greatly
expanded, because through the Web, users can access the information anytime and
anywhere.
• Enhanced customer service and improved image as a technology leader: Up-to-date
information can be made available immediately to a wide range of users,
allowing them to help themselves and get an immediate response to their
questions.
• Greater collaboration among users: Users can share information and analysis
across organizations.
Challenges of Web-Enabling a Data Warehouse
• Security
• Business value
• Impact assessment
• Setup and management
• Tools and support for global requirements
Challenges of Web-Enabling a Data Warehouse
According to the Hurwitz Group, putting a data warehouse on the Web offers
tremendous benefits, but it also presents technical and organizational challenges:
• Security: The loss of data warehouse data to hostile parties can have extremely
serious legal, financial, and competitive impacts on an organization. Make sure
that your solution has strong encryption, authorization, and authentication
services.
• Business value: In order to succeed in Web-enabling your data warehouse, you
need to have a warehouse sponsor who will help to develop a clear business case
for putting the warehouse on the Web. Some of the questions to answer include:
– What are users going to do with the Web-enabled data warehouse?
– Who will you allow to access the Web-enabled data warehouse?
– What will users be allowed to use the Web-enabled data warehouse for?
– How will this affect other departments, such as order processing, sales, indirect
channels and other business partners, and customer support?
• Impact assessment: You need to assess the impact a Web-enabled warehouse will
have on your IT organization and infrastructure. This includes:
– Changes in utilization patterns and the number of active clients
– The need to learn new skills, such as integrating a warehouse database with a
Web server
– Other areas of consideration: Networks, servers, failover and recovery
procedures, development and testing tools, and training programmers as well
as operators
• Setup and management: You need to consider how people will use the warehouse
and what impact their behavior will have on performance, availability, throughput,
and network bandwidth. You need to select among three basic query approaches:
– Static pages
– Dynamic pages
– Dynamic queries
• Tools and support for global requirements: Because putting your warehouse on the
Web stresses its load and capacity, you will need good tools for managing the
system, especially the network and various servers. You must ensure that your
vendors’ support services will meet your global support requirements.
(Source: Robert Craig, Data Warehousing and the Web.
Hurwitz Group. September/October, 1997)
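The three query approaches listed under setup and management are named in the source
but not defined. The sketch below shows one common reading, offered as an assumption:
a static page is rendered ahead of time, a dynamic page reruns a fixed query on each
request, and a dynamic query lets the user choose what to ask. The sales table, its
columns, and all function names are hypothetical, and an in-memory SQLite database
stands in for the warehouse.

```python
import html
import sqlite3

# Hypothetical miniature warehouse; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Widget", 100), ("West", "Widget", 70), ("East", "Gadget", 40)],
)

GROUPABLE = ("region", "product")  # columns a user is allowed to group by


def totals_by(column: str) -> list:
    """Total sales per value of `column`, which must be on the allow-list."""
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    # Column names cannot be bound as parameters, hence the allow-list check above.
    sql = f"SELECT {column}, SUM(amount) FROM sales GROUP BY {column} ORDER BY {column}"
    return conn.execute(sql).fetchall()


def to_html(rows: list) -> str:
    """Render (label, total) rows as a bare HTML table."""
    cells = "".join(f"<tr><td>{html.escape(k)}</td><td>{v}</td></tr>" for k, v in rows)
    return f"<table>{cells}</table>"


# Static page: rendered once, for example by a nightly batch job, then served as is.
STATIC_PAGE = to_html(totals_by("region"))


def static_page() -> str:
    return STATIC_PAGE  # no database work per request, but the data may be stale


def dynamic_page() -> str:
    return to_html(totals_by("region"))  # fixed query, run again on every request


def dynamic_query(group_by: str) -> str:
    return to_html(totals_by(group_by))  # the user decides what to ask
```

Under this reading, the approaches trade freshness and flexibility against load:
static pages cost the warehouse almost nothing per request, while dynamic queries put
the most pressure on performance, throughput, and network bandwidth.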
[Slide: Common Web Data Warehouse Architecture. Components shown: client browser, HTML,
Web server, Common Gateway Interface, gateway program, warehouse database.]
[Slide: Common Web Data Warehouse Architecture. Components shown: World Wide Web client
(client browser), Windows clients, Web server, OLAP server, warehouse server. Interfaces
listed: Common Gateway Interface (CGI), Object Request Broker Cartridge, Servlets,
Netscape Server API (NSAPI), Internet Server API (ISAPI).]
Common Web Data Warehouse Architecture
The warehouse may be accessed through a browser using a standard gateway
interface. The requestor accesses the Web server using its Uniform Resource Locator
(URL) address. The protocol between the requestor and the server is Hypertext Transfer
Protocol (HTTP). The text document that travels between the two parties (the Web server
and the requestor) is written in Hypertext Markup Language (HTML).
Warehouses are concerned with real data, not text documents. The Common Gateway
Interface (CGI) facility of the Web server software provides a way of executing
server-resident software, such as a program that issues a SELECT statement against a
relational database.
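As an illustration of the gateway program described above, here is a minimal sketch
of a CGI-style script. It assumes a hypothetical sales_summary table and uses an
in-memory SQLite database as a stand-in for the warehouse; the function names are
illustrative. The Web server is assumed to pass the URL's query string in the
QUERY_STRING environment variable, as CGI specifies.

```python
import html
import os
import sqlite3
from urllib.parse import parse_qs


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the warehouse database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_summary (region TEXT, revenue INTEGER)")
    conn.executemany(
        "INSERT INTO sales_summary VALUES (?, ?)",
        [("East", 120), ("West", 95), ("East", 30)],
    )
    return conn


def render_response(query_string: str, conn: sqlite3.Connection) -> str:
    """Run a SELECT for the requested region and return a complete CGI response."""
    params = parse_qs(query_string)
    region = params.get("region", ["East"])[0]
    # Bind the user-supplied value as a parameter; never paste it into the SQL text.
    rows = conn.execute(
        "SELECT region, SUM(revenue) FROM sales_summary WHERE region = ? GROUP BY region",
        (region,),
    ).fetchall()
    cells = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{total}</td></tr>" for name, total in rows
    )
    body = f"<html><body><table>{cells}</table></body></html>"
    # A CGI program writes the HTTP headers, a blank line, and then the HTML document.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The Web server starts this program once per request.
    print(render_response(os.environ.get("QUERY_STRING", ""), build_demo_warehouse()))
```

Because the server starts a new process for every request, this style is simple to
deploy but carries a per-request startup cost.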
Building secure applications for the Internet requires a well-thought-out security
strategy as well as the appropriate application architecture. Most Web applications
provide all users with the same access permissions. The information available is either
not confidential or of a low level of confidentiality. The same security issue currently
exists at the database level.
Note: As shown in the second slide for Common Web Data Warehouse Architecture,
the communication mechanism between the OLAP server and the Web server can be
any one of the following:
• Common Gateway Interface (CGI)
• Object Request Broker Cartridge
• Servlets
• Netscape Server API (NSAPI)
• Internet Server API (ISAPI)
• Other compatible mechanisms
Issues in Deploying a Data Warehouse on the Web
• Security:
– Authentication and authorization
– Communication confidentiality
– Access and restriction management
• Scalability
• Availability
Security
• Authentication and authorization:
– Password
– Digital certificates
– Authentication tokens
• Communication confidentiality
• Access and restriction management
Issues in Deploying a Data Warehouse on the Web
Security
The Computer Emergency Response Team (CERT), an Internet security watchdog
organization, reports that the number of security incidents reported to the center
grew dramatically, from fewer than 100 in 1988 to almost 2,500 in 1995.
The leakage of data warehouse information through unauthorized access by hostile
parties can have extremely serious legal, financial, and competitive impacts on an
organization. This is because the warehouse holds processed information, such as
summarized data, trend analyses, and confidential reports used to make business
decisions. Such leakage may also go undetected. Security is thus of utmost importance
to the data warehouse manager.
To address the security needs, the data warehouse manager needs to pay attention to
authentication and authorization, communication confidentiality, and access and
restrictions management.
Authentication and Authorization
According to CERT:
“Authentication is proving that a user is who he or she claims to be. That proof
may involve something the user knows (such as a password), something the user
has (such as a smart card), or something about the user that proves the person’s
identity (such as a fingerprint).
Authorization is the act of determining whether a particular user (or computer
system) has the right to carry out a certain activity, such as reading a file or running
a program. Authentication and authorization go hand in hand. Users must be
authenticated before carrying out the activity they are authorized to perform.”
(CERT, Security of the Internet (Web version). February 1998.)
There are three means for a user to authenticate himself or herself:
• Something the user knows, such as a PIN or reusable password
• Something the user has, such as a smart card
• Something specific to the user, such as his or her palm print or voice
The three most widely used authentication methods are:
• Password: A password consists of a string of characters and is the most basic
security measure. Unfortunately, the same password is often used to access different
systems and can be captured or stolen. It is better to use one-time passwords.
• Digital certificates: An electronic certificate that identifies users to ensure the
successful and authorized transfer of information. The certificate identifies its
owner to someone who needs proof of the bearer’s identity.
• Authentication tokens: These are small one-time password calculators with a
display and sometimes a keypad. Examples of authentication devices include
smart cards, thumbprint scanners, and retinal pattern scanners.
More advanced security technologies employ at least two of these three factors of user
authentication and identification. For example, factor one is a memorized personal
identification number, and factor two is a smart card that displays a code generated
at a programmed interval. The two factors combine to produce a one-time password.
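The combination of a memorized PIN with a code generated at a programmed interval can
be sketched as follows. This is an illustration only: it assumes the token computes
its code with the publicly specified time-based scheme of RFC 6238 (HMAC-SHA-1 over a
30-second interval), which may differ from the algorithm inside any particular
commercial token. The function names and the PIN-followed-by-code layout are
hypothetical.

```python
import hashlib
import hmac
import struct


def one_time_code(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based code in the style of RFC 6238 (HMAC-SHA-1, fixed time step)."""
    counter = unix_time // step  # index of the current time interval
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation as defined in RFC 4226
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify_passcode(passcode: str, stored_pin: str, secret: bytes, unix_time: int) -> bool:
    """Accept only the PIN (something known) followed by the current token code (something held)."""
    # Assumption for brevity: the PIN is compared in the clear; a real system
    # would store only a salted hash of it.
    expected = stored_pin + one_time_code(secret, unix_time)
    return hmac.compare_digest(passcode, expected)
```

A passcode captured in one interval is useless in the next, which is what makes the
combined value a one-time password.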
Communication Confidentiality
Ensure that third parties cannot eavesdrop on communications or impersonate
communicating parties. Data that is traversing the Internet should not be readable by
unauthorized parties. Encryption, which is the transformation of data into a form
unreadable to anyone without a suitable decryption key, is often used to protect data
confidentiality.
The two most widely-used types of encryption are symmetric key encryption and
public key encryption.
In symmetric encryption, the same key is used to encrypt and decrypt the message.
Therefore both the sender and receiver must somehow acquire the key before
confidential communication can proceed. This distribution of the key is a point of
vulnerability, and if improperly done, the communication can be compromised.
With public key encryption, one key is used to encrypt and a second different key is
used to decrypt. The first key cannot decrypt the message and can be sent from the
recipient to the sender or even made public. The sender uses this key to encrypt the
message for the recipient. This ensures confidentiality in communication but not
authentication of the sender.
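The difference between the two types of encryption can be seen in a toy example. The
sketch below is for illustration only and is not secure: the symmetric half uses a
simple XOR with a shared random key, and the public key half uses textbook RSA with
tiny primes and no padding. All names and values are illustrative, and the modular
inverse call assumes Python 3.8 or later.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Symmetric toy cipher: XOR the data with a shared key of the same length."""
    if len(key) != len(data):
        raise ValueError("key must be as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


# Symmetric encryption: one shared key both encrypts and decrypts.
message = b"Q3 revenue up 12%"
shared_key = secrets.token_bytes(len(message))  # must reach both parties secretly
ciphertext = xor_bytes(message, shared_key)
recovered = xor_bytes(ciphertext, shared_key)   # the same key reverses the operation

# Public key encryption: textbook RSA with tiny primes (insecure, illustration only).
p, q = 61, 53
n = p * q                   # 3233, part of both keys
phi = (p - 1) * (q - 1)     # 3120
e = 17                      # public exponent; the pair (e, n) may be published
d = pow(e, -1, phi)         # private exponent, kept secret by the recipient


def rsa_encrypt(m: int) -> int:
    """The sender encrypts with the recipient's public key (e, n)."""
    return pow(m, e, n)


def rsa_decrypt(c: int) -> int:
    """Only the holder of the private exponent d can decrypt."""
    return pow(c, d, n)
```

In the symmetric case the hard part is delivering shared_key safely; in the public
key case (e, n) can travel in the open, but nothing in the ciphertext proves who
sent it.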
To provide both authentication and communication confidentiality, you can use digital
certificates based on public key encryption. A trusted third party authenticates both
parties by some reliable method and issues them digital certificates.
Access and Restriction Management
There should be some way to determine across the enterprise whether a particular party
has certain privileges or access to valuable resources. When access and restriction
management is not controlled in a unified manner, certain parties may retain access
that they should no longer have. A directory server is often used as a single point
of access and a single point of authentication. Other access management tools are
routers and firewalls. Routers can be configured to restrict the flow of network packets
to selected portions of the network based on message origin and destination.
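Restricting packets by origin and destination can be sketched as an ordered rule list
in which the first matching rule decides. The addresses, the rule set, and the names
below are hypothetical, and real routers express such rules in their own
configuration syntax rather than in Python.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    source: str        # CIDR block the packet must come from
    destination: str   # CIDR block the packet must be going to
    allow: bool


# Hypothetical rule set: only the branch-office subnet may reach the warehouse subnet.
RULES = [
    Rule("192.0.2.0/24", "10.0.5.0/24", allow=True),
    Rule("0.0.0.0/0", "10.0.5.0/24", allow=False),
    Rule("0.0.0.0/0", "0.0.0.0/0", allow=True),
]


def permits(src: str, dst: str, rules=RULES) -> bool:
    """Return the decision of the first rule matching the packet's origin and destination."""
    for rule in rules:
        from_match = ip_address(src) in ip_network(rule.source)
        to_match = ip_address(dst) in ip_network(rule.destination)
        if from_match and to_match:
            return rule.allow
    return False  # default deny when no rule matches
```

Rule order matters in this design: the specific allowance for the branch office has
to come before the general rule that blocks traffic to the warehouse subnet.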
Firewalls restrict the flow of traffic from one network to another based on protocols.
Firewalls often include the capabilities of routers. In addition, a firewall can include
the capabilities of a proxy server and make requests to external computers on behalf of
internal network computers. This hides the configuration of the internal network, such
as the names, IP addresses, and operating systems of internal computers, from external
users.
Scalability
• Main concerns are:
– Amount of data
– Complexity of queries
– Number of areas
– Number of users
• Potential bottlenecks are:
– Storage capacity
– Memory
– Computational cycles
– Limits on OS resources
– Network bandwidth
Scalability
When enterprises that serve a large population offer service over the Internet, they face
unpredictable demands. In particular, they may have to handle peak and special
demand loads. Many business and government organizations have potentially
thousands, if not millions, of online users. Data warehouse demands also tend to grow
rapidly over time, so Web-based access to a data warehouse might need a high order of
scalability as well. This means that the system has to be parsimonious in the use of
computing resources per user and should be incrementally extensible through the
addition of computing resources.
The main concerns for data warehouse scalability over the Web are:
• The amount of data
• The complexity of queries
• The number of areas
• The number of users
The amount of data that is stored in a data warehouse is substantially greater than for
most operational databases and continues to grow with time. Anthem’s data warehouse,
for example, began with 1.3 TB of data and was anticipated to grow tenfold in
three years. Because users are looking for trends and comparing data, it is typical for
large amounts of data to be sent to the user per request.
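To get a feel for what tenfold growth in three years implies year by year, the
following sketch projects the size of a warehouse that starts at 1.3 TB. It assumes
smooth compound growth, which is a simplification introduced here and not a claim
from the source; the function name is illustrative.

```python
def projected_size_tb(initial_tb: float, growth_factor: float,
                      horizon_years: float, year: float) -> float:
    """Size after `year` years, assuming smooth compound growth over the horizon."""
    return initial_tb * growth_factor ** (year / horizon_years)


for year in range(4):
    print(f"year {year}: {projected_size_tb(1.3, 10, 3, year):.2f} TB")
```

Under that assumption the warehouse slightly more than doubles every year (a factor
of about 2.15), passing roughly 2.8 TB after one year and 6 TB after two before
reaching 13 TB in the third.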
The potential bottlenecks are in:
• Storage capacity
• Memory
• Computational cycles
• Limits on operating system resources such as file handles, ports, and locks
• Network bandwidth
Scalability issues should be considered from the beginning to handle both current
needs and future growth. It may be difficult or impossible to make a nonscalable
system scalable after implementation. It is also more cost-effective when resources
can be added incrementally, as growth occurs, rather than all at once.
Availability
• The Internet extends the reach of database applications throughout the enterprise,
organizations, and communities.
• More and more data warehouses require 24 x 7 availability.
• Maintenance windows for batch extract, process, and refresh of information for the
data warehouse are shrinking.
Availability
The Internet extends the reach of database applications throughout the enterprise,
organizations, and communities. This reach further highlights the importance of high
availability in data management solutions. Small businesses and global enterprises alike
have customers all over the world requiring access to data 24 hours a day, 7 days
a week. This is true of many large operational systems but is also becoming the case
for data warehouses. One consequence is that maintenance windows are shrinking or
disappearing. Another is that the system must be designed so that a failure in one part
does not make the entire system unavailable.
Maintenance windows are typically used to batch extract, process, and refresh
information for the data warehouse. Increasingly, it will be important to be able to
perform such maintenance operations on the data warehouse while it is online. This
covers everything from adding disk packs, computers, and data files, to cleaning and
refreshing the data from the operational system, to performing backup, archiving, and
recovery operations.
Evaluating Web-Based Tools
Requirements:
• Interactivity: Does the tool provide interactivity that covers tables, charts, and
quadrants?
• Functionality: Calculations, SQL generation, formatting, navigation techniques,
layout controls
• Architecture: What generation of Web architecture does the tool require?
• Performance:
– How quickly can users access the data they need?
– How long does it take to download dynamic client-side programs?
– What trade-off does the tool make between interactivity and performance?
Evaluating Web-Based Tools
Requirements
Wayne Eckerson, from the Patricia Seybold Group, outlined the following
requirements for evaluating Web-based query and analysis tools.
Requirement: Interactivity
Specific Questions to Ask:
Does the tool provide interactivity that covers tables, charts, and quadrants?
Note: Most tools provide static viewing capabilities.
Requirement: Functionality
Specific Questions to Ask:
Compare the functionality of the Web-based tool to the functionality of its
client-server-based version in the area of:
• Calculations
• SQL generation
• Formatting
• Navigation techniques
• Layout controls
Note: The Web-enabled tool must meet the requirements of your target audience.
Requirement: Architecture
Specific Questions to Ask:
It is important to consider what generation of Web architecture the tool requires.
Specifically consider:
• Does it support a four-tier architecture using CGI interfaces or native Web server
interfaces?
• Does it support a three-tier architecture using Java client and server and
proprietary client-server protocols?
• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?
• How closely is the tool tracking emerging Internet and Web standards?
Requirement: Performance
Specific Questions to Ask:
A tool that uses native Web server interfaces will run faster in a multiuser
environment than tools that use CGI. Consider the following:
• How quickly can users access the data they need?
• How long does it take to download dynamic client-side programs?
• What trade-off does the tool make between interactivity and performance?
Evaluating Web-Based Tools
Requirements:
• Design: Does the tool require designers to do coding in HTML or CGI scripts to create
sophisticated HTML reports?
• Administration: Does the tool control access to reports by user, group, and role?
• Output: Can the tool output data in a variety of formats and languages?
• Scalability:
– What platforms does the tool’s main execution engine run on?
– Does it support load-balancing?
• Databases: What databases and native drivers does the tool support?
• Pricing:
– How much does the tool cost?
– Does the tool support Web pricing?
Requirements (continued)
Requirement: Design
Specific Questions to Ask:
Does the tool require designers to do coding in HTML or CGI scripts to
create sophisticated HTML reports with drill-down, pivots, and
embedded links?
Note: Design is an important factor to consider. Most tools use their
existing client-server tools to build reports, which are then published in
HTML. However, it is important to know what gets lost in the
translation.
Requirement: Administration
Specific Questions to Ask:
The tool must be able to control access to reports by user, group, and
role. After users log on to the Web server, they should be presented with
a custom menu that shows only those reports that they are authorized to
access. Some of the questions to consider are:
• Does the tool have a utility for managing a great many report files on
a Web server?
• How does it control user access to reports?
• Does it work with existing security features of application servers
and database server?
Requirement: Output
Specific Questions to Ask:
A good tool will generate HTML for wide-based distribution as well as
reports in native proprietary format for use with helper applications.
Advanced tools should also generate Java for display within a Java
window. Specifically consider:
• Can the tool output data in a variety of formats such as grid, crosstab,
and chart and in a variety of languages such as HTML, Java, and
Excel?
• Which release of HTML does the tool support?
Requirement: Scalability
Specific Questions to Ask:
• What platforms does the tool’s main execution engine run on?
• Does it support load-balancing?
Requirement: Databases
Specific Questions to Ask:
• What databases does the tool support?
• Does it support both relational and OLAP databases?
• Does it use native drivers such as ODBC and JDBC?
• Does it support text?
Requirement: Pricing
Specific Questions to Ask:
• How much does the tool cost?
• Does it support a Web pricing model?
Note: Many companies are starting to charge by concurrent user and the
size of the server machine rather than by per-seat charges and flat-fee
server pricing.
(Patricia Seybold Group, Wayne Eckerson, Web-Based
Query Tools and Architecture. March 1997)
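The administration requirement above (access controlled by user, group, and role,
with a custom menu of authorized reports) can be pictured with a small sketch. The
data model, the names, and the sample users and reports are all hypothetical; an
actual tool would normally delegate these checks to the security features of the
application or database server.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)
    roles: set = field(default_factory=set)


@dataclass
class Report:
    title: str
    users: set = field(default_factory=set)   # individually authorized user names
    groups: set = field(default_factory=set)  # authorized groups
    roles: set = field(default_factory=set)   # authorized roles


def can_view(user: User, report: Report) -> bool:
    """A report is visible if the user, one of their groups, or one of their roles is listed."""
    return (
        user.name in report.users
        or bool(user.groups & report.groups)
        or bool(user.roles & report.roles)
    )


def custom_menu(user: User, reports: list) -> list:
    """Titles of only those reports the user is authorized to access."""
    return [r.title for r in reports if can_view(user, r)]


# Hypothetical sample data.
REPORTS = [
    Report("Quarterly revenue", groups={"finance"}),
    Report("Board summary", roles={"executive"}),
    Report("Churn forecast", users={"alice"}),
]
alice = User("alice", groups={"finance"}, roles={"analyst"})
bob = User("bob", roles={"executive"})
```

With this sample data, custom_menu(alice, REPORTS) lists the quarterly revenue and
churn forecast reports, while custom_menu(bob, REPORTS) lists only the board summary.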
Summary
This lesson covered the following topics:
• Highlighting the main benefits of Web-enabling the data warehouse
• Discussing the main issues in deploying a data warehouse on the Web
• Specifying the requirements for Web-based tools
Practice 16-1 Overview
This practice covers the following topics:
• Completing the Web-based tool requirement checklist
• Justifying each response
Practice 16-1
Web-Based Tool Requirement Checklist
For each item in the following checklist of requirements for Web-based tools, rate its
relative importance to your own organization’s needs and requirements, and explain why.
Requirement: Interactivity
Specific Questions to Ask:
Does the tool provide interactivity that covers tables, charts, and quadrants?
Is This Important to You?
Why?
Requirement: Functionality
Specific Questions to Ask:
Compare the functionality of the Web-based tool to the functionality of its
client-server-based version in the area of:
• Calculations
• SQL generation
• Formatting
• Navigation techniques
• Layout controls
Is This Important to You?
Why?
Requirement: Architecture
Specific Questions to Ask:
• Does it support a four-tier architecture using CGI interfaces or native Web server
interfaces?
• Does it support a three-tier architecture using Java client and server and
proprietary client/server protocols?
• Does it use Java applets, ActiveX controls, plug-ins, or helper applications?
• How closely is the tool tracking emerging Internet and Web standards?
Is This Important to You?
Why?
Requirement: Performance
Specific Questions to Ask:
• How quickly can users access the data they need?
• How long does it take to download dynamic client-side programs?
• What trade-off does the tool make between interactivity and performance?
Is This Important to You?
Why?
Web-Based Tool Requirement Checklist (continued)
Requirement: Design
Specific Questions to Ask:
Does the tool require designers to do coding in HTML or CGI scripts to create
sophisticated HTML reports with drill-down, pivots, and embedded links?
Is This Important to You?
Why?
Requirement: Administration
Specific Questions to Ask:
• Does the tool have a utility for managing a great many report files on a Web server?
• How does it control user access to reports?
• Does it work with existing security features of application servers and
database server?
Is This Important to You?
Why?
Requirement: Output
Specific Questions to Ask:
• Can the tool output data in a variety of formats, such as grid, crosstab, and
chart, and in a variety of languages, such as HTML, Java, and Excel?
• Which release of HTML does the tool support?
Is This Important to You?
Why?
Requirement: Scalability
Specific Questions to Ask:
• What platforms does the tool’s main execution engine run on?
• Does it support load-balancing?
Is This Important to You?
Why?
Requirement: Databases
Specific Questions to Ask:
• What databases does the tool support?
• Does it support both relational and OLAP databases?
• Does it use native drivers such as ODBC and JDBC?
• Does it support text?
Is This Important to You?
Why?
Requirement: Pricing
Specific Questions to Ask:
• How much does the tool cost?
• Does it support a Web pricing model?
Is This Important to You?
Why?
Lesson 17: Managing the Data Warehouse
[Slide: Overview. Course road map with the blocks Defining DW Concepts & Terminology;
Planning for a Successful Warehouse; Meeting a Business Need; Choosing a Computing
Architecture; Planning Warehouse Storage; Modeling the Data Warehouse; ETT (Building
the Warehouse); Analyzing User Query Needs; Managing the Data Warehouse; Supporting
End User Access; and Project Management (Methodology, Maintaining Metadata).]
Overview
This lesson explores the management issues, critical success factors, and challenges to
successful data warehouse implementation. The lesson addresses issues pertaining to
the management of the entire warehouse life cycle.
Note that the “Managing the Data Warehouse” block is highlighted in the overview
slide on the facing page.
Objectives
After completing this lesson, you should be able to do the following:
• Develop a plan for managing the transition from development to implementation
• Identify challenges pertaining to the growth of the data warehouse
• Describe backup and archive mechanisms
• Identify data warehouse performance issues
Managing the Transition to Production
• Promoting support for change
• Pilot versus large-scale implementation
• Documentation
• Testing
• Training
• Postimplementation support
• Maintaining the warehouse
Managing the Transition to Production
Another set of key management issues surrounds the transition from warehouse
development to production. These issues include:
• Promoting the support of management, developers, and end users for the changes
accompanying the warehouse
• Choosing between a manageable pilot and large-scale implementation
• Documentation
• Testing
• Training
• Postimplementation support
• Maintaining the warehouse
Promoting Support for Change
Management
To Support: Competitiveness; Business benefit
Not to Support: Fear of change; Risk avoidance
Developers
To Support: New skills; Leading edge
Not to Support: Outdated skills
End Users
To Support: Faster flexible system; Improved tools
Not to Support: Disruption; Change; Increased workload
Methods for Promoting Support
• Awareness
• Feedback
• Information
• Skills
• Education
• Direction
• Control
Promoting Support for Change
Unfortunately, not everyone easily tolerates or accepts the introduction of new systems
and associated technologies. End users and information systems personnel,
particularly, are often bombarded with new systems and technology.
There are reasons why staff may be either for or against supporting the warehouse.
Management
Reasons to Support:
• Competitive advantage
• Benefit from the investment
Reasons Not to Support:
• Fear of change
• Risk avoidance
Developers
Reasons to Support:
• Opportunity to learn new and valuable skills
• Leading-edge technology
Reasons Not to Support:
• Fear of obsoleting old skill set
End Users
Reasons to Support:
• Faster and more flexible systems
• Improved and more powerful query tools
Reasons Not to Support:
• Disruption of routine
• Change of toolset
• Increased workload
Methods for Promoting Support: Given the fears identified, there are ways you can
control the transition to this new and exciting but challenging environment. Some of
these may be obvious; however, they are worth stating.
• Ensure that everyone is aware of the benefit the warehouse is going to bring to the
business. A profitable organization is able to grow, compete, adapt, and keep staff.
• Ensure that all staff involved in the warehouse project are aware of what is
happening at each stage. Provide constant and consistent feedback on status,
including problems and successes.
• Ensure that the IT staff are trained with the skills they need (old and new).
• Provide users with the training necessary to use the query tools effectively and
imaginatively.
• Keep the project on course. Do not let any phase of development slip without
understanding why, and apply what you learn to the next increment. Monitor progress
constantly.
Choosing Between Pilot and Large-Scale Implementation
[Slide graphic: Pilot and Large-Scale Implementation]
The Warehouse Pilot
• Demonstrates benefits to:
– Management
– Users
– IT staff
• Relevant to the business
• Low technical risk
• Small and feasible
• Anticipates increased use
• Focused on an initial business issue
• Remains in context
Choosing Between Pilot and Large-Scale Implementation
This choice should already have been made at an earlier planning stage. The preferable
choice is a pilot, the success of which can be leveraged into further incremental
rollouts.
The Warehouse Pilot
The pilot demonstrates benefits to management, end users, and IT staff.
Management: The warehouse can provide current and ongoing financial benefits to
the business.
End Users: The types of information available, the flexibility of the tools, and the
type of analysis possible.
IT Staff: Whether their strategy and development plans were appropriate. Changes
can be made prior to developing the next increment.
Essential considerations for the pilot are to:
• Ensure that the subject matter chosen is relevant to the business. Thus, the pilot
may focus on an initial business issue such as sales or marketing.
• Have a low technical risk by starting small and feasible. It may be that the pilot
data comes from a single relational source and therefore is most likely to succeed
as a proof of concept. Further iterations may extract data from diverse sources.
• Anticipate significant use.
• Ensure that the pilot, however small, remains within the context of the larger
vision.
Piloting the Warehouse
• Designers
– Prove model, data, and access tools
• Users
– Prove ease of use of tool
– Check data and query performance
– Identify training requirements
• Developers
– Resolve ETT and metadata issues
– Determine users’ data and training requirements
– Test security and access levels, monitor performance
Piloting the Warehouse
You can position the pilot, or prototype, as the starting point of the iterative warehouse
process mentioned earlier. It is a vital part of the implementation.
The pilot must cover all aspects of implementation and ensure user involvement at
every step in the process or phase of the life cycle. A specific subject area of the
warehouse is targeted for the pilot, and the query tools selected should be available to
the users for data access.
The pilot fulfills a number of tasks, including those in the following list:
• It enables the designers to prove the model, the data, and the access tools.
• It enables the users to:
– See how easy the access tool is to use
– Enhance their data requirements
– Identify their training requirements
– Measure query performance
• It allows the developers to:
– Determine whether the ETT process is adequate and modify it accordingly
– Identify any issues with the metadata presented to the users or used by ETT
– Determine the users’ near future and possibly even long-term requirements
– Identify and define the users’ training needs
– Test access levels and security of the systems and data
– Monitor performance
Several things must be agreed upon before piloting:
• You must ensure that acceptance criteria are documented and agreed upon.
• You need to identify volume and scalability tests and develop a test plan with test
cycles.
• Once the tests are executed, you can gather statistics on performance and optimize
where necessary.
You must test the entire process of refreshing the data, and produce a report that
contains a complete and detailed evaluation of this proof of concept.
Documentation
Produces textual deliverables:
• Glossary
• User and technical documentation
• Online help
• Metadata reference guide
• Warehouse management reference
• New features guide
Testing the Warehouse
• Test every stage
• Use a realistic test database and environment
Documentation
This process centers on producing all user and technical documentation for the data
warehouse, including references, user and system operations guides, and online help.
Metadata Reference Guide: To ensure active and successful use of the warehouse,
the metadata reference guide describes the contents of the data warehouse in business
terms and provides a navigational road map to the contents of the warehouse.
Warehouse Management Reference: The warehouse management documentation
outlines the workflow and procedures (both manual and automated).
New Features Guide: The new features guide highlights any enhancements to
warehouse functionality that result from the implementation of the solution.
Testing the Warehouse
Do not assume, “No problem, it will work.” Always test components.
Test Database: Testing is required at every stage of development, involving every
component, ideally on a test database, using a machine and network setup as close as
possible to the planned production environment.
If you are using Oracle Data Warehouse Method, testing is a specific requirement
during most phases and for many tasks.
Training
• Users:
– Metadata
– DSS tools
– Ad hoc queries
– Getting help
– Registration of enhancement requests
• Information systems developers:
– Analysis techniques
– Hardware technicalities
– Networking
– Implementing, building, and supporting DSS
Training
During project planning, allocate time and resources to educating key information
technology staff, end users, and management personnel about data warehousing and its
benefits. Education begins at the start of development and continues right through to
the end and on to further iterations.
Educating Users: Educating users on how to access data is one of the most critical
areas of warehouse training. Always ensure that representatives from each user group
are invited to courses and workshop sessions.
Users need to know how:
• The metadata represents the business data
• To use the decision support tools to answer business questions
• To create ad hoc queries and save data results
• To contact the help desk or support group for assistance
• To register requests for enhancements through a formal change management
process
Educating IT Staff: Information systems staff need education in the following areas:
• How to communicate and understand people issues
• Business analysis techniques
• Technical aspects of the hardware architecture
• The network environment
• Decision support and OLAP tools—implementing, building, and supporting
Educating everyone involved with the warehouse is especially critical for the first
implementation. Everyone must be made aware of what the warehouse is, even if they
are not directly involved with the project.
Postimplementation Support
• Evaluate and review the implementation
• Monitor the warehouse:
– Respond to problems
– Conduct performance tuning
– Roll out metadata, queries, reports, filters, and conditions
– Implement security
– Incorporate new users
– Distribute data marts and catalogs
– Transfer ownership from IT
Postimplementation Support
This process provides an opportunity to evaluate and review the implementation. You
access metadata and evaluate queries and reports run against the warehouse. The
information assists with managing standard queries and reports and the user layer and
identifies required indexes.
Monitoring the Warehouse: After implementation, you need to monitor the
warehouse continuously. Ongoing tasks include:
• Monitoring and responding to system problems
• Conducting performance and tuning activities for all components of the data
warehouse
• Rolling out metadata, queries, reports, filters, and conditions
• Implementing security
• Incorporating new users
• Distributing data marts and catalogs
• Transferring ownership (responsibility for the data warehouse may be transferred
from IT personnel to the owning organization)
Managing Growth
Expanding user numbers
[Chart: Number of Users (0 to 300) plotted against Period after Implementation
(Initial, 3 Months, 6 Months, 12 Months, 24 Months)]
Source: Data Warehouse Institute Flash Report, January 1996
Types of Growth
• Increasing number of users
• Broader usage
• Growth of data volumes
Managing Growth
The table below is the result of a survey showing that the number of users accessing
the successful warehouse grows substantially during the first two years. You can see
that between 12 and 24 months there is a substantial rise in use.
Number of Users Actively Querying the Warehouse
Period             Large and small sites   Small sites
Initial number     16                      6
After 3 months     19                      12
After 6 months     44                      20
After 12 months    99                      28
After 24 months    255                     55
Once the benefits of the warehouse become tangible to the user community, demands
on the warehouse increase dramatically.
The table and chart are sourced from the Data Warehouse Institute Flash Report,
January 1996.
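To make the growth in the survey figures concrete, the following minimal sketch computes how many times the user count multiplies between survey points. The numbers are those in the table; the function and variable names are illustrative only.

```python
# Active-user counts from the survey table (months after implementation -> users).
ALL_SITES = {0: 16, 3: 19, 6: 44, 12: 99, 24: 255}    # large and small sites
SMALL_SITES = {0: 6, 3: 12, 6: 20, 12: 28, 24: 55}    # small sites only


def period_multiples(users_by_month):
    """Return (start_month, end_month, multiple) for consecutive survey points."""
    months = sorted(users_by_month)
    return [
        (start, end, users_by_month[end] / users_by_month[start])
        for start, end in zip(months, months[1:])
    ]


def overall_multiple(users_by_month):
    """Return the final user count divided by the initial user count."""
    months = sorted(users_by_month)
    return users_by_month[months[-1]] / users_by_month[months[0]]


for label, data in (("Large and small sites", ALL_SITES), ("Small sites", SMALL_SITES)):
    print(label)
    for start, end, multiple in period_multiples(data):
        print(f"  month {start} to {end}: x{multiple:.2f}")
    print(f"  overall: x{overall_multiple(data):.1f}")
```

For the combined sites this works out to roughly a 16-fold increase over 24 months, with the user count multiplying by about 2.6 between months 12 and 24.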
Types of Growth
• Increasing number of users
• Broader, more varied usage
• Growth of data volumes
The database increases in size through the accumulation of historical data and addition
of new subject areas.
Warehouse usage increases through the availability of new decision support
functionality and evolving empowerment of the user population.
Expansion and Adjustment
• Evaluate continually:
– Changes
– New increments
– Unnecessary components
– Strategies
• Ensure open environment
• Document development processes for the future:
– Planning
– Cost analysis
– Problem assessment and correction
– Performance assessment
Controlling Expansion
Control by
• Ensuring the continuity of staff
• Creating a strategy for maintaining changes to data
• Documenting processes, solutions, and metrics
• Establishing working test and production architecture for further increments
Expansion and Adjustment
Continually evaluate the warehouse to identify:
• Changes that can be made
• Additional increments (although these are usually identified in the primary strategy
phase of development)
• Components that may be removed (for example, unused summaries)
• Optimal indexing and performance strategies
Openness for the Future: An open architecture and toolset are required to suit current
and future requirements.
Document for the Future: You should document the process used in developing the
data warehouse solution and collect metrics, as an aid to:
• Future planning
• Further and future cost analysis of current or new projects
• Identification of errors and inadequacies that can be eliminated for the next project
• Assessment of tool performance
Note: The DWM Transition to Production Phase creates tasks for these
postimplementation issues. The Discovery Phase evaluates all warehouse components.
Controlling Expansion
To control the expansion and adjustment process, and to promote its success, you
should:
• Ensure the continuity of staff on warehouse projects
• Document the process used in developing the warehouse solution and metrics
• Establish a working test and production architecture that can be used for further
increments
As organizational structures change, the historical data reflects a different story.
Determine a strategy for managing changes to the data.
Sizing Storage
• Consider different methods
• Determine the best for your needs
• Know the business requirements
• Do not underestimate requirements
• Plan for growth
• Consider space for unwanted data
Sizing Storage
Sizing data storage (or capacity planning) takes place at a number of stages, for each
increment of the data warehouse solution. It is often revised before being finalized.
Sizing must take into account all the object space needed, not just the database itself
with the warehouse data.
Do Not Underestimate: Capacity planning is an art in itself. There are many objects
for which space must be accurately estimated, such as tables, indexes, logs, sort areas,
and temporary space. You may think that this is not much different from the
operational system; however, with the warehouse you are looking at very large
databases with very large space requirements.
It is all too common, when sizing, to forget these additional objects.
Planning for Growth: In addition, your early planning stages must consider the
growth of these areas. Once implemented, the data warehouse grows rapidly with
every refresh cycle, and space must be available for that growth.
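As a rough illustration of planning for growth, the sketch below projects warehouse size after each refresh cycle. It assumes, purely for illustration, that each refresh adds a known load volume and that the load volume itself may grow by a fixed fraction per cycle; the function name, parameters, and any figures you pass in are hypothetical and should be replaced with measurements from your own test loads.

```python
def project_storage_gb(initial_gb, load_gb_per_refresh, refreshes, load_growth_rate=0.0):
    """Return the projected size in GB after each refresh cycle.

    initial_gb           size of the warehouse at go-live
    load_gb_per_refresh  data added by the first refresh
    refreshes            number of refresh cycles to project
    load_growth_rate     fractional increase in load volume per cycle
                         (0.0 means every refresh adds the same amount)
    """
    size = float(initial_gb)
    load = float(load_gb_per_refresh)
    sizes = []
    for _ in range(refreshes):
        size += load                    # accumulate this cycle's load
        sizes.append(size)
        load *= 1.0 + load_growth_rate  # next cycle's load may be larger
    return sizes


# Hypothetical inputs: 100 GB at go-live, 10 GB per refresh, 12 refreshes,
# first with constant loads and then with loads growing 5% per cycle.
constant = project_storage_gb(100, 10, 12)
growing = project_storage_gb(100, 10, 12, load_growth_rate=0.05)
print(f"after 12 refreshes: {constant[-1]:.0f} GB (constant loads)")
print(f"after 12 refreshes: {growing[-1]:.0f} GB (loads growing 5% per cycle)")
```

Comparing the two projections shows how sensitive the space requirement is to the assumed growth in load volume, which is one reason to revisit the estimate at each increment.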
Removing Unwanted Data: When data is not needed, it is either purged (removed
and never used again) or archived (for possible later use). Consider the space and
location of archive data.
Pay careful attention to determining the storage requirements for the warehouse. This
includes space for:
• Data—fact, dimension, reference, and summary
• The staging file store
• Indexes
• Backup and recovery strategies
• Temporary files
Estimating Storage
• Fact volumes
• Fact lifetime
• Technology availability
• Technology purchase
• Storing pre-summarized data
• Mirroring or other techniques requiring disk storage
Objects That Need Space
• ODS
• Indexes and metadata
• Summary data
• Redo logs
• Rollback information
• Sort areas
• Temporary space
• Workspace for backup and recovery
Estimating Storage
In any discussion on this subject, you find a vast array of different ideas, opinions,
methods, and approaches. There is no one single recommendation. You need to
consider the different approaches that are possible, choose the best for you and your
data warehouse, and keep it simple. You should never underestimate the amount of
space needed in the data warehouse.
In order to estimate accurately, you need to answer some simple questions about your
data:
• What is the expected volume of core fact data?
• What is the lifetime of core fact data?
• Do you have the technologies to support that volume?
• If not, do you need to purchase the technologies?
• How important is storing pre-summarized data?
• Does your recovery strategy involve mirroring or other techniques requiring disk
storage?
Objects That Need Space
A detailed understanding of available data is essential for planning capacity at an early
stage. Capacity planning is ongoing throughout the life of the warehouse. Consider
disk requirements:
• Intermediate data store (This is sometimes implemented as an Operational Data
Store (ODS) and referred to as a staging area. It holds data that has been extracted
from source systems, prior to being loaded into the warehouse.)
• Indexes, of which there may be many more than in normal operational systems
• Metadata that contains the map to the warehouse structure and content
• Summary data that comprises aggregated data
• Redo logs and rollback information
• Sort areas and temporary space
• Load files moved to the server
• Workspace for backup and recovery
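One simple way to keep these additional objects from being forgotten is to budget each one explicitly. The sketch below assumes that the space for each object class can be approximated as a fraction of the base fact data; the ratios shown are placeholders chosen only to make the example run, not recommendations, and the names are hypothetical.

```python
# Hypothetical planning ratios: space for each object class as a fraction of
# base fact data. Replace these placeholders with figures from your own site.
EXAMPLE_RATIOS = {
    "intermediate_data_store": 0.25,
    "indexes": 1.0,
    "metadata": 0.125,
    "summary_data": 0.5,
    "redo_and_rollback": 0.125,
    "sort_and_temporary": 0.25,
    "load_files": 0.25,
    "backup_workspace": 0.5,
}


def disk_budget_gb(base_fact_gb, ratios):
    """Return the space in GB for each object class, plus a 'total' entry."""
    budget = {"base_fact_data": float(base_fact_gb)}
    for name, ratio in ratios.items():
        budget[name] = base_fact_gb * ratio
    budget["total"] = sum(budget.values())  # summed before 'total' is added
    return budget


for name, gigabytes in disk_budget_gb(100, EXAMPLE_RATIOS).items():
    print(f"{name:<24} {gigabytes:>8.1f} GB")
```

Listing every object class in one place makes omissions visible and gives you a baseline to revise as capacity planning continues through the life of the warehouse.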
Test Load Sampling
• Analyze statistically significant data samples
• Use test loads for different periods
• Reflect day-to-day operations
• Include seasonal data and worst case scenarios
– Calculate number of transactions
– Employ average sales price approach
Test Load Sampling
You have to decide on the capacity planning technique that suits you best. You may
already have a method that is successful for your operational environment and can be
enhanced for VLDBs and other warehouse objects, such as ODSs.
A good approach to sizing is based on the analysis of a statistically significant sample
of the data.
Test loads can be performed on data from a day, a week, a month, or any other period
of time. Care must be taken that the sample periods reflect the true day-to-day
operations of your company, and the results include any seasonality issues or other
factors, such as worst-case scenarios that may otherwise prejudice the results. Once
you have determined the number of transactions based on the sample, then you
calculate the size by using the average sales price approach.
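The extrapolation from test loads can be sketched as follows. This is a minimal illustration that assumes you have daily row counts from sample loads (including seasonal peaks) and an assumed row length in bytes; it converts the row count to bytes by row length rather than by average sales price, and all names and defaults are hypothetical.

```python
from statistics import mean


def estimate_rows(sample_daily_rows, days_per_year=365, retention_years=1):
    """Extrapolate fact-row counts from sampled daily test loads."""
    average = mean(sample_daily_rows)
    return {
        "average_daily_rows": average,
        "peak_daily_rows": max(sample_daily_rows),  # worst case seen in the sample
        "rows_retained": average * days_per_year * retention_years,
    }


def estimate_bytes(rows, bytes_per_row):
    """Convert a row count into raw table bytes (no indexes or other overhead)."""
    return rows * bytes_per_row


# Hypothetical sample: four ordinary days and one seasonal peak day.
sample = [90_000, 100_000, 110_000, 100_000, 400_000]
result = estimate_rows(sample, retention_years=3)
size = estimate_bytes(result["rows_retained"], bytes_per_row=32)
print(f"average per day: {result['average_daily_rows']:,.0f} rows")
print(f"peak day:        {result['peak_daily_rows']:,.0f} rows")
print(f"raw fact data:   {size / 1e9:.1f} GB over 3 years")
```

Reporting the peak alongside the average makes it easier to see how much a single seasonal day shifts the estimate, which is why the sample periods need to be chosen with care.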
Average Sales Price
Assume transaction-level grain.
Total company revenue: $20 billion
Avg sale price per line item: $5
Number of line items per year: $20 billion / $5 = 4 billion
Number of base fact records: 4 billion x 3 yrs = 12 billion
Key fields: 4 (x 4 bytes)
Fact fields: 4 (x 4 bytes)
Base fact table size: 12 billion x 8 fields x 4 bytes = 384 GB
Average Sales Price
Use other methods:
• It is difficult to obtain an accurate average
• The calculations can be inaccurate
Do not use this approach on its own
.....................................................................................................................................................
17-28
Data Warehousing Fundamentals
Managing Growth
.....................................................................................................................................................
Average Sales Price
The following calculation shows how to estimate the amount of direct-access storage
device (DASD) needed for three years’ worth of data, using an average sales price
algorithm.
Total company revenue                                                  $20 billion
Average sales price per line item on an individual customer receipt    $5
Number of line items per year for total business                       $20 billion / $5 = 4 billion
Number of base fact records                                            4 billion × 3 (years) = 12 billion
Number of key fields                                                   4 (assume 4 bytes per field)
Number of fact fields                                                  4 (assume 4 bytes per field)
Total fields                                                           8
Base fact table size                                                   12 billion × 8 fields × 4 bytes = 384 GB
If you take your company’s annual gross revenues and divide by the average revenue
per transaction, then multiply this figure by the length of the row (key columns and
data columns) in your fact table, you have the amount of DASD needed for a year's
worth of data.
You should never use this approach on its own; it is simplistic. The problem is that it is
difficult to get the average revenue per transaction. It is unusual to have a set price
point or even a relatively narrow price range for the products offered by any company.
Many companies have products that sell in volume at relatively low prices, say $5, and
they may have low-volume big-ticket items as well, all of which distort the average.
For example, if the average used is $5, you need 384 GB of DASD, but if the average
is in reality $10, you need only 192 GB of DASD.
Note: This approach is one that is recommended by Ralph Kimball, and takes a
business view rather than a technical view to sizing.
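The arithmetic of the average sales price approach can be sketched as a small function, which makes its sensitivity to the assumed average easy to see. The function and parameter names are illustrative only, and sizes are reported in decimal gigabytes (10^9 bytes), which is an assumption about the units used in the calculation.

```python
GB = 10**9  # decimal gigabyte; assumed unit for the figures in this lesson


def base_fact_table_bytes(annual_revenue, avg_price_per_line_item, years,
                          key_fields, fact_fields, bytes_per_field=4):
    """Estimate base fact table size using the average sales price approach."""
    line_items_per_year = annual_revenue / avg_price_per_line_item
    base_fact_records = line_items_per_year * years
    row_length = (key_fields + fact_fields) * bytes_per_field
    return base_fact_records * row_length


# $20 billion revenue, 3 years of history, 4 key fields and 4 fact fields.
size_at_5 = base_fact_table_bytes(20_000_000_000, 5, 3, 4, 4) / GB
size_at_10 = base_fact_table_bytes(20_000_000_000, 10, 3, 4, 4) / GB
```

Doubling the assumed average price halves the estimate, which is why an inaccurate average makes the result unreliable when the method is used on its own.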
Other Techniques and Considerations
• Queuing models
• Rule of thumb
  Total database size is three to four times the size of the base fact tables
• Consider:
– Sparseness
– Dimensions
– Indexes
– Summaries
– Sort operational space
Space Management
• Monitor
• Avoid fragmentation
• Test load data
• Plan for growth
• Know business patterns
• Never let space become an issue
Other Techniques and Considerations
Queuing Models
Mathematical models can predict response time and throughput.
Rule of Thumb This rule is often quoted within Oracle. Depending upon the
database server and end-user tools, the total database size is three to four times the size
of your base fact tables.
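Applied to a base fact table estimate, the rule of thumb gives a range rather than a single figure. The sketch below is illustrative only; the function name is hypothetical and the 384 GB input is taken from the average sales price calculation.

```python
def total_database_size_range(base_fact_gb, low_factor=3, high_factor=4):
    """Return the (low, high) total database size implied by the rule of thumb."""
    return base_fact_gb * low_factor, base_fact_gb * high_factor


low_gb, high_gb = total_database_size_range(384)
```

The width of the range reflects the dependence on the database server and end-user tools; it is a planning bracket, not a prediction.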
Other Considerations
• Sparseness of data in fact tables
You must also consider the sparseness of data. Fact table data is generally sparse;
relatively few of all the possible key value combinations are present. Summary
tables are not considered sparse. That is, they contain values for every possible key
value combination.
• Large dimension tables
• Significant increase in size of database caused by indexes
• Large summary tables. Sometimes they occupy as much space as the base fact
table. There may be hundreds of summary tables for a warehouse implementation.
• The need for sort operational space for sorting and loading
Note: You may consider using leasing and chargeback strategies for any excess
storage capacity, especially in a massively parallel processor (MPP) configuration.
Space Management
You have determined a technique for planning capacity and are aware of the numerous
objects that need space; you need to consider management of this space:
• The space usage must be monitored and any fragmentation noted and resolved.
• You should load test sets of data and consider careful analysis (use the ANALYZE
command) to estimate average row length and rows per block, to predict whether
you have sufficient capacity.
• You need to consider how the database is going to grow, and plan for additional
storage accordingly. Fact data grows rapidly, depending upon the refresh cycle
frequency; it grows every time a refresh occurs.
• Knowing the patterns within your business is key to planning these requirements.
Never allow space to become an issue in a warehousing environment; you can see,
with all the operations discussed, how important it is.
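Once a test load has been analyzed, the average row length can be turned into a block count and projected forward over a number of refresh cycles. The following is a rough sketch under stated assumptions: the 8 KB block size, 10 percent free space, and 100-byte block overhead are hypothetical defaults, not values from this course, and all function names are illustrative.

```python
def rows_per_block(avg_row_len, block_size=8192, pct_free=10, block_overhead=100):
    """Rows that fit in one block after overhead and reserved free space."""
    usable = (block_size - block_overhead) * (100 - pct_free) // 100
    rows = usable // avg_row_len
    if rows < 1:
        raise ValueError("average row does not fit in one block")
    return rows


def blocks_needed(row_count, avg_row_len, **block_params):
    """Blocks required to hold row_count rows (rounded up)."""
    per_block = rows_per_block(avg_row_len, **block_params)
    return -(-row_count // per_block)  # ceiling division on integers


def projected_blocks(current_rows, rows_per_refresh, refreshes, avg_row_len,
                     **block_params):
    """Blocks required after a given number of refresh cycles."""
    total_rows = current_rows + rows_per_refresh * refreshes
    return blocks_needed(total_rows, avg_row_len, **block_params)


# Hypothetical test load: 1 million rows averaging 32 bytes each,
# growing by 100,000 rows per refresh over 12 refreshes.
now = blocks_needed(1_000_000, 32)
later = projected_blocks(1_000_000, 100_000, 12, 32)
```

Comparing the projected block count with the space actually allocated shows how many refresh cycles remain before more storage is needed.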
Managing Backup and Recovery
• Business requirements for availability
• Fast recovery essential
• Strategy:
– Defined
– Tested
– Proven
– Evolving
Backup Strategy
• Is based on the business requirements and the cost benefit
• Involves large volumes of data:
– All objects except temporary tablespaces
– Incremental
• Includes first-time load and refreshes
Managing Backup and Recovery
Availability
Availability is the key requirement of mission-critical data warehouses; recovering
after any type of failure must happen fast.
Some companies demand round-the-clock availability, making partial recovery
mechanisms imperative.
Backup Strategy Recovering the database quickly requires a well-defined, tested,
and proven backup strategy, as well as a disaster recovery strategy in case of fire,
flood, or infestation.
Evolving Strategies Ensure that as the warehouse evolves, the backup and recovery
strategies also evolve synchronously. Test the backup and recovery procedures
constantly to ensure that they are relevant to your current environment.
The strategy you deploy is based on the business requirements and the cost benefit.
The strategy is not just when and what to back up, but what tools and utilities you are
going to use.
Backing up data is different in the data warehouse environment. You are dealing with
much larger volumes of data than operational systems and higher availability
requirements.
What to Back Up Everything in the data warehouse must be backed up, except
temporary tablespaces; that is, the data and tables, metadata, indexes,
constraints, stored procedures, and triggers.
When to Back Up A critical part of your overall strategy is to determine when the
database needs to be backed up. This is no different from an operational environment,
except that the frequency of changes to data is unlikely to be as great as that in the
operational environment.
You should back up after the first-time load, after incremental refreshes, and after any
changes to the database structure, such as adding fact or summary tables. Incremental
backups are used, because the data is static between loads.
You need to outline the strategy to include full and incremental backups as you would
in an operational environment.
Defining the Strategy
• Mission-critical systems
• SLAs:
– Defined downtime
– Acceptable MTBF
• Efficient backup and recovery
• Evaluation of different technologies
Planning for Backup
• Plan at the design stage
• Use hot backups for VLDBs
• Back up necessary components:
– Fact and dimension data
– Warehouse schema
– Metadata schema
– Metadata
• Export/Import utility:
– Disk space
– Time
Defining the Strategy
All backup and recovery strategies and tasks must reflect the fact that the data
warehouse contains valuable mission-critical data.
A service level agreement (SLA) is drawn up between you and the customer (the
users) in the early stages. The SLA should at least define what downtime means
(each user may have a different perspective on this) and the acceptable
mean-time-between-failure (MTBF) figures.
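One common way to relate an MTBF figure to an availability target is sketched below. Note that it relies on mean time to repair (MTTR), a quantity this lesson does not define; it is introduced here purely as an assumption, and the function names and figures are hypothetical.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: uptime as a fraction of total time.

    mttr_hours (mean time to repair) is an assumed input, not part of the SLA
    items listed in this lesson.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_hours(target_availability, period_hours=24 * 365):
    """Downtime budget implied by an availability target over a period."""
    return (1 - target_availability) * period_hours


# Hypothetical figures: a failure every 990 hours, 10 hours to recover.
current = availability(990, 10)
budget = allowed_downtime_hours(0.99)
```

Expressing the SLA as a downtime budget per year gives users and operations staff a shared, concrete definition to agree on.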
The backup hardware environment must be as efficient as possible, considering the
implications and technicalities of deploying RAID, striping, mirroring (some parts of
the database need to be mirrored, others can employ RAID), or partitioning (backing
up partitions of data rather than an entire database).
Planning for Backup
The backup and recovery strategy for a warehouse needs to be considered at the design
stage. Details such as how the data is partitioned greatly affect the strategy. For small
and medium databases, daily cold backups (taken while all instances of the database
are shut down) and export/import are viable backup tools.
However, once you move to VLDBs, complete cold backups become difficult to fit
into an overnight window. In addition, the disk space required for a complete export of
a large database becomes an issue. You need to consider other strategies such as using
tape or other devices.
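Whether a complete backup fits the overnight window is a simple throughput calculation. The sketch below is illustrative: the 36 GB per hour device rate, the 8-hour window, and the function names are all hypothetical, and it assumes that parallel streams scale linearly.

```python
def backup_hours(size_gb, throughput_gb_per_hour, parallel_streams=1):
    """Elapsed hours to copy size_gb, assuming streams scale linearly."""
    return size_gb / (throughput_gb_per_hour * parallel_streams)


def fits_window(size_gb, throughput_gb_per_hour, window_hours, parallel_streams=1):
    """True if the backup completes within the available window."""
    return backup_hours(size_gb, throughput_gb_per_hour,
                        parallel_streams) <= window_hours


# Hypothetical 1,152 GB database, 36 GB/hour per device, 8-hour window.
single_stream_hours = backup_hours(1152, 36)
fits_single = fits_window(1152, 36, window_hours=8)
fits_four_streams = fits_window(1152, 36, window_hours=8, parallel_streams=4)
```

A calculation like this shows quickly when a complete cold backup stops being practical and partial or parallel backups have to be considered.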
The defined backup strategy for the warehouse should allow for hot backups, where
you can back up any part of the database at any time of the day, while the database
instances are still active. With Oracle, this means backing up individual and active
tablespaces.
You should back up every component that is essential to warehouse operations;
everything required to restore a working environment: fact data, dimension data, data
warehouse and metadata schema, and data warehouse metadata.
Export/Import
The export/import utility enables all or part of a database to be extracted into a
dump file and then imported into another database (under another owner if required).
Generally, import/export of a VLDB uses too much disk space. You could use named
pipes to a disk on a UNIX system to overcome space problems. However, this
technique would be very time-consuming.
Backup Tools
• Oracle7 Enterprise Backup Utility
• Oracle8 Recovery Manager
• Utilities:
– Import and Export
– Operating system
– Third party
Parallel Backup and Recovery
• Parallel Backup
  Runs simultaneously from any node
– Off-line
– Online
• Parallel Recovery
  Runs simultaneously from redo logs
Backup Tools
Oracle7 Enterprise Backup Utility (OEBU) This provides a user-friendly
interface, documentation, and the recording of backup details in a recovery catalog.
Oracle8 Recovery Manager (RMAN) Oracle8 Recovery Manager creates image
backups and incremental backups. RMAN stores the information from multiple data
files (or archive logs, but not both) in a backup set, stored in a format that cannot be
processed directly (similar to the Export.dmp file principle). RMAN performs either
cold or hot backups.
OEBU and RMAN are very useful in the VLDB environment to ensure that tasks
occur without error.
Utilities
• Oracle Import and Export
• Operating system utilities, such as UNIX cpio or tar commands, VMS
EXCHANGE, and Windows NT ocopy73.exe or ocopy80.exe
• Third-party utilities that provide a user-friendly layer over operating system
backups
Parallel Backup and Recovery
Parallel Backup With parallel operations, backups can be performed
simultaneously from any node of a parallel server.
• Online backups enable the database to be backed up while active, allowing users
continuous access.
• Offline backups enable the database to be backed up while shut down, preventing
user access.
Parallel Recovery The goal of parallel recovery is to employ I/O parallelism to
reduce the elapsed time required to perform crash recovery, instance recovery, or
media failure recovery. The server uses one process to read files sequentially and
dispatch redo information to several recovery processes to apply the changes from the
log files to the data files.
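The one-reader, many-appliers pattern described above can be modeled in a few lines. This is a simplified sketch, not Oracle's implementation: redo records are modeled as hypothetical (file_id, block_id, value) tuples, and routing each file to a fixed worker is an assumption made so that changes to any one block are applied in log order.

```python
import queue
import threading


def parallel_apply(redo_records, num_workers=4):
    """Apply redo records using one sequential reader and several workers.

    redo_records: iterable of (file_id, block_id, value) tuples in log order.
    Returns a dict mapping (file_id, block_id) to the last value applied.
    """
    queues = [queue.Queue() for _ in range(num_workers)]
    data_files = {}
    lock = threading.Lock()

    def recovery_worker(work_queue):
        while True:
            record = work_queue.get()
            if record is None:  # sentinel: no more redo for this worker
                break
            file_id, block_id, value = record
            with lock:
                data_files[(file_id, block_id)] = value

    workers = [threading.Thread(target=recovery_worker, args=(q,)) for q in queues]
    for worker in workers:
        worker.start()

    # Single reader: scan the log sequentially and dispatch by file.
    for record in redo_records:
        queues[record[0] % num_workers].put(record)
    for q in queues:
        q.put(None)
    for worker in workers:
        worker.join()
    return data_files


log = [(1, 10, "a"), (2, 10, "b"), (1, 10, "c"), (3, 5, "d")]
recovered = parallel_apply(log)
```

Because every record for a given file goes to the same worker queue, the later change to block 10 of file 1 is always applied after the earlier one.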
System Failures
• Process
• Database instance
• Media
• Natural disaster
Failures are costly
System Failures
Obviously, system failure can be very costly in a warehouse environment. The causes
fall into four categories:
Process Failure Strict and rigorous testing of your plans should prevent this
situation occurring on a regular basis; however, you cannot afford to ignore the fact
that it may happen. Identify an approach to monitoring processes and detecting errors
and a mechanism for reapplying the failed processes.
Database Instance Failure Instance failure occurs when the Oracle SGA and
background processes cannot work. Failure is typically caused by:
• Hardware problems such as power failure
• Software problems such as an operating system crash (hanging)
In an instance failure, data in buffers not yet written to disk will be lost.
Media Failure Media (disk) failure occurs when errors are detected writing or
reading data from disk. It is often caused by disk head crash and affects different types
of file such as data files, redo logs, and control files.
A media failure typically means that the affected files must be restored from a
backup and then recovered.
Natural Disasters Natural occurrences such as flood and fire may result in the
system becoming unusable.
Disaster Recovery Requirements
• Replacement or standby machine
• Tape and disk capacity
• Communication links to users and data
• Copies of software
• Database backup
• Administration and operations staff
• Documentation
Disaster Recovery Planning
• Establish the strategy
• Prepare the strategy
• Maintain the strategy
• Audit the strategy
• Test recovery plan regularly
• Gain approval from users
Disaster Recovery
Protecting your investment is of the highest consideration. A disaster occurs when a
major site loss takes place; usually the site has been destroyed or damaged beyond
immediate repair.
Requirements
Recovering from disaster requires the following facilities:
• A replacement, or standby, machine
It does not have to be as large as the main machine but must have sufficient
capacity to run a minimal system and the power to allow the recovery to take place
on a meaningful timescale.
• Sufficient tape and disk capacity to perform the recovery on a reasonable timescale
Having sufficient disk space to run the minimum independent system is not always
enough. You may need extra disk capacity to allow initial recovery to happen on a
reasonable timescale.
• Communication links to and from users and data owners
• Communication links to and from data sources
If the system is to be accessible to users, the communication links they need to
access the machine must be in place. The links must have sufficient bandwidth and
capacity. This is particularly important if the links are already in use by other
systems. There is no point in putting a disaster system in place if the users cannot
use it.
• Copies of all relevant pieces of software and licensing agreements
• Backup of database
• Application-knowledgeable systems administration and operations staff, along
with current documentation in written or electronic format
Planning
You should thoroughly test the disaster recovery plan on a regular basis, say every six
months. New versions of systems, software, and data are constantly being added and
the frequency of the test must take into account these ongoing changes. The strategy is
normally audited: you need someone to establish, prepare, and maintain the strategy.
The plans must be approved by the business and information systems users.
Archiving Data
• Determine data life expectancy
• Identify archive frequency
• Use read-only tablespaces
• Plan and design into early specifications
Purging Data
• Reduce data volumes:
– Create summaries.
– Remove unwanted base data.
• Choose the most effective method.
Archiving Data
The warehouse design needs to estimate and accommodate the data life expectancy.
Establish how long you want to hold data before removing it completely from the live
database. You may be required to archive old data to tape, or to another database. In
small and medium warehouse databases, the amount of data involved is generally
small; in larger databases the data volumes involved may be significant.
Read-only Tablespaces Data warehouse databases, due to their size, require a
backup method that is as fast as possible and reduces the amount of data to be backed
up. You should use partitioned read-only tablespaces that enable you to archive the
tablespace while it is in read-only mode.
• You do not have to back up a read-only tablespace after making the first backup.
• Read-only tablespaces reduce the cost of archive storage. They can be stored on
less expensive media such as a CD-ROM. Ensure that the device to which you are
writing can be accessed quickly.
• As part of your archive strategy, you can use read-only tablespaces to hold
infrequently accessed data.
Data archiving can impose an ongoing heavy load on the system; if you do not plan for
this in the design and implementation, it can have a detrimental effect on performance.
Purging Data
You may be able to reduce the amount of data held by summarizing and aggregating
older data. For example, you may be able to summarize data into monthly and weekly
summaries at the end of each month, and then remove the detail fact data. This data
should be stored offsite in case it is needed to re-create the summary files.
When you remove data, always choose the most cost-effective method in terms of
CPU and database resources. For example, in the case of Oracle, use the DROP TABLE
command (if the table is partitioned) rather than the DELETE command to remove the
unwanted rows. Unlike DELETE, the DROP command does not create rollback and redo
information for each row that it removes.
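The summarize-then-purge sequence can be sketched end to end with Python's built-in sqlite3 module. SQLite stands in for Oracle here purely so that the example runs anywhere; the table names are hypothetical, and holding one month of detail per table is an assumption used to imitate a partitioned design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One detail table per month imitates a partitioned fact table (assumption).
conn.execute(
    "CREATE TABLE sales_1999_01 (sale_date TEXT, product_key INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_1999_01 VALUES (?, ?, ?)",
    [("1999-01-05", 1, 5.0), ("1999-01-20", 1, 7.5), ("1999-01-21", 2, 10.0)],
)

# Step 1: summarize the detail rows into a monthly summary table.
conn.execute(
    "CREATE TABLE sales_monthly_summary "
    "(month TEXT, product_key INTEGER, total_amount REAL)"
)
conn.execute(
    "INSERT INTO sales_monthly_summary "
    "SELECT substr(sale_date, 1, 7), product_key, SUM(amount) "
    "FROM sales_1999_01 "
    "GROUP BY substr(sale_date, 1, 7), product_key"
)

# Step 2 (not shown): archive the detail rows offsite before removing them.

# Step 3: remove the whole month with DROP TABLE instead of row-by-row DELETE.
conn.execute("DROP TABLE sales_1999_01")
conn.commit()

summary = conn.execute(
    "SELECT month, product_key, total_amount "
    "FROM sales_monthly_summary ORDER BY product_key"
).fetchall()
remaining_tables = [
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
```

After the purge only the summary table remains, so the archived detail is the only source from which the summary could be rebuilt.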
Improving Query Efficiency
• Improve database design
• Use indexes
• Use governors
• Use prepared and tested queries
• Run large jobs out of hours
• Oracle 8i Resource Manager can guarantee resource availability to specific groups
• Use data marts
Network Performance
• Provide sufficient bandwidth
• Provide optimal configuration for access
• Identify middleware requirements
• Know refresh volumes
• Consider interaction with job scheduling software
• Use client-side processing
• Deploy data marts
• Analyze traffic
Identifying Data Warehouse Performance Issues
Improving Query Efficiency
The basic design has many implications on performance. A poor design is never going
to provide efficient access; consider redesigning the database.
• Ensure that indexes exist on key values to minimize full-table scans.
• Write SELECT statements that retrieve only the minimum amount of data
required.
• Administer resource governors—query blocking—on the server and with the tools
where they have governing capabilities.
• Make available the use of prepared and pretested queries.
• Submit large jobs out of working hours, or when CPU usage and network and I/O
contention are minimum.
• Oracle 8i Resource Manager can guarantee resource availability to specific groups.
In addition to the above considerations, you may also consider using a data mart
strategy to offload query actions to a smaller subset of the warehouse data.
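The effect of an index on a key value can be observed by comparing query plans before and after the index is created. The sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse database, which is an assumption made only so that the example is runnable; the table, column, and index names are hypothetical, and Oracle reports its plans in a different format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (product_key INTEGER, store_key INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(i % 100, i % 7, float(i)) for i in range(1000)],
)


def query_plan(sql):
    """Return the plan text that SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # the fourth column holds the detail


# Select only the column needed, restricted on a key value.
query = "SELECT amount FROM sales_fact WHERE product_key = 42"

plan_before = query_plan(query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_key)")
plan_after = query_plan(query)   # the plan now names the index
```

Checking plans in this way, before and after a design change, confirms whether an index is actually being used rather than assuming it is.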
Network Performance
The data warehouse environment is commonly distributed (a data warehouse feeding
data marts), using networks to provide data transfer mechanisms. The network must be
planned and set up to meet data movement and access requirements. Users should not
have restricted access to data. You need to:
• Ensure that the network has an appropriate bandwidth particularly for load
processing.
• Ensure the configuration of the environment is optimal for user access to data.
• Identify whether any middleware is needed to convert data or read non-Oracle
data.
• Identify update frequencies and ensure the network is capable of handling the
volumes.
• Consider how the job scheduling software interacts with the network setup.
• Use tools that perform intensive processing activities (such as summarizing and
sorting) on the client side, or the server itself may perform these activities.
• Deploy data marts at remote locations.
Analyzing Network Traffic You should consider using tools to analyze current
activity and aid in the preliminary planning of the requirements.
Review and Revise
Monitor the warehouse:
• Usage
• Access
• Accurate grain
• Detail data
• Periodicity
Review and Revise
Once the data warehouse is in use, you should monitor it and determine the data that is
being accessed, and the frequency of that access.
You should also use this information to determine whether the grain of the data is right
for the user requirements. Often data may have to be stored at different levels of
granularity to answer sophisticated user queries. This is referred to as multiple
granularity. If a user often requests simple annual sales figures for a given product, this
may be satisfied with a summary table. If the user requests sales figures for a product
by month, then you can provide the same information from 12 time-series tables. Of
course, this involves extra processing. You need to determine early on the levels of
granularity, and how long they are to remain in place in the warehouse.
You should balance the issues against your requirements and resources:
• How often is detail data access required? This determines the real need for details
and their duration.
• What are the benefits of keeping detail for a specified period?
• Do the benefits outweigh the cost in machine resources?
These questions, and others, can be answered in part with stringent query monitoring
to give you usage information. Use this to calculate benefits against costs.
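Usage information from query monitoring can be reduced to a simple share of queries per level of granularity. The sketch below assumes a hypothetical log of (user, grain) pairs; a real monitoring tool would supply its own format, and the names used here are illustrative only.

```python
from collections import Counter


def access_share_by_grain(query_log):
    """Return the fraction of logged queries that touched each grain."""
    counts = Counter(grain for _user, grain in query_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {grain: count / total for grain, count in counts.items()}


# Hypothetical monitoring output: who queried which level of granularity.
log = [
    ("analyst1", "monthly_summary"),
    ("analyst2", "monthly_summary"),
    ("analyst1", "detail"),
    ("manager1", "annual_summary"),
]
share = access_share_by_grain(log)
```

A low share for detail data over a sustained period is one input to the cost and benefit questions above; it does not settle them on its own.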
Secret of Success
Think big, start small
Secret of Success
Your eventual goal may be the enterprisewide solution, but take small steps to achieve
it. The enterprisewide warehouse is not a realistic objective for your first pass. Always
use the proven low-risk incremental approach.
Summary
The successful warehouse:
• Is driven by the business
• Focuses on objectives
• Adds value to the business
• Can be understood and used
• Delivers good data
• Performs well
• Belongs to the users
Summary
Success is achieved if the data warehouse:
• Is driven by a business community with clearly identified requirements.
Remember that this is the primary objective of the data warehouse, and the users
must be responsible for driving the end result.
• Focuses on the objectives outlined in the early stages of development.
• Adds value to the decision making process, and can be seen to provide value with
better and proven results. It is important that you define the measurement of the
success of the warehouse. Without any measures, you cannot determine whether
the warehouse has added value.
• Can be understood by the business community. The data in the warehouse must be
understood to ensure that the users are capable of using it to full effect. The data
must also mean the same to all users. For example, an algorithm that provides a
statistic must be documented in a way that every user can understand.
• Is used by the business community because the value it delivers is tangible. If the
data warehouse does not deliver quality information with integrity that adds value
to the business, then it will not be used.
• Performs as defined by the users in any agreements outlined early in development.
• Belongs to the users and not the IT department.
A
................................
Practice Solutions
Appendix A: Practice Solutions
Practice 2-1
Answer the following questions.
1 OLTP databases hold up-to-the-minute information and are most commonly
designed as read-only databases.
True
False
The correct answer is False because OLTP databases are not read-only databases.
2 In the scenario below, state whether it refers to an operational system or an
analytical processing system.
“Show me how a specific brand of printer is selling throughout different parts of
the United States and how this specific brand of printer is selling since it was first
introduced into my stores.”
This scenario refers to:
a An operational system
b An analytical processing system
The correct answer is B because comparing sales between the different territories
within the United States can provide a certain type of analytical information.
3 Who is the target audience for the data warehouse?
a The business community in the organization
b IT professionals
c Data-entry clerks
d None of the above
e All of the above
The correct answer is A because the main reason for having a data warehouse is to
aid the business community in making better decisions.
4 Are the following statements true or false?
a Operational systems display the following qualities:
Good performance: T
Static data contents: F
High availability: T
Unpredictable CPU use: F
b Identify the reasons why business analysis is not easy with operational
systems.
Data is not structured for drill-down capability: T
The system is not designed for querying: F
Data analysis can be CPU-intensive: T
Data is not integrated between systems: T
5 In groups of three or four, discuss the questions below and present your points to
the class at the end of the discussion.
a List some of the reasons that your company is considering implementing a data
warehouse or data mart.
b What are some of the business problems that your company is trying to
answer?
c Why is the business community in your organization unable to find the
answers to their business questions based on the existing information systems?
General Answers
Why data warehousing? According to Aaron Zornes of the Meta Group, “IT
organizations are under tremendous pressure to provide better quality decision-making
information in forms easy to access and manipulate. Business users are
reacting to their own mission-critical needs for better information due to rapidly
changing, increasingly volatile and competitive markets, as well as ever-shortening
product life cycles.” Enterprises must become more competitive and get closer to
their customers to survive. Some of the reasons why existing information
systems are unable to provide the answers to business questions are:
– Much of the enterprise data is locked up in data “jailhouses.”
– Operational systems are unable to provide a consolidated view of data.
– Answering some of the business questions requires analyzing data patterns and
trends over time. This often requires large volumes of historical data.
Operational systems do not keep historical data. Therefore, this type of
analysis cannot be done in an operational system.
Practice 3-1
1 Indicate which attributes belong to a data warehouse. Indicate whether the
statements are true or false.
a Data is organized by time.
True. Data exists in the data warehouse specifically for analysis by time.
b Data is always stored in a relational database.
False. It is not imperative that the data be stored in a relational database,
although it is more common.
c Data relates to business-specific areas.
True. The data warehouse may be enterprisewide, but the way the data is organized
within the database is by departmental need, subject need, and functional need.
d Data is sometimes integrated.
False. Data must always be cleaned and integrated into the warehouse.
e Data is replaced according to a refresh cycle.
False. Data is added to and not replaced.
f Data warehouses may contain any type of data.
True. If the database server supports any type of data, then the warehouse is
capable of holding any type of data.
2 _______ is a set of rules or structures providing a framework for the overall design
of a system or product.
a Technical infrastructure
b Data-access environment
c Architecture
The correct answer is C.
3 The ________ is closely related to the architecture and consists of the
technologies, platforms, databases, gateways, and other components necessary to
make the architecture functional within the corporation.
a Data-access environment
b Technical infrastructure
c Data warehouse
The correct answer is B.
4 A telco company needs to understand their network traffic to better pinpoint
frequent trouble spots and predict network expansion and usage. Storing call detail
records and summarizing them by switch and trunk groups among other things in
another environment will satisfy this need.
Which of the following are you going to design?
a Operational data store (ODS)
b Data warehouse
The correct answer is B because monitoring over a period of time is required.
5 An online bookstore has customers in their Sales Order System and in their
Marketing System. These customers do not match between systems, because
Marketing staff do not always update the Marketing System with current and
complete customer data. The need here is for an integrated system that contains
current customer data.
Which of the following are you going to design?
a Operational data store (ODS)
b Data warehouse
The correct answer is A because the organization needs current and integrated
customer data.
6 Below are some of the benefits of data warehousing.
a Business decisions:
– Improves the decision-making process
– Provides basis for strategic planning efforts
– Improves business decisions (quality and quantity)
– Improves sales metrics
– Improves trend visibility
– Improves cost analysis
– Improves inventory and distribution channel management
– Improves monitoring of business initiatives
b Data access:
– Improves data availability and timeliness
– Improves data quality
– Improves data integration
– Improves access to historical information
– Provides easier data access
– Allows high-performance data mining
– Allows access to data not previously available
– Improves data availability for customers
c Costs:
– Reduces staff
– Identifies lost revenue
– Optimizes space utilization
– Reduces inventory
– Reduces inventory replenish time
d Productivity:
– Provides access to data without programmer intervention
– Facilitates elimination of legacy system
– Reduces analysis efforts
– Reduces impact on operational systems
– Reduces manual analysis and data consolidation efforts
Practice 4-1
Interview Questions
Ask the key persons the following questions.
Possible responses from each of the candidates are shown below.
Role 2: CFO
1 What is the business vision?
– We are the market leader with a long tradition of dealing with drinks and
beverages.
– We have survived by having a strong and focused management.
2 Why does the company need an enterprise data warehouse?
The board thinks that it is required to help maintain our competitive edge and
market leading position.
3 What do you expect the data warehouse to provide, or what will you get out of the
warehouse?
Directly nothing because our financial systems are fine, but it should keep the IT
Director happy.
4 How soon do you need to have data loaded into the data warehouse and how up-to-date
does the data need to be?
If we are to do this properly, we will need all the information in the warehouse
up-to-date all the time.
Role 3: COO
1 What is the business vision?
– Need to reengineer our core processes to maintain our market position.
– Overall goal is to give my group better control over the business.
2 Why does the company need an enterprise data warehouse?
To integrate the information from our disparate legacy systems and new systems as
they come online—this should allow us to quickly analyze any of the information
we hold.
3 What do you expect the data warehouse to provide or what will you get out of the
warehouse?
– Detailed customer information, such as who buys our products and where our
products go, and tracking information in case there is a need to recall
things.
– Let us see demographics of beverage types from around the world.
– Allow us to perform “what-if” analysis.
4 How soon do you need to have data loaded into the data warehouse and how up-to-
date does the data need to be?
– Daily for our top 50 customers and a weekly update for the rest.
– We would probably also want to resegment our customers based on new
transactions, for example, once per month.
Role 4: IT Director
1 What is the business vision?
To support the mission statement, we need bigger and better systems to enable us
to become more competitive.
2 Why does the company need an enterprise data warehouse?
In the new and modern business world you need a warehouse. Our competitors
have one and we must have one in order to compete with them.
3 What do you expect the data warehouse to provide, or what will you get out of the
warehouse?
– Better information
– Better control of new products
– Take our disparate systems and help integrate them, which will bring real
business benefit and control
4 How soon do you need to have data loaded into the data warehouse, and how up-to-date
does the data need to be?
We will have daily loads for our top 50 customers, with a weekly catch-up for the
rest.
Class Discussion
1 Identify the major challenges for a data warehousing implementation project, as
shown in this exercise.
2 Give your suggestions on how to overcome these challenges.
3 If you apply the Oracle Data Warehouse Method to this project, how would you
apply it, and where do you see the benefits of using this method?
General Answers
This exercise has been designed to get you thinking about some of the many issues that
face any DSS implementation, regardless of size or complexity. The following sections
outline some of the issues.
Political Issues
• Conflict between different parts of the business. In many businesses, very high
barriers have been constructed between departments; thus the DSS can be
considered to be a threat, because it will remove these barriers.
• Resistance to free and open information.
• General resistance to change. DSS implementations by their nature invoke an
emotional reaction to change and so change management should be considered
carefully. Avoid making statements such as “the system will help you make better
decisions” because statements like this are emotionally charged.
• IT will tend to control the project. IT will see the problem as technical architecture
and will therefore seek to own it.
• The business may see it as an IT project. This follows on from the last point. The
business has to step up to the project. There are difficult decisions to make, such as
what data to place in the system, how that data is defined, how long to
keep it, and how to represent it. These are decisions that the business must take,
not IT.
Approach Issues
The approach to the project will have a significant impact on the
overall success of the project. Some of the issues typically associated with a “bottom-up”
approach include:
• The data warehouse may end up as a complex repository for operational data
rather than one that can support the business decision making required. If this is
the case, the business will inevitably lose faith in the system.
• The business will eventually lose faith in the data warehouse, and so it will become
another piece of legacy.
• Failure to address data quality. A “bottom-up” approach led by IT will typically
avoid tough issues such as data quality, because IT typically lacks the influence to
solve the problem. The solution lies with the business and not IT.
• Over- or underengineering of the solution will result, because it is difficult to hit a
target when you don’t know what it looks like, especially if it is a moving target.
• If the solution is seen by the business as technology rather than as a business
solution, they are unlikely to invest time and effort in it. We know that if the
business linkage is not present, the solution is unlikely to succeed.
Sponsorship Issues
• Sponsorship is critical to project success.
• Sponsorship must be effective—it is all well and good to have senior business
sponsorship in the project, but this must be effective and active sponsorship, that
is, involvement must be more than just attending regular meetings.
• The key sponsorship chain is linked to business rather than IT. This is largely
because many of the more difficult, softer issues revolve around the business and
therefore need a business pull rather than IT push to resolve.
• Communication to all stakeholders within the business is critical. The aims and
aspirations for the project should be communicated as well as the progress of the
project to assist in overcoming eventual resistance to change.
Business Vision Issues
The business must be clear about a number of factors:
• How the warehouse will add value to the business
• Why the warehouse will result in business change
• How business change will impact on the warehouse
You may have noticed that the above issues constitute a circular argument, which is
important for everyone concerned to understand fully. If the warehouse is not going to
change your business, why build one?
General Information Issues
Were the right questions asked and were honest
answers always given?
• You will need to ask different questions of different parts of the business.
• You may not get the answers you need, because of a number of organizational and
technical issues.
Because much of the information we need is both tacit and politically sensitive, you
should not be afraid to ask follow-up questions.
Practice 5-1
1 There are no standard solutions to item 1, as the answers are subjective and
unique to each student.
2 Similarly, there are no standard solutions to item 2, as the answers are subjective
and unique to each student. The expectation is that students will utilize every strategy
deliverable listed in the table, as each deliverable is considered essential for a
successful warehouse implementation.
3 See answer to item 2.
Practice 6-1
1 Complete the user profile column in this exercise with one of the following user
types:
– Executive
– Casual user or manager
– Business analyst or power user
Name: Brian O’Reilly
Access Needs:
• Needs to develop simple forecasts, such as budgets
• Ease of use is important
Technology:
• Microsoft Office
• Internet browser
• Spreadsheets
• Email
User Profile: Casual user or manager
Name: Mary Ramos
Access Needs:
• One-click access
• Only needs highly summarized information
• Ease of use is very important
Technology:
• Email
• Microsoft Office
• Internet browser
User Profile: Executive
Name: Kim Seng
Access Needs:
• Constantly wants to “get more data”
• Understands the organization’s business processes
Technology:
• Spreadsheet
• Oracle Reports
• Oracle Discoverer
• Oracle Express Analyzer
User Profile: Business analyst or power user
Name: Amber Salinas
Access Needs:
• Lots of drilling
• Customizes graphical user interface (GUI)
• Needs to know data structures
Technology:
• Extensive SQL programming
• Oracle7X, Oracle8X Server
• Oracle Express
User Profile: Business analyst or power user
2 Answer true or false to the following questions.
a Do not involve users in the early process of the data warehouse
implementation because they are going to delay your delivery date.
False
b Choose the warehouse data access tools by involving only IT
staff because they are the ones who know what the users need.
False
c Prototype access methods with prospective users.
True
3 Security Consideration exercise: There are no standard solutions to this
question.
Practice 7-1
1 Identify whether the following statements are true or false.
a The business model is a logical representation of selected business processes.
True
b The star model is normalized.
False
c The snowflake model is denormalized.
False
d All warehouses must have a time dimension.
True
e In a warehouse environment, data loading performance is less important than
query performance.
True
2 Complete these sentences.
a Access to data in a _________ table is faster than calculating aggregates at the
time of query execution.
The correct answer is summary.
b The data warehouse model contains ____ tables that comprise the measures of
the business.
The correct answer is fact.
c Dimensions are denormalized in a _______ model.
The correct answer is star.
d A common guideline is to define granularity at one level ________ than
currently used by end users.
The correct answer is lower.
3 There are no standard solutions to item 3, as the answers are subjective and
unique to each student.
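The answers to items 2a and 2b can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than an Oracle server, and the table names (sales_fact, product_dim, sales_summary) and sample rows are hypothetical. The fact table holds the measures, the dimension table describes them, and the summary table stores aggregates computed once at load time so that queries do not have to recompute them.

```python
import sqlite3

def build_demo():
    """Create a tiny star-style schema: one fact table, one dimension, one summary."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, brand TEXT);
        CREATE TABLE sales_fact (sale_month TEXT, product_id INTEGER, amount REAL);
        INSERT INTO product_dim VALUES (1, 'Alpha'), (2, 'Beta');
        INSERT INTO sales_fact VALUES
            ('1999-01', 1, 10.0), ('1999-01', 1, 5.0),
            ('1999-01', 2, 7.0),  ('1999-02', 1, 3.0);
        -- Summary table: the aggregation is done once, when the data is loaded.
        CREATE TABLE sales_summary AS
            SELECT f.sale_month AS sale_month, d.brand AS brand,
                   SUM(f.amount) AS total_amount
            FROM sales_fact f JOIN product_dim d USING (product_id)
            GROUP BY f.sale_month, d.brand;
    """)
    return conn

def total_from_facts(conn, month, brand):
    """Aggregate at query time: join the fact table to the dimension and sum."""
    row = conn.execute(
        "SELECT SUM(f.amount) FROM sales_fact f "
        "JOIN product_dim d USING (product_id) "
        "WHERE f.sale_month = ? AND d.brand = ?",
        (month, brand),
    ).fetchone()
    return row[0]  # None when no fact rows match

def total_from_summary(conn, month, brand):
    """Read the precomputed aggregate: a single-row lookup, no join or sum."""
    row = conn.execute(
        "SELECT total_amount FROM sales_summary WHERE sale_month = ? AND brand = ?",
        (month, brand),
    ).fetchone()
    return row[0] if row else None
```

Both functions return the same figure. The difference is where the work is done, which is why a summary table tends to answer faster on large fact tables.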
Practice 8-1
1 Form into small groups, and consider each of the following hardware
architectures. With your books closed, create a short definition for each
architecture. Each answer should include the benefits and limitations of each
architecture.
a Symmetric multiprocessing (SMP): Definition, benefits, and limitations.
Please refer to pages 8-12 to 8-13.
b Non-uniform memory access (NUMA): Definition, benefits, and limitations. Please
refer to pages 8-14 to 8-15.
c Clusters: Definition, benefits, and limitations. Please refer to pages 8-16 to
8-17.
d Massively parallel processing (MPP): Definition, benefits, and limitations.
Please refer to pages 8-18 to 8-21.
2 Staying in your small group, discuss each of the following questions.
a What is parallelism?
It is the ability to perform functions in parallel.
b Why is it important to the data warehouse?
The appeal of parallel processing is especially strong for the data warehousing
environment because of its emphasis on interactive processing of complex
queries. Given this characteristic as well as the often extreme size of a
warehouse, methods are clearly needed for more rapid query execution. By
partitioning data among a set of processors, complex queries can be executed
in parallel. This will potentially achieve linear speedup and thus significantly
improve query response times.
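As a minimal sketch of the idea in answer 2b, the following Python fragment partitions a set of rows among several workers, lets each worker aggregate its own partition, and then combines the partial results. The function names and the round-robin split are assumptions made for illustration. Threads are used only to keep the example short; a database server would spread the partitions over processors or nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows round-robin into n_parts partitions."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_sum(part):
    """Work done by one worker: aggregate only its own partition."""
    return sum(amount for _, amount in part)

def parallel_total(rows, n_workers=4):
    """Run the partial aggregations concurrently, then combine the results."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, parts))
```

Because each partition is processed independently, adding workers can shorten the elapsed time, up to the point where combining results or contention for shared resources dominates.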
Practice 9-1
1 For the following description, state the type of partitioning method it best
describes. The partitioning methods are range partitioning, hash partitioning, and
composite partitioning.
Description: Places specific ranges of table entries on different disks. For
example, records having “name” as a key may have names
beginning with A-B in one partition, C-D in the next, and so
on. Likewise, a DSS managing monthly operations might
partition each month onto a different set of disks.
Partitioning Method: Range
Description: Distributes DBMS data evenly across the set of disk
spindles. This partitioning method is applied to one or more
database keys, and the records are distributed across disk
subsystems accordingly.
Partitioning Method: Hash
Description: The drawback of this partitioning method is that the quantity
of data may vary significantly from one partition to another
and the frequency of data access may vary as well. For
example, as the data accumulates, it may turn out that a
larger number of customer names fall into the M-N range
than the A-B range.
Partitioning Method: Range
Description: This partitioning method is a combination of two partitioning
methods. A table that is partitioned using this method is
initially partitioned by range, and then subpartitioned using
the hash method.
Partitioning Method: Composite
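The three methods can be sketched in a few lines of Python. This is an illustration of the placement logic only, not of how a database server implements it. The function names are hypothetical, names are assumed to start with an ASCII letter, and CRC-32 stands in for whatever hash function a real server uses.

```python
import zlib

def range_partition(name):
    """Range partitioning on the first letter: A-B -> 0, C-D -> 1, ..., Y-Z -> 12."""
    first = name[0].upper()
    return (ord(first) - ord("A")) // 2

def hash_partition(key, n_partitions):
    """Hash partitioning: a deterministic hash of the key picks the partition."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions

def composite_partition(month, customer, n_subpartitions):
    """Composite partitioning: range on month first, then hash on customer."""
    return (month, hash_partition(customer, n_subpartitions))
```

With range_partition, every name beginning with M or N lands in partition 6, so a skewed distribution of names produces partitions of unequal size. With hash_partition, placement does not follow the key order, which tends to spread rows more evenly.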
2 For each of the following descriptions, state the type of indexing method it best
describes. The indexing methods are B-tree, bitmap, and index-organized tables.
Description: Contains a hierarchy of highest-level and succeeding lower-level index
blocks. The upper-level blocks are called branch blocks and they point to the
lower-level blocks. The leaf blocks are the lower-level blocks and they contain the
unique ROWID that points at the location of the actual row.
Indexing Method: B-tree
Description: This indexing method will benefit queries in which the WHERE clause
contains multiple predicates on low-cardinality columns.
Indexing Method: Bitmap
Description: Each row has a bit for each key. Each key value has a bit for each row.
Table Row ID   Male   Female
0001           1      0
0002           0      1
0003           0      1
0004           1      0
Indexing Method: Bitmap
Description: This method merges table data and index data into one
structure. Thus, the data is the index and the index is the data.
Indexing Method: Index-organized table
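The bitmap example above can be reproduced with a short Python sketch. The gender column matches rows 0001 to 0004 in the table; the region column and the helper names are hypothetical additions used to show why multiple predicates on low-cardinality columns suit this method.

```python
def build_bitmap_index(values):
    """Return {key value: [one bit per row]} for a single column."""
    index = {}
    for row_no, value in enumerate(values):
        bits = index.setdefault(value, [0] * len(values))
        bits[row_no] = 1
    return index

def bitmap_and(a, b):
    """Combine two predicates by ANDing their bitmaps position by position."""
    return [x & y for x, y in zip(a, b)]

gender = ["Male", "Female", "Female", "Male"]  # rows 0001-0004
region = ["East", "East", "West", "West"]      # hypothetical second column

gender_idx = build_bitmap_index(gender)
region_idx = build_bitmap_index(region)

# WHERE gender = 'Female' AND region = 'West'
matches = bitmap_and(gender_idx["Female"], region_idx["West"])
```

Each predicate is answered by reading one bitmap, and the AND of the two bitmaps marks the qualifying rows without visiting the table until the final rows are fetched.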
3 Form into small groups, and consider each of the following questions. For each
question, discuss in your groups and present your group’s answers to the class at
the end of the discussion.
a How does RAID-5 differ from RAID-1?
Both RAID-1 (mirroring) and RAID-5 are strategies that aim to prevent downtime due
to loss of a disk. RAID-5 in effect divides a file into chunks and places each on
a separate disk, whereas RAID-1 maintains a copy of the contents of a disk on another
disk, referred to as a mirrored disk. Writes to a mirrored disk may be a little slower
because more than one physical disk is involved, but reads should be faster
because of a choice of disks (and hence head positions) to seek to the required
location.
b How do I decide between RAID-5 and RAID-1?
RAID-1 is indicated for systems where complete redundancy of data is
considered essential and disk space is not an issue. RAID-1 may not be
practicable if disk space is not plentiful. On a system where uptime must be
maximized, Oracle recommends mirroring at least the control files, and
preferably also the redo log files.
RAID-5 is indicated in situations where avoiding downtime because of disk
problems is important, or when better read performance is needed and
mirroring is not in use.
c What variables can affect the performance of a RAID-5 device?
The major ones are access speed of constituent disks; capacity of internal and
external buses; number of buses; size of caches; number of caches; and nature
of the algorithms used for determining how reads and writes are done.
d What types of files are suitable for placement on RAID-5 devices?
Placement of data files on RAID-5 devices is likely to give the best performance
benefits, because these are usually accessed randomly. More benefit will be seen
in situations where reads predominate over writes. Rollback segments and redo
logs are accessed sequentially (usually for writes) and therefore are not suitable
candidates for being placed on a RAID-5 device. Also, data files belonging to
temporary tablespaces are not suitable for placement on a RAID-5 device.
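The difference between the two levels can be sketched in Python. The answers above do not describe parity; the sketch relies on the general fact that RAID-5 stores an XOR parity block alongside the data chunks of each stripe, which is what allows a lost chunk to be rebuilt. The function names are hypothetical, and for simplicity the parity block is always placed last, whereas a real RAID-5 device rotates it across the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def raid1_write(block):
    """Mirroring: the same block is written to two disks."""
    return [block, block]

def raid5_write_stripe(data_blocks):
    """Striping with parity: one chunk per disk, plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def raid5_rebuild(stripe, lost_index):
    """Recover the block on a failed disk by XORing the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Mirroring doubles the disk space needed for every block, while the parity stripe costs one extra disk per stripe. Each small write to the parity stripe must also update the parity block, which is consistent with the advice above to keep write-intensive, sequentially accessed files off RAID-5 devices.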
4 For each of the descriptions below, assign the RAID level that is RAID 0, RAID 1,
or RAID 5.
Description: This RAID level has the lowest cost and highest performance.
RAID Level: 0
Description: This RAID level is low cost and has high availability.
RAID Level: 5
Description: This RAID level has high performance and high availability.
RAID Level: 1
Practice 10-1
Please answer the following questions.
1 The acronym ETT stands for _________________________________________.
The correct answer is extraction, transformation, and transportation.
2 Name at least four potential sources of production data for the warehouse.
_____________________
_____________________
_____________________
_____________________
Correct answers include production operational systems; archives; internal files
not directly associated with company operational systems, such as individual
spreadsheets and workbooks; external data from outside the company.
3 Name at least five potential sources of external data for the warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
Correct answers include periodicals and reports; external syndicated data feeds;
competitive analysis information; newspapers; purchased marketing, competitive,
and customer related data; free data from the Web.
4 Identify whether the following statements are true or false.
a Archive data is never used in a data warehouse; it is too old.
False. Archive data is particularly useful for the first-time load, to
include historical data.
b External data is one of the easiest types of data to
incorporate into the warehouse.
False. External data is difficult to incorporate, as it varies in
frequency, grain, and predictability.
c It is impractical to eliminate data anomalies after the pilot
run.
True. Never leave data cleanup this late.
d Mapping data is a process whereby you eliminate data
inconsistencies.
False. Mapping identifies source data attributes, identifies where
they are to reside in the warehouse, and identifies what
transformations are needed.
e Gateways are great mechanisms for transferring large
volumes of data into the warehouse.
False. Gateways are only useful for smaller amounts of data.
f Extraction tools are expensive.
True.
g Transforming data occurs only in the staging area.
False. It may take place at other points, though the staging area is
most common.
Practice 11-1
1 Dirty data must be eliminated for the data warehouse. Name three alternative and
common terms used to describe the process of eliminating anomalies in data.
_____________________
_____________________
_____________________
The correct answer is cleaning, cleansing, and scrubbing.
2 Name at least five problems associated with source data that must be eliminated
for the data warehouse.
___________________________________________
___________________________________________
___________________________________________
___________________________________________
___________________________________________
The correct answer is multipart keys, multiple encoding, multiple local standards,
multiple files, missing values, element names, element meaning, input formats,
duplicate values, referential integrity, names and addresses.
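A minimal sketch of how a cleansing routine might handle three of the problems listed above (multiple encoding, missing values, and duplicate values). The record layout, the field names, and the GENDER_CODES map are hypothetical and are not taken from the course material.

```python
# Hypothetical encoding map: different source systems encode gender differently.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}

def cleanse(records):
    """Standardize encodings, mark missing values, and drop duplicate keys."""
    seen_ids = set()
    clean = []
    for rec in records:
        cust_id = rec.get("cust_id")
        if cust_id is None or cust_id in seen_ids:
            continue  # skip records with no key or with a key already loaded
        seen_ids.add(cust_id)
        raw_gender = str(rec.get("gender", "")).strip().lower()
        clean.append({
            "cust_id": cust_id,
            # missing or unmapped codes become an explicit UNKNOWN value
            "gender": GENDER_CODES.get(raw_gender, "UNKNOWN"),
            # collapse stray whitespace and apply one input format to names
            "name": " ".join(str(rec.get("name", "")).split()).title(),
        })
    return clean

source = [
    {"cust_id": 1, "gender": "male", "name": "  john   smith "},
    {"cust_id": 1, "gender": "M", "name": "John Smith"},   # duplicate key
    {"cust_id": 2, "gender": "2", "name": "ann lee"},
    {"cust_id": 3, "name": "bo chan"},                      # missing gender
]
cleaned = cleanse(source)
```

In this sketch the first record for a key wins; a real routine would need an explicit rule for choosing between duplicates.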
3 Identify whether the following statements are true or false.
Question: It is considered impractical to eliminate data anomalies after the
pilot run.
Answer: True. Never leave data cleanup this late.
Question: You need to consider adding time keys to warehouse data.
Answer: True. All records must contain a time element contained in a key
column.
Question: Data transformation occurs only in the staging area.
Answer: False. It may take place at other points, though the staging area is
most common.
Practice 12-1
1 Assemble into small groups of 3 or 4. Discuss and compare the factors that will
determine the load window where you work. Consider user requirements,
operational constraints, and staffing issues.
There is no single correct answer.
2 Identify whether the following statements are true or false.
Question
True
False
Transportation of data involves moving the data
into the data warehouse database.
X
Strictly, transportation involves moving and loading the
data.
The data refresh cycle is determined by information
technology groups.
X
The cycle is determined by users.
The load window is the time that the IT group has
dictated the data warehouse is available to the users
for access.
X
The load window is the time available to perform all ETT
tasks.
An example of high-level grain data is summarized
data.
Fact data frequently changes.
X
X
Fact data is frequently added to at every refresh.
Dimension data infrequently changes.
X
Dimension data changes but not as frequently as fact
data is refreshed.
SQL*Loader is the fastest way to move data into
the data warehouse database.
Gateways are useful for moving large amounts of
data into the warehouse.
X
X
Gateways are recommended only for small amounts of
data.
Question
Data for the data warehouse is always indexed after
it is loaded.
True
False
X
Indexing after the load is recommended, but it is not always done.
The quickest way to create unique indexes on
warehouse data is to leave database constraints
enabled on load.
X
The fastest way is to disable constraints and then enable
them after the data is loaded.
Summary tables are created on the warehouse
server.
Filtering removes unwanted records from staging
files.
X
X
Filtering extracts data from the warehouse into data
marts.
3 Name the two different types of data loading.
_____________________
_____________________
The correct answer is first time load and refresh.
4 Name four methods of moving data to the warehouse server.
_____________________
_____________________
_____________________
_____________________
The correct answer is that there are five listed ways, and you may choose a hybrid
of any of these.
– Wholesale data replacement
– Comparison of database instances
– Time and date stamping
– Database triggers
– Database log
5 What SQL command is used to create summary tables on the data warehouse
server?
The correct answer is CREATE TABLE AS SELECT (CTAS), or
CREATE TABLE AS SELECT... PARALLEL (pCTAS).
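As an illustration only, the sketch below builds a summary table with CREATE TABLE ... AS SELECT. It uses Python's built-in sqlite3 module as a stand-in for the warehouse server, so the PARALLEL clause of the Oracle server is not shown; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (product_id INTEGER, region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES (1, 'East', 100.0), (1, 'East', 50.0),
                                  (1, 'West', 70.0), (2, 'East', 30.0);
    -- CTAS: create and populate the summary table in one statement.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total_amount, COUNT(*) AS sale_count
        FROM sales_fact
        GROUP BY region;
""")
summary = conn.execute(
    "SELECT region, total_amount, sale_count FROM sales_by_region ORDER BY region"
).fetchall()
```

Queries against sales_by_region then read precalculated totals instead of scanning the detail rows.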
Practice 13-1
1 Identify whether the following statements are true or false.
Question
a The data refresh cycle is determined by information
technology groups.
The cycle is determined by users.
b Fact data frequently changes.
Fact data is frequently added to at every refresh.
c Dimension data infrequently changes.
Dimension data changes but not as frequently as fact
data is refreshed.
True
False
X
X
X
2 Name four different techniques for capturing the changes to operational data that is
to be loaded into the warehouse.
_____________________
_____________________
_____________________
_____________________
The correct answer is that there are five listed ways, and you may choose a hybrid
of any of these.
– Wholesale data replacement
– Comparison of database instances
– Time and date stamping
– Database triggers
– Database log
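A minimal sketch of one of the listed techniques, time and date stamping, assuming each operational row carries a last-modified stamp. The function name, row layout, and dates are hypothetical.

```python
from datetime import datetime

def changed_since(rows, last_refresh):
    """Return rows whose last_modified stamp is later than the previous refresh."""
    return [row for row in rows if row["last_modified"] > last_refresh]

operational_rows = [
    {"order_id": 1, "last_modified": datetime(1999, 5, 1, 9, 0)},
    {"order_id": 2, "last_modified": datetime(1999, 5, 3, 14, 30)},
    {"order_id": 3, "last_modified": datetime(1999, 5, 4, 8, 15)},
]
last_refresh = datetime(1999, 5, 2, 0, 0)

# Only rows stamped after the last refresh are extracted for the next load.
delta = changed_since(operational_rows, last_refresh)
```

This approach depends on the source system maintaining the stamp reliably, and on its own it cannot detect rows that were deleted.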
3 Answer the following questions about updating dimension data.
a What method of updating dimension data would you employ if you wanted to
keep old and new records?
The correct answer is keep history.
b What relationship would that map to in an entity relationship model?
The correct answer is a one to many.
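One way to picture the keep-history answer is sketched below: a change to a customer adds a new dimension row instead of overwriting the old one, so one customer maps to many dimension rows. The row layout, the generated row_key, and the start and end dates are assumptions made for this illustration.

```python
from datetime import date

def keep_history(dimension, natural_key, new_attrs, effective):
    """Close the current row for natural_key and append a new version."""
    for row in dimension:
        if row["customer_id"] == natural_key and row["end_date"] is None:
            row["end_date"] = effective      # the old record is kept, only closed
    dimension.append({
        "row_key": len(dimension) + 1,       # hypothetical generated key
        "customer_id": natural_key,
        "start_date": effective,
        "end_date": None,                    # None marks the current version
        **new_attrs,
    })

customer_dim = [{"row_key": 1, "customer_id": 42, "city": "Boston",
                 "start_date": date(1998, 1, 1), "end_date": None}]
keep_history(customer_dim, 42, {"city": "Denver"}, date(1999, 5, 1))

# One customer, many dimension rows.
versions = [r for r in customer_dim if r["customer_id"] == 42]
```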
4 What server technique can be used to prevent and allow access to data in the
warehouse after refresh?
The correct answer is the ROLES command.
Practice 14-1
1 Give one example of where metadata exists in an operational environment.
________________________________________________________
The correct answer is the database server data dictionary.
2 Why is metadata important to the following people?
a Users who are accessing the data warehouse
________________________________________________________
________________________________________________________
The correct answer is that it provides them with information about the data
they are accessing, and shows them the meaning of data, context, summary
levels, ownership and many more attributes.
b IT staff developing ETT routines
________________________________________________________
________________________________________________________
The correct answer is that it contains all source data information,
transformation routines, mapping, structure and meaning of data.
3 Name two techniques you might employ to create metadata.
________________________________________________________
________________________________________________________
The correct answer is that you may have chosen two from this list:
Data modeling tools, data dictionary, ETT tools, end user tools, COBOL
copybooks, middleware tools.
4 Name two roles within the data warehouse development team who have
responsibility for metadata.
________________________________________________________
________________________________________________________
The correct answer is metadata architect, metadata manager.
5 What is the issue with integration and metadata?
________________________________________________________
________________________________________________________
________________________________________________________
The correct answer is that many tools have their own metadata layers, which must
be integrated for the environment.
6 What is important about the context of data?
________________________________________________________
________________________________________________________
The correct answer is that it allows the historical perspective of data to be
constantly available.
7 Name the Oracle tool you can use to develop metadata.
________________________________________________________
The correct answer is Oracle Designer, Data Mart Suite, or OADW. Oracle
Warehouse Builder will also support metadata management.
Practice 15-1
1 In the following scenarios, choose the type of analysis that most accurately defines
the scenario. The types of analysis from which you may choose are:
– Query and reporting
– Multidimensional/OLAP
– Data mining
– Drill-down and pivot
– Calculations and derived data
– Spreadsheet
– Modeling, time-series and financial
– What if
Scenario and Type of Analysis
a. Show start date and salary grade for all employees reporting to
Clare Maury
Type of analysis: Query and reporting
b. Highlight all orders above $30,000.00
• Drill from product totals to individual orders
• Look at a copy of the invoice
Type of analysis: Drill-down and pivot
c. Show product sales in each region as a percentage of the total
sales in that region.
Type of analysis: Calculations and derived data
d. Did the $2 million promotion increase sales?
Type of analysis: Modeling, time-series and financial
e. How many people to hire, when to hire them, and where to
locate them.
Type of analysis: Modeling, time-series and financial
f. If we lowered prices, would our overall revenue increase?
Type of analysis: What-if
g. Find me the relationship between X and Y.
Type of analysis: Data mining
h. Show me all the products that are currently back-ordered.
Type of analysis: Query and reporting
i. What is the 13 week moving average of sales?
Type of analysis: Calculations and derived data
j. Projecting costs and allocating overhead based on head count,
sales forecasts, and consumer price index (CPI).
Type of analysis: Modeling, time-series and financial
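Scenario i is a typical derived calculation. A minimal sketch of a trailing 13 week moving average follows; the weekly sales figures are made up for the example.

```python
def moving_average(values, window=13):
    """Trailing moving average; one result per position that has a full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Made-up weekly sales figures: 100, 101, ..., 114 (15 weeks).
weekly_sales = [100 + week for week in range(15)]
averages = moving_average(weekly_sales)
```

With 15 weeks of data and a 13 week window, only the last three weeks have a full window, so the result holds three averages.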
2 For the following phrases and sentences, determine which category each of them
belongs to. You may choose from the following list.
• Data
• Information
• Knowledge
• Decision
Description: Mary lives in Belmont Shores, California.
Category: Information
Description: Point of sale (POS)
Category: Data
Description: AppleTree juice is bought 45% of the time that Crystal Geyser
juice is bought.
Category: Knowledge
Description: Let us promote Crystal Geyser juice on the East Coast of the
United States in stores.
Category: Decision
Description: Demographic
Category: Data
Description: Customers of the upper middle class will use 10% of their annual
income during the Christmas holiday season.
Category: Knowledge
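A statement such as the juice example can be derived from point-of-sale data by counting how often two products appear in the same basket. The sketch below is one simple way to do that; the function name and the basket contents are invented for illustration and do not reproduce the 45% figure.

```python
def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Invented point-of-sale baskets.
baskets = [
    {"Crystal Geyser", "AppleTree"},
    {"Crystal Geyser", "bread"},
    {"Crystal Geyser", "AppleTree", "milk"},
    {"Crystal Geyser"},
    {"AppleTree"},
]
rate = confidence(baskets, "Crystal Geyser", "AppleTree")
```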
3 The diagram below illustrates an example of data mining. The technique that it
uses is called _________________.
The correct answer is artificial neural network.
[Diagram labels: Age, Region, Loyal, Call Rate, Lost, Service]
4 The description below describes a data mining technique. What is the technique
used?
The correct answer is decision tree.
1. If the vehicle has a 2-door frame AND
2. If the vehicle has at least six cylinders AND
3. If the buyer is less than 40 years old AND
4. If the cost of the vehicle is > $35,000 AND
5. If the vehicle color is red, THEN
6. The buyer is likely to be male.
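The rule above corresponds to a single path through a decision tree, where every test must pass before the conclusion is reached. A sketch of that one path as code, with a hypothetical function name and parameter names:

```python
def likely_male_buyer(doors, cylinders, buyer_age, cost, color):
    """Apply the rule path quoted above; True only if every test passes."""
    return (
        doors == 2
        and cylinders >= 6
        and buyer_age < 40
        and cost > 35_000
        and color == "red"
    )

matches = likely_male_buyer(doors=2, cylinders=6, buyer_age=35, cost=40_000, color="red")
```

A full decision tree would contain many such paths, one for each leaf.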
Practice 16-1
Web-Based Tools Requirement Checklist
There are no standard solutions to this question.
Glossary
.....................................................................................................................................................
A
Access The process of accessing the data
warehouse database objects containing data
using tools that perform analysis, standard
queries, provide statistical information, and
mine data. See OLAP, Data Mining, Data
Access.
Additive Measurements in a fact table that
can be added across all the dimensions. See
Dimension.
Ad hoc One time only, casual, nonplanned
access to the database. See Access, Data
access.
Aggregated data Precalculated and prestored summary data that is held in tables in
the data warehouse. Aggregated data provides direct access to calculated data that
improves query performance. Functions used
to calculate aggregated data include SUM,
MAX, MIN, COUNT, and AVG. See Summary Tables.
Aggregated facts See Aggregated data,
Summary tables.
Application Program Interface A set of
calling conventions that allow application
programs to access computing services. APIs
present application developers with a published interface to computing services that
can be used with other facilities to provide a
single-system image across a heterogeneous
network of processors.
Atomic data The data at its lowest level of
detail that provides the base data for all data
transformations.
Attribute Any detail that serves to qualify,
identify, classify, quantify, or express the
state of an entity.
B
Backup and recovery strategy A storage
and recovery strategy that protects against
business information loss resulting from
hardware, software, or network faults.
BAP See Business Alliance Program.
Batch A computer environment that processes an action or user request without user
interaction. Some batch programs work in
the background, allowing simultaneous user
access.
Bitmap index A specialized form of index
indicating the existence or nonexistence of a
record by a series of ones and zeros. Prevalent with the Oracle7 and Oracle8 database
servers.
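To make the "series of ones and zeros" concrete, here is a minimal sketch of the idea behind a bitmap index. It is not how the Oracle server stores its indexes; the function names and the region column are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is 1 if row i holds that value."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps

def rows_matching(bitmap):
    """Decode a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Hypothetical low-cardinality column, one value per row.
region = ["East", "West", "East", "North", "West"]
index = build_bitmap_index(region)

# Combining conditions is a bitwise operation on the bitmaps.
east_or_north = rows_matching(index["East"] | index["North"])
```

Because conditions are combined with bitwise operations, this structure suits columns with few distinct values.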
Bitmapped interface See graphical user
interface.
Business An enterprise, commercial entity,
or firm in either the private or public sector,
concerned with providing products or services to satisfy customer requirements.
Business area The set of business processes
within the scope of a data warehouse project.
Business Alliance Program (BAP) An Oracle initiative that invites vendors to offer
products and services that are complementary to those offered by Oracle.
Atomic value A data value that cannot be
further decomposed.
Business metadata The information provided to users that allows them to understand
and access warehouse data. It focuses on
what data is in the warehouse, how it was
transformed, the source, and the timeliness
of the data. See User Metadata.
Business rule A rule under which an organization operates. Business rules are applied to
data using constraints.
C
C A third generation programming language.
C++ A third generation programming language.
Cache A temporary storage area in computer memory.
Cardinality The number of rows in a table.
See Table, Column, and Row.
CASE See Computer-aided systems engineering.
Checkpoint A database server event which
at a point in time writes all modified database buffers in the system global area to the
data files. The process controlling this action
is called the Database Writer (DBWR).
Cleaning See Cleansing.
Cleansing The process of transforming the
operational and external source data into a
defined and standardized format, using packaged software applications or programs,
prior to moving that data into the warehouse.
Also referred to as data cleaning, data
cleansing, or scrubbing. See Source data.
Client-server A technical architecture
that links many personal computers or workstations (clients) to one or more large processors (CPUs or servers). The architecture
enables the separation of local client processing from the server that manages the databases, access, and data integrity. The
architecture allows for optimal performance
at both the client and the server sides.
Cluster A means of sorting and storing
related data from different tables in the database, on cluster keys. Advantageous in an
environment where related data is commonly
queried together.
COBOL A third generation programming
language.
Column A means of implementing an item
of data within a table. See Table, Row,
Attribute.
Composite key A key in a database table
that is made up of a number of (column or
field) values.
Compound key See Composite key.
Computer-aided systems engineering
(CASE) The combination of graphical, dictionary, generator, project management, and
other software tools to help computer development staff engineer and maintain high-quality systems.
Concatenated key See Composite key.
Concatenated index An index that is created on a composite key. See Composite key.
Constellation model A warehouse model
that comprises a collection of star models.
See Star model, Snowflake model.
Constraint 1. The part of the WHERE clause
in an SQL SELECT statement that identifies
the column or field value that qualifies the
query. 2. Any external, management, or other
factor that restricts a business or a systems
development in terms of resources, availability, dependencies, timescales or some other
factor. See Business rule.
CORBA Common Object Request Broker
Architecture
Corporate data model A model of the
business needs and data requirements for an
online transaction processing system.
Cost based optimizer A statistical mechanism that analyzes where and how to retrieve
data from the Oracle7, Oracle8, and Oracle8i
servers to ensure fast access to data.
Cube A commonly used name for a
dimensional database where values can be
analyzed across a minimum of three dimensions.
D
DASD See Direct-access storage device.
Data access See Access.
Data acquisition The process of extracting,
transforming, and transporting data from the
source systems and external data sources to
the data warehouse database objects. The
term is synonymous with ETT, and is widely
used within Data Warehouse Method. See
ETT.
Data aggregation The process of redefining data into a summarization based on some
rules or criteria. See Aggregated data,
Aggregated facts, Summary tables.
Data Definition Language (DDL) SQL
statements that create, modify, and remove
database objects such as tables, indexes, and
users. Common DDL statements are CREATE, ALTER, and DROP. See DDL.
Data extract A subset of data extracted
from one environment and transported to
another environment. See Extract processing.
Data integrity The quality of the data residing in the database objects. Constraints on
the database tables enforce integrity rules.
Data Manipulation Language (DML)
SQL statements that query and amend the
database data. Common DML statements are
SELECT, INSERT, UPDATE, and DELETE.
See DML.
Data mart A data warehouse data class
organized for a business functional area or
department. The database contains data summarized at multiple levels of granularity and
may be designed using relational or multidimensional database structures.
Data migration tools Unspecified tools that allow data
to be moved from the various sources into
the data warehouse.
Data mining A technique that discovers
previously unknown patterns and relationships in data. Data mining queries may take a
long time to execute.
Data warehouse An enterprise-structured repository of subject-oriented, time-variant,
integrated, historical data used for
information retrieval. The very large data
warehouse database stores atomic and summary level data. The data warehouse provides the source data for data marts within
the enterprise.
Data Warehouse Method (DWM) A
structured method for full life-cycle custom
development data warehouse projects. It is
based on the Custom Development Method.
See Custom Development Method.
Database A collection of data, usually in the
form of tables or files, under the control of a
database management system. See Database
management system.
Database administrator A person within
the information technology (or information
systems) organization who is responsible for
administering, monitoring, and maintaining
the database.
Database management system The component of a database that controls all user and
system activities related to the core functions
of the database, such as security checking,
tablespace allocation, space management.
Data model A representation of the specific
information requirements of a business area.
See Entity relationship diagram.
Data source See Source.
DBA See Database administrator.
DBMS See Database management system,
Relational database management system.
DDL See Data Definition Language.
Decision support The act of using data
and tools within an organization to support
managerial decisions. Usually decision support involves the analysis of many units of
data in a heuristic fashion. As a rule, decision
support processing does not involve updating
data. See Heuristic.
Decision support systems (DSS) An
application used to provide summary or consolidated data to users for analysis, planning,
and performing what-if analysis by using
specialized tools that are usually driven by a
GUI. See Graphical user interface.
Delta A file created by an application that
contains only changes made to the application.
Denormalization A database design function that restructures a database by introducing derived data, replicated data, and
repeating data. The technique is often
employed to enhance performance within
decision support and data warehouse environments. See Data warehouse, Decision
support systems.
Denormalized data The data within a
denormalized database model. See Denormalization.
Dependent data mart A data mart that is
sourced directly from an existing data warehouse. See Data mart, Independent data
mart.
Derived column A value derived by some
algorithm from the values of other columns.
See Derived data.
Derived data Data that exists only as a subset of other data. Also called Derived
attribute.
Designer/2000 The Oracle computer-aided
systems engineering (CASE) tool.
Detail data See Fact data.
Developer/2000 The Oracle application
building tool for query, reporting, database
manipulation, and graphical display of database values.
Dimension A construct within a multidimensional structure that represents a side of
a multidimensional cube. Each dimension
represents a different category that the business chooses to measure by, such as customer, region, product, and time.
Dimension data The data by which the user
queries the business measurables. Contained
in dimension tables. See Fact data, Fact
tables, Dimension table, Dimension model.
Dimension table A table in a star model that
is joined to the fact table by a key value.
Dimensional model A model that supports a
top-down design methodology. For each
business process, it determines relevant facts
and dimensions.
Direct-access storage device (DASD) A
data storage unit where data can be accessed
directly without having to progress through a
serial file such as magnetic tape.
Dirty data Data that is in an unfit state to be
loaded into the data warehouse. It must be
transformed first. See Transformation,
Cleaning.
Discoverer The Oracle end-user analysis,
query, and reporting tool that is particularly
good for use in the data warehousing environment.
Discrete Usually used with reference to
dimension attributes. Data, usually text, that
takes on a fixed set of values that rarely
change.
DML See Data Manipulation Language.
Drill-across A technique that queries data
from two or more fact tables in a single
report.
Drill-down An analytical technique that
queries data from a summary row and navigates through a hierarchy of data to reach the
detail-level rows.
Drill-up An analytical technique that navigates from detail to header rows of data. Used
to view summarized (or aggregated) data.
DWM See Data Warehouse Method.
E
End User Layer (EUL) The user interface and layout of multidimensional structures designed for the data access tools. This
includes customization of the tools for end
users.
Enterprise A group of departments, divisions, groups, or companies that make up a
business. See Business.
Enterprise Manager An Oracle product
that gives a GUI front end to systems and
databases for enterprise-wide systems management.
Enterprise model A neutral model of the
business.
Entity relationship diagram (ERD) A diagram that pictorially represents entities, the
relationships between them and the attributes
used to describe them.
Entity relationship model (ERM) A type of
data model. Part of the business model that
consists of many entity relationship diagrams. See Entity relationship diagram.
ETT An acronym that stands for extraction,
transformation, and transportation. It refers
to the methods involved in cleaning operational data and moving it from source systems into the warehouse.
EUL See End-user layer.
Express The generic name of a suite of
Oracle products that enable users to analyze
multidimensional data and perform complex
analysis for decision support.
External data Data originating from a nonoperational source or outside the central processing complex, such as magazines,
newspapers, and financial companies.
Extract processing The process of selecting data from one environment and transporting it to another environment for use by
individual users or departments.
Extraction The process of selecting and
pulling data from the operational and external data sources, in order to prepare it for the
warehouse. Also called data extraction.
Extraction, transformation, and transportation See ETT.
F
Fact data The measurements, within the
core of the data warehouse, on which all
OLAP queries depend. See Online analytical
processing, Fact table.
Fact table The core (central) table in a
star or snowflake model, characterized by a
composite key. Values in the composite key
join to keys in the dimension tables. See
Composite key, Dimension table, Detail
data.
Feedback Response to requests, including
corrections, additions, and approval elicited
from users, sponsors, and any others with an
interest in the data warehouse.
File Transfer Protocol (FTP) A method for
transferring files from one location to
another.
Foreign key A key data value (which
may comprise one or more columns) in a
relational database table that joins to a primary key on another table. See Primary key.
Forms See Oracle Forms.
FTP See File Transfer Protocol.
G
Gap analysis The process of determining
and evaluating the variance between two
items’ properties.
Gateway A technology that enables interserver communication using various communication protocols.
Generalized key A dimension table primary
key that is created by modifying an existing
key. Generalized keys are also used with
slowly changing dimensions and summary
data.
Gigabyte One thousand million bytes.
Grain The level of detail of the data stored
in the database or data warehouse or moved
into the data warehouse from source systems.
Granularity See Grain.
Graphical user interface (GUI) A user
interface that is driven by point-and-click
operations using a mouse rather than a keyboard. Also known as a bitmapped interface.
H
Heuristic The process of learning by discovery.
Hierarchical database An older style of
database where records are strictly related
and access is strictly defined.
Householding In the financial services sector, assigning a customer account or individual, to a collection of accounts, individuals,
or locations for marketing purposes.
Hypercube A multidimensional model supporting more than three dimensions. You can
visualize this model by considering a number
of three-dimensional cubes that are related to
one another.
Hypertext Markup Language (HTML) The
language used to create HTML pages for the
Web using a word processor or text editor.
Hypertext Transfer Protocol (HTTP) The
first component, the protocol, of a URL
address, used widely in the Internet and
intranet environment. HTTP defines how to
interpret information. Other common protocols you may come across include FTP,
news, and gopher. See Uniform Resource
Locator.
I
Implementation The installation of an
increment of the data warehouse solution
(hardware, software, documentation, training) that is complete, installed, tested,
proved, operational and ready to use.
Increment The defined scope of the portion
of the data warehouse selected for implementation. Each increment satisfies elements
of the total data warehouse solution.
Incremental development A technique for
producing all or part of a production system
based on an outline definition. The technique
involves iterations of a cycle of build, refine,
and review so that the correct solution
emerges.
Independent data mart A data mart that is
sourced directly from operational systems.
See Data mart, Dependent data mart.
Index An area of the database storage dedicated to holding key data values to allow
direct access to a database row.
Information requirement The detail and
summary data and access functionality
required to satisfy the users’ decision support
and analysis functions for decision making
and planning.
Initial load The first population (insert) of
the production data warehouse database with
data from source systems. This load often
contains large amounts of historical data. See
Load, Refresh cycle.
Integrate To take data from a variety of different sources, in different formats, and
merge it into a single format.
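A minimal sketch of the Integrate entry, assuming two hypothetical source systems: one delivers comma-separated text with day/month/year dates, the other delivers records with different field names and ISO dates. The field names (CUST_NO, CUST_NAME, SIGNUP) and the target layout are invented for illustration.

```python
from datetime import datetime

def from_csv_row(row):
    """Hypothetical source A: 'id,name,DD/MM/YYYY' text rows."""
    cust_id, name, date_text = row.split(",")
    return {
        "customer_id": int(cust_id),
        "name": name.strip().title(),
        "signup_date": datetime.strptime(date_text, "%d/%m/%Y").date().isoformat(),
    }

def from_record(rec):
    """Hypothetical source B: dict records with different key names."""
    return {
        "customer_id": int(rec["CUST_NO"]),
        "name": rec["CUST_NAME"].strip().title(),
        "signup_date": rec["SIGNUP"],  # already in ISO format
    }

# Both sources end up in one common format.
integrated = [
    from_csv_row("17,ann lee,03/02/1999"),
    from_record({"CUST_NO": "18", "CUST_NAME": "BOB RAY", "SIGNUP": "1999-02-04"}),
]
for rec in integrated:
    print(rec)
```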
Integrity rules The laws that govern the
operations allowed on the data and structures
of a database.
Internal data Data that resides within an
organization’s central processing complex.
Iterative development The application of a
cyclic, evolutionary approach to system
development.
K
Knowledge worker A person whose job
relies on information as a primary resource.
L
Legacy system An existing operational system that is used for entering data about the
company’s operations.
Level fields These fields are often held in
dimension tables and relate to summary data
stored in the central fact table. Not a common approach to storing summary data.
Load The process of moving extracted,
transformed data into the data warehouse. See
Initial load, Refresh cycle.
Load window The time taken to load data
from multiple source systems into the data
warehouse. Can also be used to mean the
time available for the data load.
Logical model The phase of database design
that is concerned with identifying the relationships among the tables.
M
Mapping The process of matching data
from source systems to the structures in the
data warehouse.
Mapping tools Tools used to perform mapping.
Massively Parallel Processor (MPP) A
shared nothing architecture that takes a number of nodes and enables them to communicate rapidly.
Metadata Data that contains information
about the data and structures in the data
warehouse. Metadata is both for business
users and technical users. See Business metadata and User metadata.
Metalayer An architectural component of
the warehouse that resides between the warehouse data and the user, and contains metadata. See Metadata.
Middleware A layer that provides an easy-to-use, intuitive presentation of the underlying data or data structures.
MOLAP See Multidimensional online analytical processing.
Multidimensional analysis See Online analytical processing.
Multidimensional database A database
management system where data can be
viewed and manipulated in multiple dimensions. It provides a structure that supports
specialized query techniques such as drill-down, consolidation, and slicing and dicing.
See Cube.
Multidimensional online analytical processing (MOLAP) Data is stored and presented to the user over three or more dimensions.
N
Nonadditive A fact that cannot be logically
added between records. May be numeric and
must be combined in a computation with
other facts before being added across
records.
Nonuniform memory access (NUMA) A
method of accessing shared memory on systems that have loosely coupled memory.
Oracle Parallel Server can work with this
access method.
Normalization A technique that eliminates
data redundancy. See Normalized data.
Normalized data Data that has been separated into groups linked by defining normal
relationships, where all redundancy in the
data and repeating groups of data are
removed. The usual normalization level is
called third normal form, represented as
3NF. See Normalization.
NULL The state of a data item that indicates
no value.
NUMA See Nonuniform memory access.
O
ODS See Operational data store.
OLAP See Online analytical processing.
OLAP Server A multidimensional database
that provides a data structure that enables
flexible access to data and explores the relationship between summary and detail data.
OLTP See Online transaction processing
system, Operational system.
Online analytical processing (OLAP) A
loosely defined set of principles that provide
a dimensional framework for decision support. Online analytical processing allows for
analysis of data to reveal business trends and
statistics that are not immediately visible in
operational data. Also known as multidimensional analysis.
Online transaction processing system
(OLTP) The process whereby day-to-day
transactional data is held in a repository that
contains the operational data for the business.
Operational data Data that is maintained
and used for the day-to-day processing and
functional requirements of the business.
Operational data store A repository of
current and integrated operational data used
for analysis. It is often structured and supplied with data in the same way as the data
warehouse, but may act simply as a staging
area for data to be moved into the warehouse.
Operational system A system that handles
the day-to-day transactional information that
supports the client’s business. See Online
transaction processing system.
Oracle Expert An expert systems advisor
that generates performance tuning recommendations based upon a global system
view. Suggestions regarding space allocation, schema design, and indexing strategies
help DBAs tune VLDB environments.
Oracle Forms An Oracle Developer/2000
tool for creating, maintaining, and running
full-screen, interactive applications called
forms. The forms enable users to see and
change data in an Oracle database. They can
be used in block mode, character mode or
bit-mapped environments.
Oracle Method The methodology employed
by Oracle for corporate system implementation. Incorporates the Data Warehouse
Method and project management software.
Oracle Parallel Server See Parallel Processor, Oracle Server.
Oracle Reports The powerful, flexible
Oracle Developer/2000 report-writing tool.
Reports may be integrated with Oracle
Forms or run stand-alone.
Oracle Server The Oracle relational database management system (RDBMS). Components of the Oracle server include the
kernel and various utilities for use by database administrators and users. See Relational
database management system, Server.
Oracle Trace A performance data management tool that collects, manages, and displays performance data from throughout the
enterprise, including resource use (CPU, I/O,
page faults) by user or component.
P
Parallel Processor The Oracle server
component that splits a single database
action into many processes. See Parallel
Query Option.
Parallel Query Option The Oracle server
option that splits a single database query
request into a series of parallel query operations. See Parallel Processor.
Partitioned data Data that is physically
divided across many hard disks. Data may be
partitioned horizontally or vertically. The
technique improves application performance
and security. Also called Data partitioning.
Partitioning Splitting data across different
units. Partitioning may be achieved at the
system or application level.
Pilot An initial project that serves as a
model or template for future projects.
Pivoting A query technique that enables the
arrangement of rows and columns to be
changed in a report.
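The Pivoting entry can be sketched in a few lines of plain Python. The sales figures and the pivot helper below are hypothetical; the point is only that the same data can be arranged with regions as rows and quarters as columns, or the other way around.

```python
from collections import defaultdict

# Hypothetical report data: (region, quarter, amount).
sales = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 80),
    ("West", "Q2", 95),
]

def pivot(rows, row_pos, col_pos, value_pos):
    """Arrange tuples into a nested {row_label: {column_label: value}} layout."""
    table = defaultdict(dict)
    for rec in rows:
        table[rec[row_pos]][rec[col_pos]] = rec[value_pos]
    return {label: dict(cells) for label, cells in table.items()}

by_region = pivot(sales, 0, 1, 2)   # regions as rows, quarters as columns
by_quarter = pivot(sales, 1, 0, 2)  # pivoted: quarters as rows, regions as columns
print(by_region)
print(by_quarter)
```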
PL/SQL See Procedural SQL.
Primary key A single or multiple column
value that uniquely identifies a single row in
a relational database table.
Procedural Gateway Middleware that
enables data on a non-Oracle database to be
viewed from Oracle applications. See Middleware, Transparent Gateway.
Procedural SQL An extension to Oracle
SQL. It enables SQL to be embedded within
third generation programming constructs
such as GOTO and LOOP statements for
finer programming control.
Process 1. A key element of Oracle
Method. A cohesive set or thread of related
tasks that meets a specific project objective.
A process results in one or more key deliverables. 2. A sequential execution of functions
triggered by one or more events. See Oracle
Method, Data Warehouse Method (DWM).
Proof-of-concept An approach that contains
a well-defined set of objectives and is scoped
to demonstrate the immediate business benefit of an increment of the data warehouse.
See Increment.
Q
Query Manager Middleware that presents
the user querying data with an easy-to-use
and clear picture of the underlying business
data.
R
RDBMS See Relational database management system, Oracle Server.
Reach-through Used by online analytical
processing tools to access directly data on a
relational database server. The tool presents
the data in a multidimensional manner.
Reference data Data held in reference
tables. See Reference tables.
Reference tables Hold textual data that
contain expanded descriptions of data resident in dimension tables.
Referential integrity A condition that
guarantees that the values in one column also
exist in another column. This guarantee is
enforced through the use of integrity constraints.
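The following sketch illustrates the Referential integrity entry using Python's standard sqlite3 module as a stand-in for a relational server. The product and sales tables are hypothetical. Note that SQLite enforces foreign key constraints only after they are switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: every sales.product_id must exist in product.product_id.
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sales ("
    " sale_id INTEGER PRIMARY KEY,"
    " product_id INTEGER NOT NULL REFERENCES product (product_id),"
    " amount REAL)"
)

conn.execute("INSERT INTO product VALUES (1, 'Widget')")
conn.execute("INSERT INTO sales VALUES (100, 1, 9.99)")  # accepted: product 1 exists

try:
    conn.execute("INSERT INTO sales VALUES (101, 2, 5.00)")  # product 2 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

sales_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(rejected, sales_count)
```

The second insert is refused because its product_id has no matching row in the product table, so the guarantee described above continues to hold.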
Refresh The process of updating the data
warehouse database objects with new data.
The refresh process occurs on a predefined
and scheduled basis after initial load. See
Initial load, Refresh cycle.
Refresh cycle The frequency by which data
in the data warehouse database objects is
updated with new data. The cycle is determined by user business requirements. Regular process of updating the data warehouse
with further fact (detail) data and creating
appropriate summary tables and data
indexes.
Relational database management system
(RDBMS) Software that creates and
maintains the database system, as well as the
data stored in the database (in Oracle terms,
Version 6 and earlier). See Server.
Relational online analytical processing
(ROLAP) An implementation that presents
the user with a multidimensional view of
data that originates from a relational database structure.
Replication Method whereby copies of
databases are maintained at multiple sites in
a distributed system, to improve availability
and response times. Replication is frequently
employed as part of a backup and recovery
strategy.
Reports See Oracle Reports.
ROLAP See Relational online analytical
processing.
Row A series of attributes that identify the
characteristics, to be stored on the database,
of a significant object, such as a person. Also
referred to as tuple. See Table.
S
Schema A logical representation or model
of a database structure.
Scrubbing See Cleansing.
Semiadditive A numeric fact that can be
added along some dimensions in a fact table
but not others.
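An account balance is a commonly cited example of a semiadditive fact. The sketch below uses hypothetical month-end balances: adding them across the account dimension for one date gives a meaningful total, whereas adding one account's balances across the time dimension does not, so an average is taken instead.

```python
# Hypothetical month-end balances: (date, account, balance).
balances = [
    ("1999-01-31", "A", 100.0),
    ("1999-01-31", "B", 50.0),
    ("1999-02-28", "A", 120.0),
    ("1999-02-28", "B", 70.0),
]

# Additive across accounts: the total held on each date is meaningful.
total_by_date = {}
for day, account, balance in balances:
    total_by_date[day] = total_by_date.get(day, 0.0) + balance

# Not additive across time: summing account A's January and February
# balances (220.0) describes nothing real, so average them instead.
account_a = [balance for day, account, balance in balances if account == "A"]
average_a = sum(account_a) / len(account_a)

print(total_by_date)
print(average_a)
```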
Server Software that handles the functions required for concurrent, shared access
to a database. The server receives and processes SQL and PL/SQL statements originating from client applications. The computer
that runs the Server must be optimized for its
duties. The Oracle server was previously
called the Relational database management
system. See Relational database management system.
Slice and dice A mechanism whereby a
query can analyze information along any
dimension of the multidimensional model
equally.
Slowly changing dimensions The tendency
of dimension records, particularly the product and customer dimensions, to change
gradually or occasionally over time.
Snapshots A copy (or dump) of the data in a
database at any given point in time.
Snowflake model A normalized version of
the star model, employed in data warehouse
implementations. See Star model, Constellation model.
Source data The data that is used as the
basis of warehouse data. It may come from a database, flat files, or magazine articles. Also
called data source.
SQL*Loader An Oracle tool that enables
streams of data to be loaded into files or a
database.
SQL (Structured Query Language) The
internationally accepted standard language
for relational systems. See Data Manipulation Language, Data Definition Language.
SQL statement A complete command or
statement written in the SQL language.
Staging area A file, operational data store,
or series of relational database server tables
that contains the data to be moved to the
warehouse.
Star query Optimization technique that
enables the dimensions and fact tables in the
star model to be accessed efficiently, and
data to be returned to the user efficiently. It
ensures that the dimension data is visited
first, and the fact data last and only once.
Star model A database organization in
which a fact table with a composite key is
joined to a number of single-level dimension
tables. The model is used in data warehouse
implementations. See Constellation model,
Snowflake model.
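A minimal sketch of the Star model entry, again using Python's standard sqlite3 module in place of an Oracle server. The table names (time_dim, product_dim, sales_fact) and the data are hypothetical. The fact table carries a composite key made up of the keys of the two dimension tables it joins to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, category TEXT);

-- Fact table: composite key built from the dimension keys.
CREATE TABLE sales_fact (
    time_id INTEGER REFERENCES time_dim (time_id),
    product_id INTEGER REFERENCES product_dim (product_id),
    amount REAL,
    PRIMARY KEY (time_id, product_id)
);

INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO product_dim VALUES (10, 'Tools'), (20, 'Toys');
INSERT INTO sales_fact VALUES
    (1, 10, 100.0), (1, 20, 40.0), (2, 10, 60.0), (2, 20, 25.0);
""")

# Join the central fact table to each single-level dimension table.
rows = conn.execute("""
SELECT t.quarter, p.category, SUM(f.amount)
FROM sales_fact f
JOIN time_dim t ON t.time_id = f.time_id
JOIN product_dim p ON p.product_id = f.product_id
GROUP BY t.quarter, p.category
ORDER BY t.quarter, p.category
""").fetchall()
for row in rows:
    print(row)
```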
Subject area A vertical portion of the business, such as Sales and Marketing, that is
developed as an iteration of the enterprisewide data warehouse.
Summary data Data that is aggregated
and stored in a summary fact table and made
available to the user for direct and easy
access.
Summary table A data structure in the
warehouse that contains summarized (or
aggregated) facts. See Summary data.
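One way a summary table might be built is sketched below with Python's standard sqlite3 module. The sales_fact and sales_by_month tables are hypothetical, and the monthly grouping is an assumption made for this example rather than something the guide prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical detail (fact) data.
CREATE TABLE sales_fact (sale_date TEXT, product_id INTEGER, amount REAL);
INSERT INTO sales_fact VALUES
    ('1999-01-05', 10, 100.0),
    ('1999-01-20', 10, 50.0),
    ('1999-02-03', 10, 70.0);

-- Summary table: facts aggregated to month level and stored for direct access.
CREATE TABLE sales_by_month AS
SELECT substr(sale_date, 1, 7) AS sale_month,
       product_id,
       SUM(amount) AS total_amount
FROM sales_fact
GROUP BY substr(sale_date, 1, 7), product_id;
""")

summary = conn.execute(
    "SELECT sale_month, product_id, total_amount"
    " FROM sales_by_month ORDER BY sale_month"
).fetchall()
print(summary)
```

Queries that need only monthly totals can then read the smaller sales_by_month table instead of aggregating the detail rows each time.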
Symmetric Multiprocessor (SMP) A
shared everything hardware and software
architecture, where memory and disk controllers are accessible to all CPUs. See CPU.
System Global Area (SGA) A large area of
memory allocated to a database instance for
caching. See Cache.
T
Table A relational database structure that
comprises vertical columns (attributes) and
horizontal rows (tuples) of data. See Primary
key, row, and column.
Terabyte One trillion bytes.
Time stamp A date and time value written to
a record when it is created or changed in the
database.
Transformation The process of redefining
data based on predefined rules, using specific formulas and techniques. Also called
data transformation. See ETT.
Transparent Gateway Middleware that
enables viewing of data resident in a non-Oracle database from Oracle applications.
See Middleware, Procedural Gateway.
Transportation The movement of data to
the warehouse server. Also called data transportation. See ETT.
U
Uniform Resource Locator (URL) Text
used to identify and address an item in a
computer network.
Usage curve A line chart showing the
amount of CPU used at any time during normal system activity.
User A person at any level of the organization who needs to access the data in the data
warehouse for information in order to perform a business function.
User metadata The information provided to
users that allows them to understand and
access warehouse data. It focuses on what
data is in the warehouse, how it was transformed, the source, and the timeliness of the
data. See Business metadata and Transformation.
V
Very large database (VLDB) A very large
database is measured in gigabytes and terabytes.
Very large memory (VLM) Computers with
64-bit memory structures.
VLDB See Very large database.
VLM See Very large memory.
W
Warehouse manager The mechanism
that maintains the data in the warehouse
database.
Warehouse Technology Initiative (WTI)
An Oracle program that invites other vendors
to offer products and services that are complementary to those offered by Oracle, particularly in the area of products and services
related to data warehousing.
WTI See Warehouse Technology Initiative.