Download warehouse_chapter13

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Entity–attribute–value model wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Database wikipedia , lookup

Relational model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

Clusterpoint wikipedia , lookup

Functional Database Model wikipedia , lookup

Database model wikipedia , lookup

Transcript
Transportation:
Refreshing Warehouse Data
Chapter 13
Developing a Refresh Strategy for Capturing Changed Data
Consider load window
Identify data volumes
Identify cycle
Know the technical infrastructure
Plan a staging area
Determine how to detect changes
Operational
databases
User Requirements and
Assistance




Users define the refresh cycle
IT balances requirements against technical
issues
Document all tasks and processes
Employ user skills
Operational
databases
Load Window





Time available for entire ETT process
Plan
Test
Prove
Monitor
Load Window
0
3am
6
User Access Period Load Window
9
12pm
3
6
9
12
Load Window






Plan and build processes according to a strategy.
Consider volumes of data.
Identify technical infrastructure.
Ensure currency of data.
Consider user access requirements first
High availability requirements may mean a small load
window
User Access Period
0
3am
6
9
12pm
3
6
9
12
Scheduling the Load Window
Requirements
 Load cycle

Receive
data
File
1
File FTP
2
0
3
Control File
File Names
File types
Number of files
Number of loads
First-time load or refresh
Date of file
Data range
Records in file - counts
Totals - amounts
Control
4
process
Open
and
read
files
to
verify
and
analyze
3 am
Scheduling the Load Window
6
5
Verify,
analyze,
reapply
Load into
warehouse
File
1
File
2
3 am
8
7
Index
data
Create
summaries 9
Update
metadata
Parallel
load
6 am
9 am
Scheduling the Load Window
11
10
Back up
warehouse
Create
Views for
Specialized
tools
12
13
Users
Access Publish
Summary
data
User access
6 am
9 am
Capturing Changed Data for
Refresh





Capture new fact data
Capture changed dimension data
Determine method for capture of each
Methods:
- Wholesale data replacement
- Comparison of database instances
- Time stamping
- Database triggers
- Database log
Hybird techniques
Wholesale Data Replacement
Operational
databases
T1




T2
Expensive
Limited historical data, if any
Data mart implementations
Time period replacement
T3
Comparison of Database
Instance
Yesterday’s
Operational
database
Today’s
Operational
database


Database
comparison
Delta file holds
Changed data
Simple to perform, but expensive in
time and processing
Data file:
- Changes to operational data since
last refresh
- Used by various techniques
Time and Date Stamping
Operational
data



Delta file holds
Changed data
Fast scanning for records changed since
last extraction
Date Updated field
No detection of deleted data
Database Triggers
Operation
Server
(DBMS)
Trigger
Trigger
Trigger



Changed data intersected at the server
level
Extra I/O required
Maintenance overhead
Using a Database Log
Operational
data



Operational
Server
(DBMS)
Log
Log analysis
And
Data extraction
Contains before and after images
Requires system checkpoint
Common technique
Delta file holds
Changed data
Verdict



Consider each method on merit.
Consider a hybrid approach if one
approach is not suitable.
Consider current technical, existing
operational, and current application
issues.
Applying the Changes to Data
You have a choice of techniques:
 Overwrite a record
 Add a record
 Add a field
 Maintain history
 Add version numbers
Overwriting a Record



Customer ID
John Doe
Single
Customer ID
John Doe
Married
Easy to implement
Loses all history
Not recommended
Adding a New Record
1 Customer Id John Doe Single
1 Customer Id John Doe Single
1A Customer Id John Doe Married




History is preserved; dimensions grow.
Time constraints are not required.
Generalized key is created.
Metadata tracks usage of keys.
Adding a Current Field
Customer Id
John Doe
Single
Customer Id
John Doe
Single



Married 01-JAN-96
Maintains some history
Loses intermediate values
Is enhanced by adding an Effective
Date field
Limitations of Methods for
Applying Changes



Complete history impossible
Dimensions may grow large
Maintenance overload
1234 Comer
1234 Comer
1 Main Street
200 First Ave
555-6789
222-3211
1234
Comer
1234-01 Comer
1 Main Street
200 First Ave
555-6789
222-3211
1234
Comer
1234-01 Comer
1 Main Street
200 First Ave
Effective Date
555-6789
01-Apr-93
222-3212
01-Jun-97
Maintaining History
HIST_CUST
Time
CUSTOMER
Sales
Product
One-to-many relationship
 Always retain current record
 Consistently able to refer to record
history
History Preserved



History enables realistic analysis.
History retains context of data.
History provides for realistic historical
analysis.
- Reflect business changes
- Maintain context between fact and
dimension data
- Retain sufficient data to relate old to new
Version Numbering


Avoid double counting
Facts hold version number
Customer
Time
Sales
Customer.CustId Version Customer Names
1234
1
Comer
1234
2
Comer
Customer.CustId Version Sales Facts
1234
1
11,000
1234
2
12,000
Product
Purging and Archiving Data


As data ages, its value depreciates.
Remove old data from the warehouse:
- Archive for later use
- Purge without copy
Techniques for Purging Data




TRUNCATE: Retains no rollback
DELETE: Retains redo and rollback
ALTER TABLE: Removes a partition
PL/SQL: Uses database triggers
Techniques for Archiving Data



Export to dump file from tables
Import to tables from dump file
ALTER TABLE EXCHANGE partitions
EXP
Database
IMP
.dmp
Verdict


Defined by business requirements
Must be managed
Final Tasks



Update metadata
- ETT
- User
Publish data
- Availability
- Changes
- Subject area basis
Use database roles to prevent and allow
access
Publishing Data




Control access using database roles
24-hour operation may be requested
Compromise between load and access
Consider
- Staggering updates
- Using temporary tables
- Using separate tables
ETT Tool Selection Criteria











Overlap with existing tools
Availability of meta model
Supported data sources
Ease of modification and maintenance
Required fine tuning of code
Ease of change control
Power of transformation logic
Level of modularization
Power of error, exception, resubmission features
Intuitive documentation
Performance of code
ETT Tool Selection Criteria






Activity scheduling and sophistication
Metadata generation
Learning curve
Flexibility
Supported operation systems
Cost
Transportation Tools

Information
Oracle

Platinum Technology

OpenBridge
SQL*Loader
Gateways
PL/SQL
Precompilers
InfoPump
Platinum Info Transport
Replication Server Utilities

Oracle
Symmetric and Heterogeneous
Replication
Gateways and Middleware







Brio Technology
Information Co.
Information Builders
Oracle
Platinum Technology
Prism
Software AG
DataPrism
OpenBridge
EDA/SQL
Gateways
InfoHub
Prism Manager
Entire Transaction
Propagator
Summary
This lesson discussed the following topics:
 Capturing changed data
 Applying the changes
 Purging and archiving data
 Publishing the data, controlling access,
and automating processes
 Identifying tools for transporting data
into the warehouse