DATA WAREHOUSE
Elsayed Hemayed
Data Mining Course
Outline
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure
Data Warehouse
Data, Data everywhere
- I can’t find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can’t get the data I need
  - need an expert to get the data
- I can’t understand the data I found
  - available data poorly documented
- I can’t use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

What is a Data Warehouse?
"A single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context." [Barry Devlin]

What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

What is Data Warehousing?
"A process of transforming data into information and making it available to users in a timely enough manner to make a difference" [Forrester Research, April 1996]
[Figure: Data transformed into Information]

Warehouses are Very Large Databases
- Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: National Medical Records
- Zettabytes -- 10^21 bytes: Weather images
- Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing -- It is a process
- Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
- A decision support database maintained separately from the organization’s operational database

Why Separate Data Warehouse?
- Performance
  - Operational dbs are designed & tuned for known transactions & workloads.
  - Complex OLAP queries would degrade performance for operational transactions.
  - Special data organization, access & implementation methods are needed for multidimensional views & queries.
- Function
  - Missing data: Decision support requires historical data, which operational dbs do not typically maintain.
  - Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational dbs, external sources.
  - Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

Key Definitions
- OLTP: On Line Transaction Processing
  - Describes processing at operational sites
- OLAP: On Line Analytical Processing
  - Describes processing at warehouse
- "Business Intelligence" refers to reporting and analysis of data stored in the warehouse
- Data warehouse is the foundation for business intelligence.
- "Data warehouse/business intelligence" (DW/BI) refers to the complete end-to-end system.

Explorers, Farmers and Tourists
- Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
- Farmers: Harvest information from known access paths
- Tourists: Browse information harvested by farmers

Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence

To summarize ...
- Operational (OLTP) systems are used to "run" a business
- The Data Warehouse (OLAP) helps to "optimize" the business

Data Warehouse vs. Data Marts
What comes first?

Data Mart vs. Data Warehouse
- Data mart is a specific, subject-oriented repository of data that was designed to answer specific questions
  - Usually, multiple data marts exist to serve the needs of multiple business units (sales, marketing, operations, collections, accounting, etc.)
- Data warehouse is a single organizational repository of enterprise wide data across many or all subject areas.
  - Data warehouse is an enterprise wide collection of data marts

From the Data Warehouse to Data Marts
[Figure: moving from Data toward Information, structure goes from Organizationally Structured (the Data Warehouse) through Departmentally Structured to Individually Structured, while History, Normalized, and Detailed go from More to Less.]

Data Warehouse and Data Marts
[Figure: Sales, Finance, and Mktg. OLAP data marts (lightly summarized, departmentally structured) sit on top of the data warehouse (organizationally structured, atomic, detailed data warehouse data).]

Characteristics of the Departmental Data Mart
- OLAP
- Small
- Flexible
- Customized by Department
- Source is departmentally structured data warehouse
[Figure: Sales, Finance, and Mktg. data marts]

Data Mart Centric
[Figure: Data Sources -> Data Marts -> Data Warehouse]

Problems with Data Mart Centric Solution
- If you end up creating multiple warehouses, integrating them is a problem

True Warehouse
[Figure: Data Sources -> Data Warehouse -> Data Marts]

Data Warehouse Architecture
[Figure: sources (Relational Databases, ERP Systems, Purchased Data, Legacy Data) pass through Extraction and Cleansing and an Optimized Loader into the Data Warehouse Engine, which is described by a Metadata Repository and serves Analyze and Query tools.]

Implementing a Warehouse
- Monitoring: Getting the data from the sources
- Data Integration
  - Cleansing
  - Loading
- Processing: Query processing, indexing, ...
- Managing: Metadata, Design, ...

Monitoring
- Source Types: relational, flat file, IMS, WWW, news-wire, …
- Incremental vs. Refresh

customer
id    name    address     city
53    joe     10 main     sfo
81    fred    12 main     sfo
111   sally   80 willow   la     (new)

Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Application level monitoring

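The periodic snapshot technique can be illustrated with a minimal sketch: two snapshots of a source table, keyed by row id, are compared to find what changed between them. The function name `diff_snapshots` is hypothetical, and the sample rows are borrowed from the customer table on the Monitoring slide purely for illustration.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts keyed by row id) and classify the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserted, updated, deleted


# Yesterday's and today's snapshots of the source table (illustrative data).
yesterday = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
}
today = {
    53: ("joe", "10 main", "sfo"),
    81: ("fred", "12 main", "sfo"),
    111: ("sally", "80 willow", "la"),
}

inserted, updated, deleted = diff_snapshots(yesterday, today)
print("inserted:", inserted)
print("updated:", updated)
print("deleted:", deleted)
```

Only the differences need to be sent to the warehouse, which is the idea behind an incremental load as opposed to a full refresh.
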
Monitoring Issues
- Frequency
  - periodic: daily, weekly, …
  - triggered: on "big" change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways

Refresh
- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh -- refresh techniques

When to Refresh?
- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless warehouse users require current data (e.g., up-to-the-minute stock quotes)
- refresh policy set by administrator based on user needs and traffic
- possibly different policies for different sources

How To Detect Changes
- Create a snapshot log table to record ids of updated rows of source data and timestamp
- Detect changes by:
  - Defining after row triggers to update snapshot log when source table changes
  - Using regular transaction log to detect changes to source data

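The trigger-based approach can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and trigger names (`customer`, `snapshot_log`, `customer_after_insert`, `customer_after_update`) are made up here, and a production source database would use its own trigger dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Snapshot log: id of the changed row, kind of change, and a timestamp.
CREATE TABLE snapshot_log (
    row_id      INTEGER,
    change_type TEXT,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);

-- After-row triggers keep the log up to date when the source table changes.
CREATE TRIGGER customer_after_insert AFTER INSERT ON customer
FOR EACH ROW BEGIN
    INSERT INTO snapshot_log (row_id, change_type) VALUES (NEW.id, 'insert');
END;

CREATE TRIGGER customer_after_update AFTER UPDATE ON customer
FOR EACH ROW BEGIN
    INSERT INTO snapshot_log (row_id, change_type) VALUES (NEW.id, 'update');
END;
""")

conn.execute("INSERT INTO customer VALUES (53, 'joe', 'sfo')")
conn.execute("INSERT INTO customer VALUES (111, 'sally', 'la')")
conn.execute("UPDATE customer SET city = 'la' WHERE id = 53")

# The refresh job reads the log instead of scanning the whole source table.
changes = conn.execute(
    "SELECT row_id, change_type FROM snapshot_log ORDER BY rowid"
).fetchall()
print(changes)
```
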
Data Integration Across Sources
- Savings: same data, different name
- Loans: different data, same name
- Trust: data found here, nowhere else
- Credit card: different keys, same data

Data Transformation Example
- Gender is encoded differently by each application
  - appl A - m,f
  - appl B - 1,0
  - appl C - x,y
  - appl D - male, female
- Pipeline length is measured in different units
  - appl A - pipeline - cm
  - appl B - pipeline - in
  - appl C - pipeline - feet
  - appl D - pipeline - yds
- The balance field has different names
  - appl A - balance
  - appl B - bal
  - appl C - currbal
  - appl D - balcurr

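A minimal sketch of how the three kinds of differences in this example might be reconciled into one warehouse format. Several things here are assumptions, not taken from the slide: which code stands for which gender in applications B and C, the choice of centimetres as the target unit, and the names `to_warehouse_record`, `gender`, and `pipeline` used for the input records.

```python
# Per-application gender encodings. The slide does not say which code means
# which gender for B and C; the pairings below are assumed for illustration.
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Conversion factor from each application's length unit to centimetres
# (A: cm, B: inches, C: feet, D: yards).
UNIT_TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse_record(app, record):
    """Convert one source record into the uniform warehouse representation."""
    return {
        "gender": GENDER_CODES[app][str(record["gender"])],
        "pipeline_cm": record["pipeline"] * UNIT_TO_CM[app],
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse_record("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse_record("D", {"gender": "female", "pipeline": 2, "balcurr": 99.5})
print(row_b)
print(row_d)
```

Keeping the mappings in plain lookup tables makes the transformation rules easy to review and to record as metadata.
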
Data Integrity Problems
- Same person, different spellings
  - Ahmed, Ahmad, Ahmaad etc...
- Multiple ways to denote company name
  - Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names
  - Oct 6, 6 Oct
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at point of sale
  - manual entry leads to mistakes
  - "in case of a problem use 9999999"

Data Extraction and Cleansing
- Extract data from existing operational and legacy data
- Issues:
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data Transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded

Scrubbing Data
- Sophisticated transformation tools.
- Used for improving the quality of data
- Clean data is vital for the success of the warehouse
- Example
  - Ahmed Aly, Ahmad Ali, Ahmaad Aly, Ahmad Aly, etc. are the same person
- Scrubbing Tools
  - Apertus -- Enterprise/Integrator
  - Vality -- IPE
  - Postal Soft

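To give a feel for what a scrubbing tool does with the name example, here is a rough sketch that groups similar spellings using `difflib.SequenceMatcher` from the Python standard library. It is not how the listed commercial tools work; the function names, the 0.75 threshold, and the extra name "Sally Smith" are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(names, threshold=0.75):
    """Greedily put each name into the first group whose representative
    (the group's first name) is similar enough; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["Ahmed Aly", "Ahmad Ali", "Ahmaad Aly", "Ahmad Aly", "Sally Smith"]
groups = group_similar(names)
for group in groups:
    print(group)
```

A threshold that is too low merges different people and one that is too high misses variants, so real scrubbing usually combines such scores with other fields (address, date of birth) before declaring a match.
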
Data Loading
- After extracting, cleaning, validating etc. need to load the data into the warehouse
- Issues
  - huge volumes of data to be loaded
  - small time window available when warehouse can be taken off line (usually nights)
  - when to build index and summary tables
  - allow system administrators to monitor, cancel, resume, change load rates
  - Recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques
- Use SQL to append or insert new data
  - record at a time interface
  - will lead to random disk I/O’s
- Use batch load utility
- Incremental versus Full loads
- Online versus Offline loads

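The contrast between a record-at-a-time interface and a batch load can be sketched with `sqlite3`. This is an illustration only: `executemany` inside a single transaction stands in for a vendor's batch load utility, and the table name `sale_fact` and the generated rows are assumptions.

```python
import sqlite3

# Illustrative rows to load: (order id, customer, amount).
rows = [(i, f"cust{i}", i * 1.5) for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale_fact (order_id INTEGER, customer TEXT, amount REAL)")


def load_row_at_a_time(conn, rows):
    """One INSERT and one commit per record: simple, but slow for large loads."""
    for row in rows:
        conn.execute("INSERT INTO sale_fact VALUES (?, ?, ?)", row)
        conn.commit()


def load_batch(conn, rows):
    """All rows in a single transaction: far fewer round trips and commits."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?)", rows)


load_batch(conn, rows)
loaded = conn.execute("SELECT COUNT(*) FROM sale_fact").fetchone()[0]
print(loaded, "rows loaded")
```

Wrapping the batch in one transaction also helps with graceful recovery: a failed load leaves the table unchanged and can simply be restarted.
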
Data Warehouse Structure
- Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Warehouse Structure
- base customer (1985-87)
  - custid, from date, to date, name, phone, dob
- base customer (1988-90)
  - custid, from date, to date, name, credit rating, employer
- customer activity (1986-89) -- monthly summary
- customer activity detail (1987-89)
  - custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91)
  - custid, activity date, amount, line item no, order no
Time is part of key of each table

Data Granularity in Warehouse
- Summarized data stored
  - reduces storage costs
  - reduces cpu usage
  - increases performance since a smaller number of records has to be processed
  - design around traditional high level reporting needs
  - tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse
- Can not answer some questions with summarized data
  - Did Ahmed call Aly last month? Not possible to answer if total duration of calls by Ahmed over a month is only maintained and individual call details are not.
- Detailed data too voluminous

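The call example can be made concrete with a small sketch. The call records and the helper name `did_call` are invented for illustration; the point is that the monthly summary keeps only a total per caller, so the question about a specific callee can be answered from the detail records alone.

```python
from collections import defaultdict

# Detail level: one record per call (caller, callee, duration in seconds).
calls = [
    ("Ahmed", "Aly", 120),
    ("Ahmed", "Sally", 300),
    ("Fred", "Aly", 60),
]

# Summary level: total call duration per caller for the month.
monthly_summary = defaultdict(int)
for caller, callee, seconds in calls:
    monthly_summary[caller] += seconds


def did_call(detail, caller, callee):
    """Answerable only from detail records; the summary has lost the callee."""
    return any(c == caller and d == callee for c, d, _ in detail)


print(dict(monthly_summary))
print(did_call(calls, "Ahmed", "Aly"))
```
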
Granularity in Warehouse
- Tradeoff is to have dual level of granularity
  - Store summary data on disks
    - 95% of DSS processing done against this data
  - Store detail on tapes
    - 5% of DSS processing against this data

Vertical Partitioning
Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
- Frequently accessed: Acct. No, Balance
  - Smaller table and so less I/O
- Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address

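A minimal `sqlite3` sketch of this split, assuming hypothetical table names `account`, `account_hot`, and `account_cold` and made-up sample rows. The frequently accessed columns go into a narrow table, the rest into a second table, and the account number ties them back together when the full row is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    acct_no INTEGER PRIMARY KEY, name TEXT, balance REAL,
    date_opened TEXT, interest_rate REAL, address TEXT
);
INSERT INTO account VALUES (1, 'joe',   500.0, '1997-01-07', 0.03, '10 main');
INSERT INTO account VALUES (2, 'sally', 900.0, '1997-03-08', 0.04, '80 willow');

-- Frequently accessed columns: a smaller table, so less I/O per scan.
CREATE TABLE account_hot AS SELECT acct_no, balance FROM account;

-- Rarely accessed columns.
CREATE TABLE account_cold AS
    SELECT acct_no, name, date_opened, interest_rate, address FROM account;
""")

# The common query touches only the narrow table.
balances = conn.execute(
    "SELECT acct_no, balance FROM account_hot ORDER BY acct_no"
).fetchall()

# The occasional query joins the two partitions on the shared key.
full_row = conn.execute("""
    SELECT h.acct_no, c.name, h.balance
    FROM account_hot h JOIN account_cold c ON c.acct_no = h.acct_no
    WHERE h.acct_no = 2
""").fetchone()
print(balances)
print(full_row)
```
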
Schema Design
- Database organization
  - must look like business
  - must be recognizable by business user
  - approachable by business user
  - must be simple
- Schema Types
  - Star Schema
  - Fact Constellation Schema
  - Snowflake schema

Dimensional Modeling
- Fact Table: sale (orderId, date, custId, prodId, storeId, qty, amt)
- Dimension Table: product (prodId, name, price)
- Dimension Table: store (storeId, city)
- Dimension Table: customer (custId, name, address, city)

Fact Tables
- Contain the metrics resulting from a business process or measurement event, such as the sales ordering process or service call event
- Dimensional models should be structured around business processes and their associated data sources.
  - This results in ability to design identical, consistent views of data for all observers, regardless of which business unit they belong to, which goes a long way toward eliminating misunderstandings at business meetings
- Fact table’s granularity should be set at the lowest, most atomic level captured by the business process
  - This allows for maximum flexibility and extensibility.
  - Business users will be able to ask constantly changing, free-ranging, and very precise questions.

Fact Table
- Central table
- mostly raw numeric items
- narrow rows, a few columns at most
- large number of rows (millions to a billion)
- Access via dimensions

Dimension Tables
- Contain the descriptive attributes and characteristics associated with specific, tangible measurement events, such as the customer, product, or sales representative associated with an order being placed.
- Dimension attributes are used for constraining, grouping, or labeling in a query.
- Hierarchical many-to-one relationships are denormalized into single dimension tables.

Dimension Table
- Define business in terms already familiar to users
- Wide rows with lots of descriptive text
- Small tables (about a million rows)
- Joined to fact table by a foreign key
- heavily indexed
- typical dimensions
  - time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Star Schema
- A single fact table and multiple dimension tables
[Figure: a central fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables Time, cust, prod, and city.]

Star Schema Example

product
prodId   name   price
p1       bolt   10
p2       nut    5

sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

customer
custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la

store
storeId   city
c1        nyc
c2        sfo
c3        la

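The example tables can be loaded into an in-memory SQLite database to show a typical star-schema query: the fact table `sale` is joined to its dimension tables and the measure `amt` is grouped by dimension attributes. The table contents follow the slide; the particular query (total amount by store city and product name) is an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale     (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                       storeId TEXT, qty INTEGER, amt INTEGER);

INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store    VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                            (81, 'fred', '12 main', 'sfo'),
                            (111, 'sally', '80 willow', 'la');
INSERT INTO sale     VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                            ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                            ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Fact table joined to two dimensions; dimension attributes label and group.
totals = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
print(totals)
```
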
Star Schema Example
- sale (fact): orderId, date, custId, prodId, storeId, qty, amt
- product: prodId, name, price
- store: storeId, city
- customer: custId, name, address, city

Snowflake schema
- The tables which describe the dimensions are normalized.
- Easy to maintain and saves storage
[Figure: a central fact table (date, custno, prodno, cityname, ...) surrounded by the dimension tables Time, cust, prod, and city, with city further normalized into reg.]

Snowflake Schema Example

store
storeId   cityId   tId   mgr
s5        sfo      t1    joe
s7        sfo      t2    fred
s9        la       t1    nancy

sType
tId   size    location
t1    small   downtown
t2    large   suburbs

city
cityId   pop   regId
sfo      1M    north
la       5M    south

region
regId   name
north   cold region
south   warm region

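A short sketch of what normalization costs at query time, using the tables from this example in an in-memory SQLite database. Finding the region of each store takes a chain of joins (store to city to region) that a denormalized star dimension would avoid. The query itself is an illustrative choice, not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT);
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT, tId TEXT, mgr TEXT);

INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'),
                          ('s7', 'sfo', 't2', 'fred'),
                          ('s9', 'la',  't1', 'nancy');
""")

# Walk the normalized dimension hierarchy: store -> city -> region.
store_regions = conn.execute("""
    SELECT s.storeId, r.name
    FROM store s
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    ORDER BY s.storeId
""").fetchall()
print(store_regions)
```
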
Fact Constellation
- Multiple fact tables that share many dimension tables
- Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout sharing the dimension tables Hotels, Travel Agents, Customer, Promotion, and Room Type.]

Hybrid Approach
- If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate
- In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.

Partitioning
- Breaking data into several physical units that can be handled separately
- Not a question of whether to do it in data warehouses but how to do it
- Granularity and partitioning are key to effective implementation of a warehouse

Why Partition?
- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring

Criterion for Partitioning
- Typically partitioned by
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of above

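Partitioning by date, the first criterion above, can be sketched in a few lines. The function name `partition_by_month`, the ISO date format, and the sample rows are assumptions for illustration; a real warehouse would rely on the partitioning features of its database engine.

```python
from collections import defaultdict


def partition_by_month(rows):
    """Split rows into separate units keyed by 'YYYY-MM' of their date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["date"][:7]].append(row)
    return dict(partitions)


sales = [
    {"date": "1997-01-07", "amt": 12},
    {"date": "1997-01-21", "amt": 11},
    {"date": "1997-03-08", "amt": 50},
]

partitions = partition_by_month(sales)

# A query restricted to January only needs to scan the January partition.
january_total = sum(r["amt"] for r in partitions["1997-01"])
print(sorted(partitions), january_total)
```

Each partition can then be indexed, reorganized, archived, or recovered on its own, which is the flexibility the previous slide describes.
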
Query Processing
- Indexing
- Parallel Query Processing
- Pre computed views/aggregates
- SQL extensions
  - Extended family of aggregate functions
    - rank (top 10 customers)
    - percentile (top 30% of customers)
    - median, mode
  - Reporting features
    - running total, cumulative totals
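
What these extended aggregates compute can be shown with a small sketch in plain Python rather than in any particular SQL dialect. The customer names and amounts are made up, and the top 2 is used instead of the top 10 only because the sample is tiny.

```python
from itertools import accumulate
from statistics import median, mode

# Total sales per customer (illustrative data).
sales = {"joe": 120, "fred": 80, "sally": 200, "nancy": 80, "ali": 150}

# rank: customers ordered by sales, highest first.
ranked = sorted(sales.items(), key=lambda kv: kv[1], reverse=True)
top_2 = [name for name, _ in ranked[:2]]

# percentile: the top 30% of customers (at least one), using integer arithmetic.
cutoff = max(1, len(ranked) * 30 // 100)
top_30_percent = [name for name, _ in ranked[:cutoff]]

# median and mode of the sales amounts.
median_sales = median(sales.values())
mode_sales = mode(sales.values())

# running (cumulative) total over a sequence of daily amounts.
daily_amounts = [12, 11, 50]
running_total = list(accumulate(daily_amounts))

print(top_2, top_30_percent, median_sales, mode_sales, running_total)
```
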
Metadata Repository
- Administrative metadata
  - source databases and their contents
  - gateway descriptions
  - warehouse schema, view & derived data definitions
  - dimensions, hierarchies
  - pre-defined queries and reports
  - data mart locations and contents
  - data partitions
  - data extraction, cleansing, transformation rules, defaults
  - data refresh and purging rules
  - user profiles, user groups
  - security: user authorization, access control

Metadata Repository (2)
- Business data
  - business terms and definitions
  - ownership of data
  - charging policies
- Operational metadata
  - data lineage: history of migrated data and sequence of transformations applied
  - currency of data: active, archived, purged
  - monitoring information: warehouse usage statistics, error reports, audit trails.

Data Warehouse References
- W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
- W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
- Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997

Summary
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure