Data Warehouse Fundamentals
Lecture Objectives
 Review formal definitions of a data warehouse

 Discuss the defining features
 Distinguish between data warehouses and data marts
 Study each component or building block that makes up a data
warehouse
What is a Data Warehouse?
(a practitioner’s viewpoint)
 “A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of sources
and made available to end users in a way they can
understand and use it in a business context” – Barry Devlin,
IBM Consultant
 “A data warehouse is a database of data gathered from
many systems and intended to support management
reporting and decision making” – Michael Corey et al,
CTO of OneWarranty.com
What is a Data Warehouse?
(a classical viewpoint)
 According to W. H. Inmon (Building the Data Warehouse, 1992), “A DW is
a subject oriented, integrated, time varying, non-volatile collection of
data that is used primarily in organizational decision making.”
WHAT IS DATA
WAREHOUSING
 A data warehouse is typically a dedicated database system
for decision making that is separate from the production
database(s) used operationally. It differs from a production
system in that:
 it covers a much longer time horizon than transaction systems
 it includes multiple databases that have been processed so that
the warehouse’s data are defined uniformly (i.e., ‘clean’ data)
 it is optimized for answering complex queries from managers
and analysts.
Standard DB v. DW
Characteristics of a Data
Warehouse
SUBJECT ORIENTATION
 Data is organized around major subjects of the enterprise.
Subject Oriented
 Data warehouses are designed to help you analyze data.
For example, to learn more about your company's sales data, you can
build a warehouse that concentrates on sales.
 Using this warehouse, you can answer questions like "Who was our
best customer for this item last year?" This ability to define a data
warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.

E.g. claims data are organized around the subject of claims and not by
individual applications of Auto Insurance and Workers’ Comp
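
A question such as "Who was our best customer for this item last year?" can be sketched as a query over a single sales subject area. This is only an illustration: the `sales` table, its columns, and the sample rows are assumptions, not part of the lecture.

```python
import sqlite3

# Hypothetical sales subject area; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, item TEXT, sale_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Acme", "widget", 2023, 1200.0),
        ("Acme", "widget", 2023, 300.0),
        ("Birch", "widget", 2023, 900.0),
        ("Birch", "gadget", 2023, 5000.0),
        ("Acme", "widget", 2022, 9999.0),
    ],
)


def best_customer(conn, item, year):
    """Return (customer, total) with the highest total sales of one item in one year."""
    return conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM sales
        WHERE item = ? AND sale_year = ?
        GROUP BY customer
        ORDER BY total DESC
        LIMIT 1
        """,
        (item, year),
    ).fetchone()


top = best_customer(conn, "widget", 2023)
```

Because every row belongs to the same subject (sales), the question needs no knowledge of which operational application produced each row.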
Integrated
 Integration is closely related to subject
orientation.
 Data warehouses must put data from disparate
sources into a consistent format.
 They must resolve such problems as naming
conflicts and inconsistencies among units of
measure.
 When they achieve this, they are said to be
integrated.
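
A minimal sketch of what resolving naming conflicts and unit inconsistencies can look like. The two source layouts, the field names, and the choice of cents as the common unit are all hypothetical.

```python
# Hypothetical source layouts: the two applications name the same fields
# differently and record amounts in different units (cents vs. dollars).
auto_claims = [{"cust_no": "C1", "clm_amt_cents": 125000}]
workers_comp = [{"CustomerID": "C2", "ClaimAmountUSD": 980.5}]


def from_auto(rec):
    """Auto-insurance layout -> warehouse layout (amounts already in cents)."""
    return {
        "customer_id": rec["cust_no"],
        "amount_cents": rec["clm_amt_cents"],
        "source": "auto",
    }


def from_workers_comp(rec):
    """Workers' comp layout -> warehouse layout (dollars converted to cents)."""
    return {
        "customer_id": rec["CustomerID"],
        "amount_cents": round(rec["ClaimAmountUSD"] * 100),
        "source": "workers_comp",
    }


# One consistent format, regardless of the originating application.
claims = [from_auto(r) for r in auto_claims] + [
    from_workers_comp(r) for r in workers_comp
]
```

After the mapping, every claim carries the same field names and the same unit, which is the sense in which the data are integrated.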
Non-Volatile
 Non-volatile means that, once entered into the
warehouse, data are not changed/updated.
 This is logical because the purpose of a warehouse is to
enable you to analyze what has occurred.
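
One way to picture non-volatility is a store that offers only load and read operations. The `AppendOnlyStore` class below is a hypothetical illustration, not a real warehouse API.

```python
class AppendOnlyStore:
    """Hypothetical store that can be loaded and queried but never updated."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append copies of the incoming rows; existing rows are left untouched."""
        self._rows.extend(dict(r) for r in rows)

    def query(self, predicate):
        """Return copies of matching rows so callers cannot alter stored data."""
        return [dict(r) for r in self._rows if predicate(r)]


store = AppendOnlyStore()
store.load([{"claim_id": 1, "status": "open"}])

snapshot = store.query(lambda r: r["claim_id"] == 1)
snapshot[0]["status"] = "tampered"  # changes only the caller's copy
```

The class deliberately has no update or delete method, so what was loaded stays available for analysis exactly as it was recorded.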
Time Variant
 In order to discover trends in business, analysts
need large amounts of data.
 This is very much in contrast to online
transaction processing (OLTP) systems, where
performance requirements demand that
historical data be moved to an archive.
 The data are kept for many years so they can be
used for trends, forecasting, and comparisons
over time.
 A data warehouse's focus on change over
time is what is meant by the term time
variant.
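
Keeping several years of history makes comparisons over time possible. The sketch below computes yearly totals and year-over-year change from made-up figures; the data and function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical (year, amount) history kept in the warehouse.
history = [(2021, 100.0), (2021, 50.0), (2022, 180.0), (2023, 240.0)]


def yearly_totals(rows):
    """Sum amounts per year, returned in ascending year order."""
    totals = defaultdict(float)
    for year, amount in rows:
        totals[year] += amount
    return dict(sorted(totals.items()))


def year_over_year(totals):
    """Fractional change of each year's total relative to the previous year."""
    years = sorted(totals)
    return {
        curr: (totals[curr] - totals[prev]) / totals[prev]
        for prev, curr in zip(years, years[1:])
    }


totals = yearly_totals(history)
growth = year_over_year(totals)
```

An OLTP system that archives old rows could not answer this directly, since the earlier years would no longer be in the live database.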
Data Granularity
DATA MARTS
 Data Mart: A scaled-down version of the data
warehouse
 A data mart is a small warehouse designed for the
Strategic Business Unit (SBU) or department level.
 It is often a way to gain entry and provide an
opportunity to learn
 Major problem: if they differ from department to
department, they can be difficult to integrate
enterprise-wide
Data Warehouse and Data
Mart
Data Mart and Data
Warehouse
Data Warehouses and Data Marts
 Before deciding to build a data warehouse for your organization, you need to ask the
following basic and fundamental questions:
 Should you look at the big picture of your organization, take a top-down approach, and
build a mammoth data warehouse? Or, should you adopt a bottom-up approach, look at
the individual local and departmental requirements, and build bite-size departmental
data marts?
 Should you build a large data warehouse and then let that repository feed data into local,
departmental data marts? On the other hand, should you build individual local data
marts, and combine them to form your overall data warehouse?
 Should these local data marts be independent of one another? Or, should they be
dependent on the overall data warehouse for data feed?

These are crucial questions.
How are They Different?
 The two different basic approaches:
(1) Overall data warehouse feeding dependent data marts
(2) Several departmental or local data marts combining into a
data warehouse.
 In the first approach, you extract data from the operational
systems; you then transform, clean, integrate, and keep the
data in the data warehouse. So, which approach is best in
your case, the top-down or the bottom-up approach?
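
The first approach can be sketched roughly as follows: operational rows are cleaned into one warehouse, and a dependent data mart is then drawn from that warehouse rather than from the source systems. The record layout, cleaning rules, and function names are assumptions for illustration.

```python
def clean(record):
    """Transform step: normalize the department code; drop rows with no amount."""
    if record.get("amount") is None:
        return None
    return {"dept": record["dept"].strip().upper(), "amount": float(record["amount"])}


def build_warehouse(operational_rows):
    """Extract, clean, and keep the surviving rows in the warehouse."""
    cleaned = (clean(r) for r in operational_rows)
    return [r for r in cleaned if r is not None]


def feed_mart(warehouse, dept):
    """Dependent data mart: a departmental slice taken from the warehouse."""
    return [r for r in warehouse if r["dept"] == dept]


# Hypothetical rows as they might arrive from operational systems.
operational_rows = [
    {"dept": " sales ", "amount": "100"},
    {"dept": "HR", "amount": None},
    {"dept": "Sales", "amount": 25},
    {"dept": "hr", "amount": "40.5"},
]

warehouse = build_warehouse(operational_rows)
sales_mart = feed_mart(warehouse, "SALES")
```

Because the mart only ever reads from the warehouse, every department sees data that went through the same cleaning rules.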
Top-Down Versus Bottom-Up
Approach
 Top-Down Approach
 The advantages of this approach are:
 A truly corporate effort, an enterprise view of data
 Inherently architected—not a union of disparate data marts
 Single, central storage of data about the content
 Centralized rules and control
 May see quick results if implemented with iterations
 The disadvantages are:
 Takes longer to build.
 High risk of failure
 Needs high level of cross-functional skills
 High outlay without proof of concept
 Bottom-Up Approach
 The advantages of this approach are:
 Faster and easier implementation of manageable pieces
 Favorable return on investment and proof of concept
 Less risk of failure
 Inherently incremental; can schedule important data marts first
 Allows project team to learn and grow
 The disadvantages are:
 Each data mart has its own narrow view of data
 Spreads redundant data across every data mart
 Perpetuates inconsistent and irreconcilable data
 Proliferates unmanageable interfaces
A Practical Approach
In order to formulate an approach for your organization, you need to examine what exactly your
organization wants. Is your organization looking for long-term results or fast data marts for only
a few subjects for now? Does your organization want quick, proof-of-concept, throw-away
implementations? Or, do you want to look into some other practical approach?
Although the top-down and bottom-up approaches each have their own advantages and
drawbacks, a compromise approach accommodating both views appears to be practical.
The steps in this practical approach are as follows:
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time
In this practical approach, you go to the basics and determine what exactly your organization wants
in the long term.
The key to this approach is that:
 First plan at the enterprise level.
 Gather requirements at the overall level.
 Establish the architecture for the complete warehouse.
 Determine the data content for each supermart.
 Implement these supermarts, one at a time.
A data mart, in this practical approach, is a logical subset of the complete data warehouse, a sort of
pie-wedge of the whole data warehouse.
A data warehouse, therefore, is a conformed union of all data marts. Individual data marts are
targeted to particular business groups in the enterprise, but the collection of all the data marts
forms an integrated whole, called the enterprise data warehouse.
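
The idea of a "conformed union" can be sketched as below: each data mart must use the same agreed columns before its rows are combined into the enterprise warehouse. The column set, mart names, and sample rows are hypothetical.

```python
# Hypothetical set of columns every data mart agrees to use.
CONFORMED_COLUMNS = ("customer_id", "date", "amount")


def conforms(mart_rows):
    """True if every row uses exactly the conformed column names."""
    return all(set(row) == set(CONFORMED_COLUMNS) for row in mart_rows)


def enterprise_warehouse(marts):
    """Union the marts, tagging rows by subject; reject marts that do not conform."""
    combined = []
    for subject, rows in marts.items():
        if not conforms(rows):
            raise ValueError(f"data mart {subject!r} does not use the conformed columns")
        combined.extend({**row, "subject": subject} for row in rows)
    return combined


marts = {
    "sales": [
        {"customer_id": "C1", "date": "2023-01-05", "amount": 120.0},
        {"customer_id": "C2", "date": "2023-01-06", "amount": 80.0},
    ],
    "claims": [
        {"customer_id": "C1", "date": "2023-02-10", "amount": 950.0},
    ],
}

warehouse = enterprise_warehouse(marts)
```

Each mart remains a subset selectable by its `subject` tag, while the combined rows can be analyzed together because they share one layout.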
Data Warehouse COST
 Data warehouses are not cheap
 Median cost to create (does not include operating
cost) = $2.2M
 Multimillion dollar costs are common
 Their design and implementation are still an art, and they
require considerable time to create.
Data Warehouse SIZE
 Being designed for the enterprise so that
everyone has a common data set, they are
large and increase in size with time.
 Typical storage sizes run from 50 Gigabytes
to several Terabytes
APPLICATION - DATA MINING
 Also known as Knowledge Discovery in Databases
(KDD)
 The term "mining" refers to finding answers
about a business from the data warehouse that
the executive or analyst had not thought to ask
Data Warehouse
Architectures
Data Warehouse Architectures:
Basic
Data Warehouse Architectures:
with a Staging Area
Data Warehouse Architectures:
with a Staging Area and Data Marts
A General Architecture for
Data Warehousing
Problems and Issues
Data Systems Supporting DW