Introduction to Data Warehousing
Peter O’Donnell
DSS Lab, Monash University
Overview
■ What is a data warehouse?
■ What makes it so different?
– Managers as clients
– Architecture
■ Dimensional Modelling
– Compared to traditional data modelling
– Facts and dimensions
– OLAP
What is a data warehouse?
“Subject oriented, integrated, time variant, non-volatile collection of data in support of management decision making”
Inmon
“The basic data warehouse architecture interposes between end-user desktops and production data sources a warehouse that we usually think of as a single, large system maintaining an approximation of an enterprise data model.”
Demarest
Data Warehouses
■ A set of databases created to provide information to decision makers
■ Supports the access, understanding and analysis of data by decision makers
■ Provides the “data infrastructure” for management support systems (e.g. DSS and EIS)
■ Most of the effort is in data extraction, transformation and load activities
Another view...
“Data warehousing is a process not a product. ... The data warehousing process can be broken down into 4 phases:
Assemble data systematically
Transform the data, correct errors and form a consistent view
Distribute the data where needed
Furnish high speed tools of choice
Data warehousing provides a means for the useful storage of historical information allowing the user wider scope on which to base decision support information.”
The Butler Group
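The first three phases in the quote (assemble, transform, distribute) can be illustrated with a minimal Python sketch. The two source systems, their field names and every sample value are hypothetical, invented only for illustration; a real warehouse load would be far more involved.

```python
import sqlite3

# Hypothetical extracts from two unintegrated operational systems.
# Field names, formats and values are invented for illustration.
ORDERS_SYSTEM = [
    {"cust": "ACME ", "date": "1997-03-01", "amount": "120.50"},
    {"cust": "acme", "date": "1997-03-02", "amount": "80.00"},
]
BILLING_SYSTEM = [
    {"customer_name": "Bolt Ltd", "day": "02/03/1997", "total": 200.0},
    {"customer_name": "Bolt Ltd", "day": "03/03/1997", "total": None},  # dirty row
]


def assemble():
    """Phase 1: pull rows from each source, tagged with their origin."""
    for row in ORDERS_SYSTEM:
        yield "orders", row
    for row in BILLING_SYSTEM:
        yield "billing", row


def transform(tagged_rows):
    """Phase 2: correct errors and map both sources to one consistent layout."""
    for source, row in tagged_rows:
        if source == "orders":
            name, day, amount = row["cust"], row["date"], float(row["amount"])
        else:
            dd, mm, yyyy = row["day"].split("/")
            name, day, amount = row["customer_name"], f"{yyyy}-{mm}-{dd}", row["total"]
        if amount is None:
            continue  # reject rows with a missing measure
        yield name.strip().upper(), day, amount


def load(rows, conn):
    """Phase 3: distribute the cleaned rows into a warehouse table."""
    conn.execute("CREATE TABLE sale (customer TEXT, day TEXT, amount REAL)")
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(assemble()), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sale GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The fourth phase (furnishing tools) is only hinted at by the final query; in practice that role is played by the reporting, EIS or DSS front end.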
What’s so different about data warehouses?
■ Compared to operational systems (OLTP):
– Managers as clients
• What managers supposedly do
• The reality
– Architecture
Managers as clients
■ Discretionary and demanding clients
■ Chauffeured
■ Fragmentation, brevity and variety
■ Uncertain tasks
■ Urgency
■ Organisationally powerful
What’s in it for managers?
■ Fast access to data
■ Views of the organisation they have never had before
■ Exception reports (data mining agents)
■ Infrastructure for EIS
■ Infrastructure for DSS
What’s really in it for managers?
■ Beware “technocratic utopianism”
■ Maybe nothing at all!
■ Ackoff (1967) revisited:
– MIS are based on the following false assumptions:
• More information is better
• Managers don’t have the information they need
• Managers need the information they want
• Managers don’t have to understand a system to use it
■ http://images.lib.monash.edu.au/ims3001/04103275.pdf
Operational Systems Environment
■ OLTP Systems tend to be:
– Unintegrated
– Unsynchronised
– Complex
– Update-oriented
– Dirty data
Data Warehouse Environment
■ OLAP Systems (explained later) tend to be:
– Subject oriented
– Integrated
– Time Variant
– Non-Volatile
(Inmon & Hackathorn, 1994)
Goals of data warehouse architecture
■ Architectural goals (Demarest, 1994):
– To protect production systems from query drain
– To provide a traditional, highly manageable data oriented environment for DSS
• To separate data management and query processing issues from end-user access issues
– To enable data from different systems to be brought together in a logical unified fashion
Data Warehouse Architecture
[Figure: architecture diagram. Components shown: Internal Legacy Systems; External Data Sources; Data Warehouse; Special Purpose Data; Query System; Executive Information System (two EIS Clients); Decision Support System (DSS Client).]
Research at Monash - What We Know (The Benefits)
■ The major benefits of data warehousing we have noted are:
– Better data management
– Better access to data
– Better decision making
– A reduction in the cost associated with the production of ad hoc reports
■ IT professionals involved consider the investment to be very worthwhile.
What we know: architecture
■ Organisations are using existing technologies for their data warehouse
– As a result the traditional vendors have a strong presence in the market (e.g. IBM, Sun, Oracle etc.)
■ Client/Server architectures are dominant
– However many organisations are running their data warehouse on the same platform as their OLTP systems.
What we know: project scale
■ The majority of projects are not enterprise-wide in scale (data marts rather than data warehouses)
■ A small number of systems cost many millions of dollars but around $500,000 is typical (but proportional to the authority of the sponsor!)
■ A small number of users (~10) is common
■ The development team usually consists of 2-4 people
Where is the technology heading?
■ Architecture
– Web enablement
■ Project scale
– More large projects
– More users
Issues facing developers
■ Shortage of skilled people
■ Vendor support in Australia
■ Increasing expectations of users
■ Internet
■ Evolutionary development
■ Data quality (!!)
Some fundamentals (Ackoff again)
■ Don’t ask what people want
■ Managers don’t need more information
■ Find out what people need
■ Use the [warehouse] to provide better information
Ackoff - 1967
Evolutionary development
■ Users’ understanding of business is shaped by the information they have
■ System is developed to suit their understanding of the business
■ System provides better information
■ Users’ understanding of business is changed
■ System must change, ...
Data Warehouse Modelling
■ Aims
– Easily understood
– Extendable
– Stable
– Good performance for queries and reports
■ ER or Star Schema or both?
ER Schema (Simple)
[Figure: simple ER diagram. Entities: Customer Type, Customer, Region, Product Type, Product, Sale, Store, Period. Relationship labels: groups, within, contains, makes, in, located at.]
(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)
Traditional ER Approach to design
■ Entities and relationships
■ Rules of normalisation
– 3NF is typical
– Protection of integrity of database by avoiding anomalies
– Every logical thing is represented only once
■ Separate consideration of logical and physical design
Traditional Database Design
■ Large numbers of tables
– Oracle Financials - 1,800; SAP 7 up to 8,000
■ Commonly used
– Feels natural once you get used to it
■ Research shows that they are not easily understood by IT people
– Especially concepts like abstraction, generalisation, sub-types, etc.
Multi-Dimensional Models
■ It is possible to conceptualise data as multi-dimensional
■ Difficult to design
■ Easy to use resulting reports
So what is this dimensional stuff anyway?
■ An approach to database design that provides an easy to understand and navigate database
– The aim is to encourage understanding, exploration and learning
■ Each number has a set of associated attributes
– What it measures, what point of time it was created, what location it’s from, what product it’s associated with, what promotion, etc.
Multi-dimensionality
■ Usually talk about information spaces as cubes or hyper cubes or n-cubes
■ Each attribute associated with each number represents a dimension
– Measure, time, location, product, etc.
■ Resulting views are easy to navigate and move around
– Slice and dice
– Report template
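As a rough illustration of the cube idea, the sketch below stores each number against its coordinates in three dimensions and then navigates it. The dimension names, members and values are all made up, and the helper names `slice_cube` and `roll_up` are hypothetical, not taken from any OLAP product.

```python
from collections import defaultdict

# Each number (measure) is stored with the attributes that locate it in the cube.
# Dimension order: (time, location, product). All values are made up.
DIMENSIONS = ("time", "location", "product")
cube = {
    ("1996", "VIC", "Widgets"): 100,
    ("1996", "VIC", "Gadgets"): 40,
    ("1996", "NSW", "Widgets"): 70,
    ("1997", "VIC", "Widgets"): 120,
    ("1997", "NSW", "Gadgets"): 55,
}


def slice_cube(cube, **fixed):
    """Keep only the cells whose coordinates match the fixed dimension values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        coords: value
        for coords, value in cube.items()
        if all(coords[i] == v for i, v in positions.items())
    }


def roll_up(cube, *keep):
    """Sum the measure over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cube.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)


vic_only = slice_cube(cube, location="VIC")      # one slice of the cube
by_year = roll_up(cube, "time")                  # totals per year
vic_by_product = roll_up(vic_only, "product")    # slice, then summarise
print(by_year)
print(vic_by_product)
```

Fixing a single dimension value gives a slice; fixing several at once picks out a smaller sub-cube, which is roughly what dicing refers to.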
From Traditional Relational to Multi-dimensional
[Figure: a typical relational database table, and the same data displayed in two dimensions. From Pilot Software OLAP White Paper.]
Easy!
(The key is to identify the continuous and discrete variables in the flat file.)
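A minimal sketch of that conversion, assuming a flat file with two discrete variables (product and region) and one continuous variable (sales). The rows are invented for illustration and are not the data from the white paper.

```python
# Flat, relational-style rows: one row per (product, region) combination.
# Values are illustrative only.
flat_rows = [
    ("Widgets", "East", 50),
    ("Widgets", "West", 60),
    ("Gadgets", "East", 30),
    ("Gadgets", "West", 80),
]

# The discrete variables (product, region) become the two axes;
# the continuous variable (sales) fills the cells.
products = sorted({p for p, _, _ in flat_rows})
regions = sorted({r for _, r, _ in flat_rows})
cells = {(p, r): sales for p, r, sales in flat_rows}


def crosstab(products, regions, cells):
    """Render the cells as a two-dimensional text table."""
    lines = ["".ljust(10) + "".join(r.rjust(8) for r in regions)]
    for p in products:
        row = "".join(str(cells.get((p, r), 0)).rjust(8) for r in regions)
        lines.append(p.ljust(10) + row)
    return "\n".join(lines)


table = crosstab(products, regions, cells)
print(table)
```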
From a Spreadsheet to a Multi-dimensional report
■ Typical spreadsheet model
■ Two Dimensional?
Lurking Dimensions
■ What about 1997?
■ What about other states?
Other dimensions are implicit.
Year and State?
Spot the design choices!
(Time and Region)
What is OLAP?
■ On-Line Analytical Processing
■ Term was popularised by Codd in 1993
– 12 OLAP rules defining a standard by which to assess products
– Nothing new - most products already complied
■ OLAP Council
■ Client/Server
■ Multi-dimensional view of data
OLAP and ROLAP
■ Many OLAP tools have their own way of storing data (MDDB)
■ Some make it look like the data is in a cube but actually query a relational database (ROLAP)
– ‘How?’ you might ask!
Star Schema
■ Used to implement dimensional analysis using relational database technology
■ Very common in data warehouses
■ Fact table
– Many variations
– additive and non additive facts
■ Dimension tables
– become constraints (WHERE part of SQL)
Star schema (with attributes)
Sale (fact table): Time key, Store key, Customer key, Product key, Dollar sales, Unit sales
Customer: Customer key, Name, Customer type
Product: Product key, Product type, Weight
Store: Store key, Address, Region
Time: Time key, Day, Month
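The star schema above could be sketched in a relational database roughly as follows, using Python's built-in `sqlite3` module. Column types, the table name `time_dim` (used in place of Time) and all sample rows are assumptions made for illustration. The query shows dimension attributes acting as constraints in the WHERE clause while the facts are summed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, customer_type TEXT);
CREATE TABLE product  (product_key  INTEGER PRIMARY KEY, product_type TEXT, weight REAL);
CREATE TABLE store    (store_key    INTEGER PRIMARY KEY, address TEXT, region TEXT);
CREATE TABLE time_dim (time_key     INTEGER PRIMARY KEY, day INTEGER, month TEXT);
-- The fact table holds one foreign key per dimension plus the numeric facts.
CREATE TABLE sale (
    time_key     INTEGER REFERENCES time_dim(time_key),
    store_key    INTEGER REFERENCES store(store_key),
    customer_key INTEGER REFERENCES customer(customer_key),
    product_key  INTEGER REFERENCES product(product_key),
    dollar_sales REAL,
    unit_sales   INTEGER
);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Ann", "Retail"), (2, "Bob", "Wholesale")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "Hardware", 2.5), (2, "Software", 0.1)])
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(1, "1 Main St", "North"), (2, "9 High St", "South")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, 1, "Jan"), (2, 1, "Feb")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 100.0, 4),
    (1, 2, 2, 1, 250.0, 10),
    (2, 1, 1, 2, 40.0, 1),
    (2, 2, 2, 2, 60.0, 2),
])

# Dimension attributes constrain the query; the additive facts are summed.
query = """
SELECT t.month, SUM(s.dollar_sales), SUM(s.unit_sales)
FROM sale AS s
JOIN store AS st   ON st.store_key = s.store_key
JOIN product AS p  ON p.product_key = s.product_key
JOIN time_dim AS t ON t.time_key = s.time_key
WHERE st.region = ? AND p.product_type = ?
GROUP BY t.month
"""
result = conn.execute(query, ("South", "Hardware")).fetchall()
print(result)
```

A query of this shape is one plausible way a ROLAP tool could present cube-like views while the data stays in relational tables.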
Snowflake schema
[Figure: snowflake schema. Tables shown: Sale, Customer, Customer Type, Product, Product Type, Store, Region, Time.]
Conversion from ER to Star
■ “Event remembered” or “transaction” entity types become fact tables
– SALE
– SHIPMENT
– CLAIM
■ “Master” entity types become dimension tables
– CUSTOMER
– PRODUCT
– LOCATION
Uses of ER and Star Schemas
■ ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse
■ Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use
– Allow relational databases to support multi-dimensional data cubes
Dimensions Dimensions
■ Star schema might (typically) have 10-15 dimensions
■ Individual user views of the warehouse might include 6-7 of these
■ Typical systems (e.g. an EIS) might have 20 different views and 4-5 different base fact tables
■ Dimension tables can be related to a large number of facts
Steps in the design process
1. Choose a business process
2. Choose the grain of the fact table
Too fine > Oversized database
Too large > Loss of meaningful information
3. Choose the dimensions
4. Choose the measured facts
(usually numeric, additive quantities)
5. Complete the dimension tables
Kimball (1996)
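Steps 2 and 4 interact: additive facts stored at a fine grain can always be summed to a coarser grain, but detail lost by choosing a coarse grain cannot be recovered. The sketch below assumes a daily grain and rolls it up to months; the field names and figures are hypothetical.

```python
from collections import defaultdict

# Fact rows at a fine grain: one row per (day, product). Values are made up.
daily_facts = [
    {"day": "1997-01-01", "product": "Widgets", "dollar_sales": 100.0, "unit_sales": 10},
    {"day": "1997-01-02", "product": "Widgets", "dollar_sales": 60.0, "unit_sales": 5},
    {"day": "1997-02-01", "product": "Widgets", "dollar_sales": 90.0, "unit_sales": 10},
]


def to_monthly_grain(rows):
    """Re-grain to (month, product) by summing the additive facts."""
    totals = defaultdict(lambda: {"dollar_sales": 0.0, "unit_sales": 0})
    for row in rows:
        key = (row["day"][:7], row["product"])  # "YYYY-MM" is the coarser time grain
        totals[key]["dollar_sales"] += row["dollar_sales"]
        totals[key]["unit_sales"] += row["unit_sales"]
    return dict(totals)


monthly = to_monthly_grain(daily_facts)

# Average price is not additive: derive it from the summed facts
# rather than summing or storing it per row.
jan = monthly[("1997-01", "Widgets")]
jan_avg_price = jan["dollar_sales"] / jan["unit_sales"]
print(monthly)
print(jan_avg_price)
```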
Extra steps in the design process
6. Determine strategy for slowly changing dimensions
7. Create aggregations and other physical storage components
8. Determine the historical duration of the database
9. Determine the urgency with which the data is to be extracted and loaded into the data warehouse.
Kimball (1996)
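The slides do not say which strategy to pick for step 6. As one possible illustration, the sketch below keeps history by closing the current dimension row and adding a new one with a fresh surrogate key, an approach often called a Type 2 change in Kimball's terminology. The class, field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_key: int               # surrogate key referenced by fact rows
    customer_id: str                # identifier from the source system
    customer_type: str
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


def apply_change(rows: List[CustomerRow], customer_id: str,
                 new_type: str, change_date: str) -> None:
    """Close the customer's current row (if any) and append a new version."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.customer_type == new_type:
            return  # nothing changed, keep the current row
        current.valid_to = change_date
    next_key = max((r.customer_key for r in rows), default=0) + 1
    rows.append(CustomerRow(next_key, customer_id, new_type, change_date))


customer_dim = [CustomerRow(1, "C-100", "Retail", "1996-01-01")]
apply_change(customer_dim, "C-100", "Wholesale", "1997-06-01")
for row in customer_dim:
    print(row)
```

Facts recorded before the change keep pointing at the old surrogate key, so historical reports still show the customer as it was at the time.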
That’s it from me!
■ Check the web
■ Useful links:
– www.sims.monash.edu.au/dsslab
– www.rkimball.com
– www.dwassit.com
– www.olap.org
■ Stuff to read
– Anything by Ralph Kimball, Bill Inmon, lots of others