What Is a Dimensional Data Warehouse?

Training Program for Information and Informatics Personnel,
Health Information Systems



Paulraj Ponniah. 2010. Data Warehousing
Fundamentals for IT Professionals, John
Wiley & Sons.
Vincent Rainardi. 2008. Building a Data
Warehouse With Examples in SQL Server.
Apress.
William H. Inmon. 2005. Building the Data
Warehouse, Wiley.

1980’s to early 1990’s
 Focus on computerizing business processes
 To gain competitive advantage

By early 1990’s
 All companies had operational systems
 Computerization no longer offered any advantage

How to get competitive advantage?
From data to information:
A process of transforming data
into information and making it
available to users in a timely
enough manner to make a
difference
[Forrester Research]




Companies, over the years, gathered huge
volumes of data
“Hidden Treasure”
Can this data be used in any way?
Can we analyze this data to get any
competitive advantage?





Allows “efficient” analysis of data
Competitive Advantage
Analysis aids strategic decision making
Increased productivity of decision makers
Potential high ROI

Quick decisions
 “The ultimate goal is simple: Give the
battlefield commander access to all the
information needed to win the war. And give it
to him when he wants it, where he wants it,
and how he wants it.”
 -- Gen. Colin L. Powell, “Information Warriors,”
BYTE, 1992



Retail
 Customer Loyalty
 Market Planning

Manufacturing
 Cost Reduction
 Logistics Management

Financial
 Risk Management
 Fraud Detection

Utilities
 Asset Management
 Resource Management

Airlines
 Route Profitability
 Yield Management

Government
 Manpower Planning
 Cost Control

Strategic information is needed to:
 formulate business strategies,
 establish goals,
 set objectives, and
 monitor results.

Examples of business objectives:
 Retain the present customer base
 Increase the customer base by 15% over the next 5 years
 Improve product quality levels in the top five product groups
 Gain market share by 10% in the next 3 years
 Enhance customer service level in shipments
 Bring three new products to market in 2 years
 Increase sales by 15% in the East Division
INTEGRATED
Must have a single, enterprise-wide view.
DATA INTEGRITY
Information must be accurate and must
conform to business rules
ACCESSIBLE
Easily accessible with intuitive access
paths, and responsive for analysis.
CREDIBLE
Every business factor must have one and
only one value.
TIMELY
Information must be available within the
stipulated time frame.

Ease
 It combines information from different, separate systems
in one location, which makes it easy to access.

Speed
 DW tables are specifically designed for quick response
time, and handle large quantities of data.
 Reports and other data are precalculated

Reliability
 A DW is a read-only database, which gives stability over time.

Flexibility
 Utilizing BI Tools

Data warehousing is a simple concept
 It is born out of the need for strategic
information and is the result of the search
for a new way to provide such information.


An Environment, Not a Product
A Blend of Many Technologies




A data warehouse is not a single software or hardware
product you purchase to provide strategic information.
It is a computing environment where users can find
strategic information,
an environment where users are put directly in touch
with the data they need to make better decisions.
It is a user-centric environment.

Characteristics of the new computing
environment called the data warehouse:
 An ideal environment for data analysis and decision support
 Fluid, flexible, and interactive
 100% user-driven
 Very responsive and conducive to the ask–answer–ask again pattern
 Provides the ability to discover answers to complex, unpredictable questions

The basic concept of data warehousing is:
 Take all the data from the operational systems.
 Where necessary, include relevant data from outside, such
as industry benchmark indicators.
 Integrate all the data from the various sources.
 Remove inconsistencies and transform the data.
 Store the data in formats suitable for easy access for
decision making.


A decision support database that is
maintained separately from the organization’s
operational databases.
A data warehouse is a
 subject-oriented,
 integrated,
 time-varying,
 non-volatile
collection of data that is used primarily in
organizational decision making

“A collection of integrated, subject-oriented
databases designed to supply the
information required for decision making.”
-- W. Inmon (1992)
“A data warehouse is a system that retrieves
and consolidates data periodically from the
source systems into a dimensional or
normalized data store. It usually keeps years of
history and is queried for business intelligence
or other analytical activities. It is typically
updated in batches, not every time a
transaction happens in the source system.”
-- Vincent Rainardi (2005)
“A data warehouse is simply a
single, complete, and consistent
store of data obtained from a
variety of sources and made
available to end users in a way
they can understand and use it in
a business context.”
Barry Devlin, IBM Consultant
[Figure: data warehouse architecture. Labels: Relational Databases, ERP Systems, Purchased Data, Legacy Data; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]


The primary concept of data warehousing is that the data
stored for business analysis can most effectively be accessed
by separating it from the data in the operational systems.
Fundamental differences between operational and
informational (DW) environments:
 Nature of the data
 Development cycle
 Supporting technology
 User community
 Processing characteristics





Subject-Oriented Data
Integrated Data
Time-Varying Data
Nonvolatile Data
Data Granularity


Data Warehouse is designed around
“subjects” rather than processes
A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System

DW will have a Sales Subject Area



Heterogeneous Source Systems
Little or no control
Need to Integrate source data
 For Example: Product codes could be different
in different systems

Arrive at common code in DW
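
The product-code example above can be sketched in a few lines. This is only an illustration under stated assumptions: the system names, local codes, common codes, and the `to_common_code` helper are all hypothetical and do not come from any particular warehouse product.

```python
# Hypothetical mapping from (source system, local product code)
# to a single warehouse-wide product code.
COMMON_CODE = {
    ("retail", "R-1001"): "P001",
    ("outlet", "OUT_77"): "P001",   # same product, different local code
    ("catalog", "C9"): "P002",
}


def to_common_code(system: str, local_code: str) -> str:
    """Return the common warehouse code, or raise if the code is unmapped."""
    try:
        return COMMON_CODE[(system, local_code)]
    except KeyError:
        raise ValueError(f"no common code for {system}:{local_code}") from None


# Sales rows as they might arrive from three separate source systems.
sales = [
    {"system": "retail", "product": "R-1001", "amount": 120.0},
    {"system": "outlet", "product": "OUT_77", "amount": 80.0},
    {"system": "catalog", "product": "C9", "amount": 45.5},
]

# Integrated rows carry only the common code, so they can be analyzed together.
integrated = [
    {"product": to_common_code(s["system"], s["product"]), "amount": s["amount"]}
    for s in sales
]
```

Raising on an unmapped code, rather than passing it through, is one possible design choice: it surfaces integration gaps during loading instead of during analysis.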


Most business analysis
has a time component
Trend Analysis (historical
data is required)

In a data warehouse it is efficient to keep data summarized
at different levels.
 Depending on the query, you can then go to the particular level
of detail and satisfy the query.

Data granularity in a data warehouse refers to the level of
detail.
 The lower the level of detail, the finer the data granularity.
 If we want to keep data at the lowest level of detail, we have to
store a lot of data in the data warehouse.

We will have to decide on the granularity levels based on the
data types and the expected system performance for
queries.
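
A small sketch of keeping the same data at two levels of granularity follows. The rows, the date format, and the `summarize` helper are assumptions made for illustration; a real warehouse would normally do this inside the database.

```python
from collections import defaultdict

# Hypothetical detail-level rows: (date "YYYY-MM-DD", product, units sold).
detail = [
    ("2024-01-05", "P001", 3),
    ("2024-01-05", "P001", 2),
    ("2024-01-20", "P001", 4),
    ("2024-02-02", "P001", 1),
]


def summarize(rows, key):
    """Total the units per (time bucket, product), where key() picks the bucket."""
    totals = defaultdict(int)
    for date, product, units in rows:
        totals[(key(date), product)] += units
    return dict(totals)


# Finer granularity: one row per day and product.
daily = summarize(detail, key=lambda d: d)
# Coarser granularity: one row per month and product (first 7 chars = "YYYY-MM").
monthly = summarize(detail, key=lambda d: d[:7])
```

A query about monthly trends can be answered from `monthly` alone, while a question about a specific day still needs `daily` or the detail rows, which is the storage versus performance trade-off described above.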



Data granularity refers to the level of detail.
Depending on the requirements, multiple levels of detail may be present.
Many data warehouses have at least dual levels of granularity.






Extract, Transform, Load (ETL) tools
DW databases & DBMS tools
Data marts
Metadata
DW administration & management tools
Information delivery system



Data Extraction
Data Cleaning
Data Transformation
 Convert from legacy/host format to warehouse
format

Load
 Sort, summarize, consolidate, compute views,
check integrity, build indexes, partition
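
The four steps above can be sketched end to end with Python's standard `sqlite3` module. Everything specific here is an assumption for illustration: the source rows, the fixed exchange rates in `RATE_TO_USD`, the `sales` table layout, and the function names. Real ETL tools add scheduling, auditing, and error handling that this sketch leaves out.

```python
import sqlite3

# Extraction: hypothetical rows as pulled from a legacy source (all text fields).
extracted = [
    {"id": "1", "product": " p001 ", "amount": "120.50", "currency": "USD"},
    {"id": "2", "product": "P002", "amount": "", "currency": "USD"},  # missing amount
    {"id": "3", "product": "P001", "amount": "100.00", "currency": "EUR"},
    {"id": "1", "product": " p001 ", "amount": "120.50", "currency": "USD"},  # duplicate
]

RATE_TO_USD = {"USD": 1.0, "EUR": 1.1}  # assumed fixed rates, illustration only


def clean(rows):
    """Cleaning: drop rows with a missing amount and rows whose id was already seen."""
    seen = set()
    for r in rows:
        if not r["amount"] or r["id"] in seen:
            continue
        seen.add(r["id"])
        yield r


def transform(rows):
    """Transformation: convert text fields to warehouse types and one currency."""
    for r in rows:
        amount_usd = round(float(r["amount"]) * RATE_TO_USD[r["currency"]], 2)
        yield (int(r["id"]), r["product"].strip().upper(), amount_usd)


def load(conn, rows):
    """Load: create the target table, insert the rows, and build an index."""
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_product ON sales (product)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extracted)))
loaded = conn.execute(
    "SELECT id, product, amount_usd FROM sales ORDER BY id"
).fetchall()
```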






Consumes 70-80% of project time
Heterogeneous Source Systems
Little or no control over source systems
Source systems scattered
Different currencies, measurement units
Ensuring data quality






A storage area where extracted data is
cleaned, transformed and deduplicated.
Initial storage for data
Need not be based on Relational model
Mainly sorting and Sequential processing
Does not provide data access to users
Analogy – kitchen of a restaurant

Commercial tools:
 Warehouse Builder (Oracle)
 MS Data Transformation Services
 SSIS (Microsoft)
 DataStage
 SAS ETL Server

Typical functions
 Define source, query (run SQL), define
transformation, define target, verify
transformation, schedule run, audit report



Almost always a relational DB
Oracle, DB2, Sybase, SQL Server
New DB design for special purpose of DW
(e.g., scale up, speed up, parallel
processing)



OLTP Systems are Data Capture Systems
“DATA IN” systems
DW are “DATA OUT” systems



The design of the DW must directly reflect the
way the managers look at the business
It should capture the measurements of
importance along with the parameters by
which these measurements are viewed
It must facilitate data analysis, i.e.,
answering business questions




Entity-relationship (ER) modeling
A logical design technique that seeks to
eliminate data redundancy
Illuminates the microscopic relationships
among data elements
Perfect for OLTP systems
Responsible for success of transaction
processing in Relational Databases
ER models are NOT suitable for DW?
 End user cannot understand or remember
an ER Model
 Many DWs have failed because of overly
complex ER designs
 Not optimized for complex, ad-hoc queries
 Data retrieval becomes difficult due to
normalization
 Browsing becomes difficult

Most relational databases are designed to 3rd normal
form
 1st Normal form – Tables have unique keys and no
repeating groups or multi-value fields
 2nd Normal form – Every attribute is dependent on
the entire key of the table
 3rd Normal form – Attributes are dependent only on
the key. No derived elements

Business needs to analyze data so that it can:
 Understand trends
 Predict future behavior and needs
 Personalize contact with customers
 Be competitive

All of this in a speedy manner, with the ability to
do “What if’s”

Data is not structured for analytical usage

Multiple Joins are resource intensive

Data from external sources and historical context
is missing, because it is not held in operational sources
“A structured repository of validated and
integrated historical information accessible
to business people to provide the basis for
both tactical and strategic business
decisions.”




Centralized extract and staging
Separate from operational system
Structured for analysis
Historically contexted
[Figure: data warehouse data flow. Labels: Relational Data, External Data, Enterprise Data; Acquisition, Staging, Cleaning, Transformation; Data Warehouse Storage; Data Distribution; Analytical Applications]

Detail Level
 Dimensional Normal form
 Value and feasibility

Analytical Level
 Structured for the required analyses

Summary Level
 Summaries for user requirements
 Better response time

Normalized for maintainability

De-normalized for performance, based on
rules

2 level structure, therefore only one level of
joins required for queries

Subject
 Fact
 Dimension
▪ Aspect / Factor
▪ Level of reality
▪ Lifelike quality





Facts are stored in FACT Tables
Dimensions are stored in DIMENSION
tables
Dimension tables contain textual
descriptors of the business
Fact and dimension tables form a Star
Schema
“BIG” fact table in center surrounded by
“SMALL” dimension tables
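
A minimal star schema can be sketched with Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`), their columns, and the sample rows are all hypothetical; they only illustrate one fact table in the center with small dimension tables around it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small dimension tables: textual descriptors of the business.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    region     TEXT
);
-- Central fact table: numeric, additive measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    sale_units  INTEGER,
    sale_amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_store VALUES (1, 'Downtown', 'East'), (2, 'Airport', 'West');
INSERT INTO fact_sales VALUES (1, 1, 10, 100.0), (2, 1, 5, 75.0), (1, 2, 2, 20.0);
""")

# One level of joins: the fact table joins directly to the dimension it needs.
by_region = conn.execute("""
    SELECT s.region, SUM(f.sale_units), SUM(f.sale_amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

Because the measures are additive, the same fact rows can be summed by any dimension attribute (region, category, and so on) without changing the schema.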






Measures or facts
Facts are “numeric” & “additive”
For example: Sale Amount, Sale Units
Factors or dimensions
Star Schemas
Snowflake & Starflake Schemas



Data mart = subset of the DW for a community of
users, e.g. the accounting department
Sometimes exists as a multidimensional
database
Info mart = summarized data + reports for
a community of users



Data about data
Field description, business rules (e.g.
profit=? formula), log of file updates
Help users understand content & locate
data




Security & priority
Keep track of updates
QC
Purging & copy to data mart


Security is a critical issue (users at many levels)
Some security measures to protect a DW
 Views = limit users to seeing certain rows/columns
 Access control = grant rights to specific users to
access selected data (can be set up by the DBA
through SQL commands such as GRANT/REVOKE)
 Admin controls such as group access, firewall,
encryption
 Audit = track what users are doing
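
The view-based measure in the list above can be sketched as follows. SQLite is used here only because it ships with Python; it does not implement GRANT/REVOKE, so the access-control step is described in a comment rather than executed. The `sales` table, the `sales_east` view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, customer TEXT, amount REAL, margin REAL);
INSERT INTO sales VALUES ('East', 'Acme', 100.0, 0.30), ('West', 'Bolt', 50.0, 0.10);

-- Hypothetical view: exposes only East-region rows and hides the margin column.
CREATE VIEW sales_east AS
    SELECT region, customer, amount
    FROM sales
    WHERE region = 'East';
""")

# In a server DBMS, the DBA would typically grant SELECT on the view
# (not on the base table) to the relevant users; that step is not shown here.
cur = conn.execute("SELECT * FROM sales_east")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
```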

Tools
 Query & reporting
 OLAP
 Data mining, visualization, segmentation,
clustering
 New developments: text mining, web mining &
personalization
 Mining multimedia data

Commercial tools
 MS SQL Server Business Intelligence, Oracle
Business Intelligence Suite, Crystal Reports,
Cognos Solution, WebFocus

Increasingly common mode of delivery:
 Web-enabled
Thank you