Introduction to Data Warehouse

Why Data Warehouse
Crisis of Credibility
Data storage is growing
Future Prediction through historical data
Intelligent Decision Support System for efficient
decision making
Questions for Data Warehouse
What are our five most attractive resources on the
site?
Users from what country loaded this resource most
of all over the course of the previous year?
Connections from what region generated most
outgoing traffic on the site for the last three
months?
Level of Aggregation for Dimensions
• Geography, which organizes the data by the
geographic locations the site users come from
• Resource, which categorizes the data related to the
site resources
• Time, which is used to aggregate traffic data across
time
Organization of Data to Answer the Questions
• Hierarchy of levels (with the highest level listed first):
• Geography dimensions:
– Region
– Country
• Resource dimensions:
– Group
– Resource
• Time dimensions:
– Year
– Month
– Day
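The dimension hierarchies above can be illustrated with a small roll-up. The following is a minimal sketch in Python, assuming a hypothetical list of traffic records; the field names, sample values, and the `roll_up` and `top_n` helpers are illustrative and do not come from the original slides.

```python
from collections import defaultdict

# Hypothetical traffic facts: one value per dimension level, plus a "hits" measure.
TRAFFIC = [
    {"region": "Europe", "country": "France", "group": "Docs",
     "resource": "faq.html", "year": 2023, "month": 1, "day": 5, "hits": 120},
    {"region": "Europe", "country": "France", "group": "Docs",
     "resource": "guide.html", "year": 2023, "month": 1, "day": 6, "hits": 80},
    {"region": "Europe", "country": "Germany", "group": "Downloads",
     "resource": "setup.zip", "year": 2023, "month": 2, "day": 1, "hits": 200},
    {"region": "Asia", "country": "Japan", "group": "Docs",
     "resource": "faq.html", "year": 2023, "month": 2, "day": 3, "hits": 50},
]


def roll_up(records, levels, measure="hits"):
    """Sum the measure for every combination of the given dimension levels."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[level] for level in levels)
        totals[key] += rec[measure]
    return dict(totals)


def top_n(totals, n):
    """Return the n largest (key, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Geography dimension at two levels of aggregation.
by_region = roll_up(TRAFFIC, ["region"])
by_country = roll_up(TRAFFIC, ["region", "country"])

# "What are our five most attractive resources on the site?"
top_resources = top_n(roll_up(TRAFFIC, ["resource"]), 5)
```

Choosing a shorter or longer list of levels is what moves the result up or down a hierarchy; the same helper serves the Geography, Resource, and Time dimensions.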
Data warehouse
• Provides an integrated, complete view of the enterprise,
drawing on historical data for efficient decision making
• The operational system does not affect the decision
support system, but works alongside it and supports decisions
• Gives a consistent and logical view of information
across the organization for effective short-term
and long-term policies
• Presents a flexible and interactive source of
strategic information
Data warehouse Definition
A Data Warehouse is a subject
oriented, integrated,
nonvolatile/non-updateable and
time variant collection of data
in support of management’s
decisions.
Data warehouse
Subject Oriented:
Organizations usually store operational data
by application, such as order processing,
student enrollment, or customer loans.
A data warehouse is instead organized around
key subjects in the organization, such as
customers, students, or patients.
Data warehouse
Integrated Data:
Data comes from various operational
systems.
Each operational system may use
different data formats or naming
conventions.
To produce effective results, the data
must be standardized and integrated
into a single data warehouse.
Data warehouse
Time Variant Data:
Data in the data warehouse is stored
along the dimension of time.
This helps to:
Analyze the past
Relate information to the present
Forecast the future
Data warehouse
Non-updateable/Non-volatile:
Data in the data warehouse is not
updated frequently by end users.
Instead, it is refreshed according to
business requirements at specific
intervals, such as fortnightly or
monthly.
Data warehouse Applications
Fraud Detection
Profitability Analysis
Credit Risk Prediction
Customer Retention Modeling
Yield Management
Data warehouse Architecture
• Data warehouse architecture is the proper arrangement of its
building blocks, or main components.
• The three major components are:
– Data Acquisition
– Data Storage
– Information Delivery
• These are further divided into:
– Source Data
– Data Staging
– Data Storage
– Information Delivery
– Metadata
– Management and Control
Data warehouse Architecture
• Source data falls into four major categories:
• Production Data
– In an operational system, data is accessed in predictable ways. The data warehouse
holds large volumes of data coming from various units, so the main challenge is
handling data disparity.
• Internal Data
– Maintained as individual files and spreadsheets. This adds complexity, and
a mechanism has to be developed to acquire internal data.
• Archived Data
– Operational systems store older data in archive files. These archives are
essential for data warehousing.
• External Data
– Data from external sources is required for efficient decision making, e.g. a car
rental company acquires data from leading car manufacturers for fleet
management.
Data warehouse Architecture
Data Staging
Motivation: Data in the data warehouse
comes from different sources and is
subject-oriented, so it cuts across operational
procedures according to the subject of interest.
Therefore, data acquired from different
sources needs to be prepared, changed,
converted, and made ready as a single
source for queries and analysis.
Data warehouse Architecture
Data Staging
The three main operations are:
Data Extraction
Data Transformation
Data Loading
Together these are known as ETL (Extract, Transform, Load).
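The three staging operations can be sketched as a small pipeline. This is a minimal illustration in Python using only the standard library; the CSV content, the `sales` table, and the function names `extract`, `transform`, and `load` are assumptions made for the example, not part of the original material.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = """customer,amount,order_date
 alice ,10.50,2023-01-05
BOB,7.25,2023-01-06
"""


def extract(text):
    """Extraction: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim and normalize names, convert amounts to numbers."""
    return [
        (row["customer"].strip().title(), float(row["amount"]), row["order_date"])
        for row in rows
    ]


def load(conn, rows):
    """Loading: write the prepared rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the warehouse storage.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```

Keeping the three steps as separate functions mirrors the staging area: each step can be replaced (a purchased extraction tool, a different target database) without touching the others.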
Data warehouse Architecture
Data Extraction
Data extraction deals with numerous data
sources, where data resides in different
formats.
Some of the data is retrieved from legacy
systems.
Other data may come from different data
models, such as network or hierarchical.
Data extraction tools may be purchased
or developed in-house.
Data warehouse Architecture
• Data Transformation
Data conversion is an important step in data
warehousing.
• Data is acquired from different sources.
• Ongoing changes in the source data need to be
captured over time.
Clean Data
– Correction of spellings
– Resolution of conflicts between domain values, e.g.
different zip codes from different data sources
– Provision of missing values
– Elimination of duplicate data acquired
from different sources
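The cleaning steps listed above can be sketched as follows. This is a hedged example in Python: the records, the spelling table, the source-priority rule for resolving conflicting zip codes, and the `UNKNOWN` placeholder are all assumptions chosen for illustration; a real project would define its own rules.

```python
# Hypothetical customer records gathered from two source systems.
RECORDS = [
    {"id": 1, "city": "Sprngfield", "zip": "62701", "source": "billing"},
    {"id": 1, "city": "Springfield", "zip": "62704", "source": "crm"},
    {"id": 2, "city": "Riverton", "zip": None, "source": "crm"},
]

SPELLING_FIXES = {"Sprngfield": "Springfield"}  # assumed correction table
SOURCE_PRIORITY = {"crm": 0, "billing": 1}      # assumed rule: lower value wins
DEFAULT_ZIP = "UNKNOWN"                         # assumed placeholder for missing values


def clean(records):
    """Apply spelling fixes, fill missing values, and keep one record per id."""
    merged = {}
    # Visit the most trusted source first so its values win any conflict.
    for rec in sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        fixed = dict(rec)
        fixed["city"] = SPELLING_FIXES.get(fixed["city"], fixed["city"])
        if fixed["zip"] is None:
            fixed["zip"] = DEFAULT_ZIP
        # setdefault keeps the first record seen per id, dropping duplicates.
        merged.setdefault(fixed["id"], fixed)
    return [merged[key] for key in sorted(merged)]


cleaned = clean(RECORDS)
```

Here the conflict between the two zip codes for customer 1 is resolved by trusting the `crm` source; other policies (most recent value, manual review) would fit the same structure.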
Data warehouse Architecture
Data Standardization
Syntax Standardization
– Data types
– Data lengths
Semantic Standardization
– Synonyms: two terms for the same thing
– Homonyms: a single term for two different
things
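Both kinds of standardization can be shown in one small function. The sketch below is in Python and rests on assumed rules: the synonym table, the homonym mapping keyed by source system, the field names, and the ten-character name length are hypothetical examples, not values from the slides.

```python
# Semantic rule for synonyms: several source terms map to one warehouse term.
SYNONYMS = {"client": "customer", "cust": "customer", "customer": "customer"}

# Semantic rule for homonyms: the same field name means different things
# depending on the source system, so it is renamed per source.
HOMONYMS = {("orders", "date"): "order_date", ("shipping", "date"): "ship_date"}

NAME_LENGTH = 10  # assumed standard length for the name field


def standardize(source, record):
    """Return a copy of the record with standardized names, types, and lengths."""
    out = {}
    for field, value in record.items():
        out[HOMONYMS.get((source, field), field)] = value
    out["party_type"] = SYNONYMS[out["party_type"].strip().lower()]
    out["name"] = out["name"].strip()[:NAME_LENGTH]  # syntax: data length
    out["amount"] = float(out["amount"])             # syntax: data type
    return out


a = standardize("orders", {"party_type": "Client", "name": "  Alexandria Smith ",
                           "amount": "12.50", "date": "2023-01-05"})
b = standardize("shipping", {"party_type": "cust", "name": "Bo",
                             "amount": 3, "date": "2023-01-09"})
```

After standardization both records use the term `customer`, and the ambiguous `date` field has become `order_date` in one and `ship_date` in the other.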
Data warehouse Architecture
At the final stage of data
transformation we achieve a
single collection of integrated
data that is cleaned,
standardized and
summarized
Data warehouse Architecture
Data Loading:
Initially, the data is loaded in
large volume.
Subsequent incremental loads and
revisions keep the data warehouse
up to date.
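The difference between the initial load and later incremental loads can be sketched with SQLite from the Python standard library. The `sales` table and its columns are hypothetical, and using `INSERT OR REPLACE` to apply revisions is one possible choice for the example, not a prescribed method.

```python
import sqlite3

# An in-memory database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")


def initial_load(conn, rows):
    """Initial load: bulk-insert the full historical data set."""
    conn.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", rows)
    conn.commit()


def incremental_load(conn, rows):
    """Incremental load: add new rows and apply revisions to existing ones."""
    conn.executemany(
        "INSERT OR REPLACE INTO sales (sale_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()


initial_load(conn, [(1, 10.0), (2, 20.0), (3, 30.0)])
# Later refresh: sale 3 was revised and sale 4 is new.
incremental_load(conn, [(3, 35.0), (4, 40.0)])

rows = conn.execute("SELECT sale_id, amount FROM sales ORDER BY sale_id").fetchall()
```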
Data warehouse or Data mart
A data mart is a bottom-up approach; a data warehouse is a top-down approach.
Data Mart
A data warehouse that is limited in
scope, whose data are obtained by
selecting and summarizing data
from a data warehouse or from
separate extract, transform, and
load processes from source data
systems.
Data Mart and Data warehouse
A data mart, in this practical
approach, is a logical subset of the
complete data warehouse, a sort of
pie-wedge of the whole data
warehouse. A data warehouse,
therefore, is a conformed union of
all data marts.
Data Mart and Data warehouse
Individual data marts are targeted
to particular business groups in the
enterprise, but the collection of all
the data marts forms an integrated
whole, called the enterprise data
warehouse.
Three Dimensional Modeling as Informational
Cubes
Query Steps in an Analysis
OLAP functions in database without moving
data outside of database
OLAP Systems
(Online Analytical Processing Systems)
Definition: On-Line Analytical Processing (OLAP)
is a category of software technology that enables
analysts, managers and executives to gain insight
into data through fast, consistent, interactive
access in a wide variety of possible views of
information that has been transformed from raw
data to reflect the real dimensionality of the
enterprise as understood by the user.
Roll-up
Drill down
Slice & Dice
OLAP Systems
(Online Analytical Processing Systems)
OLAP is a fancy name for multi-dimensional analysis
Star Schema
A simple database design in
which dimensional data are
separated from fact or event data.
A dimensional model is another
name for a star schema.
Fact Table and Dimension Table
• A star schema consists of two types of tables:
one fact table and one or more dimension tables
• A fact table holds factual, numerical, or measured
data, such as the number of orders booked or units sold
• A dimension table holds descriptive data;
its attributes are used to aggregate or
summarize the data in the fact table
• A data mart might contain any number of star
schemas with similar dimensions but different
types of facts
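A star schema of this kind can be sketched with SQLite from the Python standard library. The table names (`sales_fact`, `product_dim`, `time_dim`), their columns, and the sample rows are hypothetical; the point is only to show one fact table joined to its dimension tables and summarized by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold the descriptive attributes.
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The fact table holds the measures plus a key into each dimension.
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    time_id    INTEGER REFERENCES time_dim(time_id),
    units_sold INTEGER
);

INSERT INTO product_dim VALUES (1, 'Hat', 'Clothing'), (2, 'Coat', 'Clothing'),
                               (3, 'Mug', 'Kitchen');
INSERT INTO time_dim VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO sales_fact VALUES (1, 1, 5), (2, 1, 3), (3, 1, 4), (1, 2, 7);
""")

# Summarize the fact table using attributes taken from the dimension tables.
units_by_category_month = conn.execute("""
    SELECT p.category, t.month, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN time_dim AS t ON t.time_id = f.time_id
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
""").fetchall()
```

The fact table stays narrow (keys and measures only), while every attribute used in `GROUP BY` lives in a dimension table.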
Fact Table and Dimension Table
Typical Business Dimensions
are Products, Customer and
Time
A simple star schema
Three dimensional display of data
Drill Down/Rollup
• Rollup: rolling up a dimension to see higher-level aggregate
values
• Drill Down: looking at more detailed data through the dimension
cube
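Roll-up and drill-down along a single hierarchy can be sketched as follows. This Python example assumes a hypothetical set of daily sales keyed by (year, month, day); the `aggregate` and `drill_down` helpers are illustrative names, not functions of any OLAP product.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by (year, month, day).
DAILY_SALES = {
    (2023, 1, 5): 10,
    (2023, 1, 20): 15,
    (2023, 2, 3): 7,
    (2024, 1, 9): 4,
}


def aggregate(sales, depth):
    """Total the sales at the first `depth` levels of the Year > Month > Day hierarchy."""
    totals = defaultdict(int)
    for key, value in sales.items():
        totals[key[:depth]] += value
    return dict(totals)


def drill_down(sales, prefix):
    """Show the next, more detailed level underneath the given prefix."""
    depth = len(prefix) + 1
    return {
        key: value
        for key, value in aggregate(sales, depth).items()
        if key[:len(prefix)] == prefix
    }


by_year = aggregate(DAILY_SALES, 1)              # roll up all the way to Year
by_month = aggregate(DAILY_SALES, 2)             # roll up only to Month
months_2023 = drill_down(DAILY_SALES, (2023,))   # drill from 2023 into its months
```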
Slice-and-Dice or Rotation
• Months are displayed as rows, products as columns, and stores
as pages
• Each page shows the sales of one store
• The data model corresponds to a physical cube with these data
elements as its primary edges
• A slice is a two-dimensional plane of the cube
• In normalization, we analyze the associations between attributes and,
based on that analysis, group the attributes into tables
and relationships
Slice and Dice
Now rotate the cube so that products are along the Z-axis, months are along the X-axis,
and stores are along the Y-axis. The slice we are considering also rotates. What happens to
the display page that represents the slice? Months are now shown as columns and stores as
rows. The display page represents the sales of one product, namely product: hats.
You can go to the next rotation so that months are along the Z-axis, stores are along the
X-axis, and products are along the Y-axis. The slice we are considering also rotates. What
happens to the display page that represents the slice? Stores are now shown as columns
and products as rows. The display page represents the sales of one month, namely month:
January.
What is the great advantage of all of this for the users? With each
rotation, the users can look at page displays representing different versions of the slices in
the cube. The users can view the data from many angles, understand the numbers better,
and arrive at meaningful conclusions.
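The rotations described above can be sketched in Python by treating the cube as a mapping from (month, product, store) to a sales figure. The store names and numbers are hypothetical, and the `page` helper is an illustrative name; only the dimensions, the product "hats", and the month "January" come from the text.

```python
DIMS = ("month", "product", "store")

# Hypothetical sales cube: (month, product, store) -> units sold.
CUBE = {
    ("January", "hats", "New York"): 5,
    ("January", "coats", "New York"): 2,
    ("February", "hats", "New York"): 6,
    ("January", "hats", "Boston"): 3,
    ("February", "coats", "Boston"): 4,
}


def page(cube, page_dim, page_value, row_dim, col_dim):
    """Return one slice of the cube as {row: {column: value}}."""
    p, r, c = (DIMS.index(dim) for dim in (page_dim, row_dim, col_dim))
    table = {}
    for key, value in cube.items():
        if key[p] == page_value:
            table.setdefault(key[r], {})[key[c]] = value
    return table


# Starting view: months as rows, products as columns, one store per page.
store_page = page(CUBE, "store", "New York", "month", "product")
# First rotation: stores as rows, months as columns, page for product "hats".
hats_page = page(CUBE, "product", "hats", "store", "month")
# Second rotation: products as rows, stores as columns, page for month "January".
january_page = page(CUBE, "month", "January", "product", "store")
```

The underlying data never changes between the three calls; only the choice of which dimension is fixed as the page and which two are laid out as rows and columns.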