COT5230 Data Mining
Week 9
The Data Warehouse (DW) and Business Intelligence (BI)
Monash: Australia’s International University
The Data Warehouse (DW) and Business Intelligence (BI) 9.1
Lecture Outline
 Overview of Data Warehousing
 Data Warehouse Architecture
 Overview of Business Intelligence (BI)
 OLAP
What is a DW?
 A data store to support data analysis or decision
support
– Decision support:
» a methodology to extract information from data
– Decision support system:
» an arrangement of computerized tools to assist in managerial
decision making
 Answers questions by combining historical
operational data with a business data model that
reflects business activity
 Data may come from both operational and
external sources
– external data - e.g. industry average salaries
Data Warehouse Definitions - 1
 The information in a DW is subject-oriented, nonvolatile, and of an historic nature, and so DWs
tend to contain extremely large datasets
 The purpose of the DW is to provide the tools and
facilities to manage and deliver complete, timely,
accurate, and understandable business
information to authorized individuals for effective
business decision making
 DW implementation needs a company-wide effort
that requires user involvement and commitment
at all levels
 A successful DW implementation tracks return on
investment
Data Warehouse Definitions - 2
 A DW is a concept not a product
– It is the compiling, assembling, and consolidating of
application data common to user communities at a
single logical point
 Typical use includes ad hoc queries, “what if”,
data matching, trend analysis and other
sophisticated information functions
 Warehouse data is typically extracted from OLTP
systems
 A DW can be described as a read-only database
that provides users with access to consolidated,
historic, or static data extracted from operational
databases, usually augmented with external data
Operational Data vs. the DW - 1
 Integration
– Data found within the DW is ALWAYS integrated, e.g.
» encoding, measurements of attributes, etc. are standardized
 Normalized vs. denormalized
– Operational data is normalized
– DW data is typically denormalized
 Timespan
– Operational data is current
– DW data is historical
 Granularity
– Operational data is at transaction level
– DW data is at an aggregation level
Operational Data vs. the DW - 2
 Dimensionality
– operational data is clustered according to functional
requirements, e.g. all orders to be delivered to a
particular suburb
– data analyst requires access to all dimensions
 Use
– DW is read only
MIS, or Before the DW
 MIS: Management Information System
 required detailed knowledge of the operational
systems
 no Business Information Directory
 data quality is ad hoc
 limited data integration from source systems
 integration and querying performed by MIS
specialists using 3+GL tools such as SAS
 or at best performing queries using SQL against
images of unintegrated operational databases
Inmon’s 12 Rules - 1
 DW and operational environments are separated
 Integrated DW data
 DW contains historical data
 DW is snapshot data captured at a particular point
in time
 DW data is subject-oriented
Inmon’s 12 Rules - 2
 No online update
 DW SDLC is data-driven
 DW contains several levels of data - raw to
summarized
 Data sources are traced
 Meta-data is a critical component
 DW contains a charge back mechanism
DW Architecture
[Figure: DW architecture. Stages: Source Systems (authoritative source, external systems); Load; Extract / Enhance / Transform Layer; Data Warehouse; Customise; DataMarts; Delivery to user; plus a Business Information Directory.]
[Figure annotations: copy management, extract, transform, process once; business rules, consistency & controls; separates data from application; fully modelled & documented; parallel process; enterprise single image data view; build data for appropriate datamart; denormalize for specific use; meets specific OLAP requirements; industry standard tools; tailored applications where appropriate; value add.]
Source Systems/Authoritative Source
 must first identify authoritative source data
 Authoritative Source
– atomic data from the creating/owning source system
 data propagation must be subject to a delivery
contract
 data propagation is asynchronous
– no reverse propagation
– no periodic synchronization
 delivery must have minimal impact on operational
systems
Extract/Enhance/Transform Layer
 must create integrated and standardized data
 deduping process happens here
 denormalize into a format for direct loading into
the DW
 cleanse
– must remove semantic and syntactic inconsistencies
– return invalid data to the source system for repair
 requires a data quality process
 simple business transformations
 addition of surrogate keys and time variance
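The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the record layout, the GENDER_CODES mapping and the function name transform are all hypothetical, introduced only for the example.

```python
from datetime import date

# Hypothetical source rows from operational systems that encode
# gender differently and contain one duplicate customer.
source_rows = [
    {"cust_no": "C100", "name": "Ann Lee", "gender": "F"},
    {"cust_no": "C100", "name": "Ann Lee", "gender": "female"},
    {"cust_no": "C200", "name": "Bob Ray", "gender": "M"},
    {"cust_no": "C300", "name": "", "gender": "X?"},
]

# Assumed standard encoding for the warehouse.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def transform(rows, load_date):
    """Standardize, cleanse, dedupe, then add surrogate keys and a load date."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        gender = GENDER_CODES.get(row["gender"].strip().lower())
        if gender is None or not row["name"].strip():
            rejected.append(row)  # to be returned to the source system for repair
            continue
        if row["cust_no"] in seen:
            continue  # duplicate of a row already accepted
        seen.add(row["cust_no"])
        clean.append({
            "cust_key": len(clean) + 1,  # surrogate key
            "cust_no": row["cust_no"],   # natural key from the source
            "name": row["name"].strip(),
            "gender": gender,            # standardized encoding
            "load_date": load_date,      # time variance
        })
    return clean, rejected


clean_rows, rejected_rows = transform(source_rows, date(1999, 12, 1))
```

Here two rows survive with surrogate keys 1 and 2, the duplicate C100 row is dropped, and the invalid C300 row is set aside for repair rather than loaded.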
Handling Inserts/Deltas - 1
 Scenarios
– additions to a (1) New or (2) Existing partition
– partitions are (1) Atomic or (2) Aggregates
 New partition - atomic or aggregate
– work off-line
– do summation outside of the database and use efficient
tools, e.g. Syncsort or C
– then load using SQL*LOADER
Handling Inserts/Deltas - 2
 Updates to an existing partition
– Atomic Partition
» Unload, Sort, Reload or
» Insert directly into DB - concurrency issues
– Aggregate Partition
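» Example: records R1 (X = 1) and R2 (X = 2) are summarized as X = 3, which is stored in the database; a new record R3 (X = 1) then arrives
» Update directly to DW
» Unload and update out of the database
» Keep source data, then re-sort and re-sum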
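The aggregate-partition options can be compared in a small sketch. It assumes the slide's figure means that records R1 and R2 carry X values 1 and 2, that their sum 3 is what the database stores, and that a new record R3 with X = 1 then arrives. The variable names are hypothetical.

```python
# Records already summarized in the DW (values taken from the slide).
source = {"R1": 1, "R2": 2}
stored_sum = sum(source.values())  # the aggregate held in the database

# A new record arrives for the same aggregate partition.
delta = {"R3": 1}

# Option: apply the delta directly to the stored aggregate.
updated_in_place = stored_sum + sum(delta.values())

# Option: keep the source data, merge the delta, and re-sum everything.
source.update(delta)
recomputed = sum(source.values())
```

Both routes give the same total for a sum. The in-place route only works when the aggregate can be updated from the delta alone; an average, for instance, would need the stored sum and count rather than the stored average.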
The Data Warehouse
 contains atomic data
 Star Schema structure
– contains
» Facts
» Dimensions
» Attributes - Surrogate keys
» Attribute Hierarchies
 Key Issues
– size
– data retention period - YTD
– backup and recovery
– security
Star Schemas
 a data modeling technique used to map decision
support data into a relational database
 this structure is based on the premise that a
highly normalized data structure does not serve
advanced data analysis requirements well
[Figure: star schema. A central fact table SALES holds the keys Cust#, Loc#, Prod# and SalesrepID, each linking to one dimension table: DimA Customer, DimB Product, DimC Salesrep, DimD Location.]
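A star schema like the one on this slide can be sketched with Python's built-in sqlite3 module. The table names follow the slide; the column names other than the keys, the measure amount, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no TEXT PRIMARY KEY, cust_name TEXT);
CREATE TABLE product  (prod_no TEXT PRIMARY KEY, prod_name TEXT);
CREATE TABLE salesrep (salesrep_id TEXT PRIMARY KEY, rep_name TEXT);
CREATE TABLE location (loc_no TEXT PRIMARY KEY, state TEXT);

-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales (
    cust_no     TEXT REFERENCES customer(cust_no),
    prod_no     TEXT REFERENCES product(prod_no),
    salesrep_id TEXT REFERENCES salesrep(salesrep_id),
    loc_no      TEXT REFERENCES location(loc_no),
    amount      REAL
);

INSERT INTO customer VALUES ('C100', 'Acme');
INSERT INTO product  VALUES ('P100', 'Nuts'), ('P200', 'Bolts');
INSERT INTO salesrep VALUES ('S1', 'Kim'), ('S2', 'Lee');
INSERT INTO location VALUES ('L1', 'VIC'), ('L2', 'TAS');
INSERT INTO sales VALUES
    ('C100', 'P100', 'S1', 'L1', 510),
    ('C100', 'P100', 'S2', 'L2', 490),
    ('C100', 'P200', 'S1', 'L1', 2000);
""")

# A star join: the fact table joins directly to each dimension it needs.
rows = conn.execute("""
    SELECT p.prod_name, l.state, SUM(s.amount)
    FROM sales s
    JOIN product  p ON p.prod_no = s.prod_no
    JOIN location l ON l.loc_no  = s.loc_no
    GROUP BY p.prod_name, l.state
    ORDER BY p.prod_name, l.state
""").fetchall()
```

Every dimension is one join away from the fact table, so a query only pays for the dimensions it actually uses.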
Snowflake Schemas
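[Figure: snowflake schema. The fact table SALES (keys include Prod# and SalesrepID) links to the dimension tables DimA Customer, DimB Product, DimC Salesrep and DimD Location; the Customer dimension is further normalized into Customer State, Customer Category and Customer Address tables.]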
Fact Tables
 Facts measure something of interest to an enterprise
– atomic level or transactional data
– summarization will reduce volume but may lose information
Summarized facts:
CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000
Atomic facts:
CUST#  PROD#  SALESREP  DATE  COST
C100   P100   S1        1/12  $510
C100   P100   S2        2/12  $490
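The trade-off between atomic and summarized facts can be shown with the slide's own figures. In this small sketch the tuple layout and variable names are assumptions; the values are the two atomic rows for customer C100 and product P100.

```python
from collections import defaultdict

# Atomic fact rows from the slide: (CUST#, PROD#, SALESREP, DATE, COST).
atomic = [
    ("C100", "P100", "S1", "1/12", 510),
    ("C100", "P100", "S2", "2/12", 490),
]

# Summarize by (CUST#, PROD#); SALESREP and DATE are dropped.
totals = defaultdict(int)
for cust, prod, _rep, _date, cost in atomic:
    totals[(cust, prod)] += cost
```

Two atomic rows collapse into one summary row totalling $1000, matching the summarized table. The summary is smaller, but it can no longer say which salesrep made which sale or on which date.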
Dimensions
 drill down to atomic data from dimensions or
reference tables
 A Query
– List sales of Product P100 for each State for each
Month of 1999?
[Figure: the query expressed as dimension constraints. Product: P#=P100 (PName Nuts, PCat); Location: State=Each (Region); Time: Year=1999, Month=Each.]
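The query on this slide can be written as SQL against a small star schema, here run through Python's sqlite3 module. The table and column names, the PCat and Region values, and all sample rows are hypothetical; only the constraints (product P100, each state, each month of 1999) come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (p_no TEXT PRIMARY KEY, p_name TEXT, p_cat TEXT);
CREATE TABLE location (loc_id INTEGER PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales    (p_no TEXT, loc_id INTEGER, time_id INTEGER, amount REAL);

INSERT INTO product  VALUES ('P100', 'Nuts', 'Hardware'), ('P200', 'Bolts', 'Hardware');
INSERT INTO location VALUES (1, 'VIC', 'South'), (2, 'TAS', 'South');
INSERT INTO time_dim VALUES (1, 1999, 1), (2, 1999, 2), (3, 1998, 12);
INSERT INTO sales VALUES
    ('P100', 1, 1, 100), ('P100', 1, 1, 50),
    ('P100', 2, 2, 70),  ('P100', 1, 3, 999),
    ('P200', 1, 1, 30);
""")

# Fixed values become WHERE filters; "Each" becomes a GROUP BY column.
query = """
    SELECT l.state, t.month, SUM(s.amount) AS total
    FROM sales s
    JOIN product  p ON p.p_no    = s.p_no
    JOIN location l ON l.loc_id  = s.loc_id
    JOIN time_dim t ON t.time_id = s.time_id
    WHERE p.p_no = ? AND t.year = ?
    GROUP BY l.state, t.month
    ORDER BY l.state, t.month
"""
result = conn.execute(query, ("P100", 1999)).fetchall()
```

The 1998 sale and the P200 sale are filtered out, and the two January sales in VIC are summed into one row.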
Attributes & Attribute Hierarchies
 each dimension table contains attributes
 surrogate keys are commonly added to improve
performance of joins between Fact tables and
their associated Dimensions
 attributes are used to search, filter or classify
facts
 Attribute Hierarchies: classification attributes, e.g.
SALES_REGION grouping the states VIC and TAS
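An attribute hierarchy lets facts recorded at one level be rolled up to the level above. A minimal sketch, assuming a state-to-region mapping; the region names, the extra state NSW and the sales figures are hypothetical.

```python
# Hypothetical hierarchy: each state belongs to one sales region.
STATE_TO_REGION = {"VIC": "SOUTH", "TAS": "SOUTH", "NSW": "EAST"}

sales_by_state = {"VIC": 150, "TAS": 70, "NSW": 200}

# Roll up: replace each state by its parent region and sum.
sales_by_region = {}
for state, amount in sales_by_state.items():
    region = STATE_TO_REGION[state]
    sales_by_region[region] = sales_by_region.get(region, 0) + amount
```

Drilling down is the reverse direction: starting from a region total and listing the state figures that make it up.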
Datamarts/Customization/Cubes
 customization - select only the attributes and rows
of interest for export to a datamart or data cube
 apply coding techniques to the attributes of
interest, suited to the search algorithm to be used
 each cell of a cube is a view consisting of an
aggregation of interest
– e.g. TOTAL_SALES
 used as a performance improving technique to
– pre-aggregate group-by cells
– remove data not required for the problem at hand from
the search algorithm
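Pre-aggregating group-by cells can be sketched as follows. This is one simple way to build a cube, in the spirit of the data cube operator of Gray et al. listed in the references, not a description of any particular product. The dimension names, the sample facts and the function build_cube are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "state", "month")

facts = [
    {"product": "P100", "state": "VIC", "month": 1, "sales": 100},
    {"product": "P100", "state": "TAS", "month": 1, "sales": 70},
    {"product": "P200", "state": "VIC", "month": 2, "sales": 30},
]


def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            cells = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                cells[key] += row[measure]
            cube[group] = dict(cells)
    return cube


cube = build_cube(facts, DIMENSIONS, "sales")
total_sales = cube[()][()]              # grand total over all facts
vic_sales = cube[("state",)][("VIC",)]  # one pre-aggregated group-by cell
```

With three dimensions there are eight group-bys, from the grand total to the full detail. Each later lookup is a dictionary access instead of a scan, which is the performance gain the slide refers to; the cost is storage that grows quickly with the number of dimensions.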
Business Intelligence & The DW
 most enterprises have a data repository to allow
data analysis to occur
 databases provide enabling techniques
– efficient data storage and access
– query optimization
 about 80% of the effort in knowledge discovery in
databases (KDD) is the preparation of the data - this
is the role of the data warehouse
 the evolution of the desktop, database, networks
and AI/search has made it possible to perform
KDD in commercial databases
The BI Process - 1
 Understand and define the process
 Perform data collection and extraction
 Perform Data Cleaning and Exploration
 Data Engineering
– select attributes of interest
– select records of interest
– map attributes to suit DM algorithms
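The three data engineering steps can be sketched as below. The record layout, the selection rule (customers with positive spend) and the encoding choices are assumptions for illustration; a real project would choose them to suit the data mining algorithm in use.

```python
records = [
    {"cust": "C100", "state": "VIC", "age": 34, "spend": 1000, "phone": "555-0100"},
    {"cust": "C200", "state": "TAS", "age": 51, "spend": 0,    "phone": "555-0101"},
    {"cust": "C300", "state": "VIC", "age": 27, "spend": 250,  "phone": "555-0102"},
]

ATTRIBUTES = ("state", "age", "spend")  # attributes of interest
STATES = ("TAS", "VIC")                 # known category values


def engineer(rows):
    """Select records and attributes, then encode them numerically."""
    out = []
    for row in rows:
        if row["spend"] <= 0:           # select records of interest
            continue
        selected = {a: row[a] for a in ATTRIBUTES}
        # One-hot encode the categorical attribute.
        vector = [1.0 if selected["state"] == s else 0.0 for s in STATES]
        vector.append(selected["age"] / 100.0)  # crude scaling of age
        vector.append(float(selected["spend"]))
        out.append(vector)
    return out


vectors = engineer(records)
```

The customer with no spend is dropped, the identifier and phone number are left out, and each remaining record becomes a purely numeric vector.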
The BI Process - 2
 Algorithm Engineering
– which algorithm to use
– ability to deal with
» quality of input
» quality of output
» performance
 Run the data mining algorithm
 Preliminary evaluation of the results
 Refine the data and the problem
 Use the results to implement a business strategy
A BI Model
[Figure: a BI model, with elements Learning, Analysis, Discovery, Pattern Recognition, Variables, Model, Prediction/Verification, Adaptive Modelling, Answer.]
Return on Investment = (Profit from targeted customers buying Product X) / (Cost of Producing the Model and Predicting the Answer)
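The return on investment ratio from this slide is simple to compute. The function name and the figures below are hypothetical and serve only to show the arithmetic.

```python
def return_on_investment(profit, cost):
    """Ratio of profit gained to the cost of modelling and prediction."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return profit / cost


# Hypothetical figures, for illustration only.
roi = return_on_investment(profit=50_000, cost=10_000)
```

A ratio above 1 means the targeted campaign earned more than the model cost to build and run.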
DM Techniques
 Verification Driven Data Mining Techniques
– Naive evaluation - exhaustive search
– Random walk
– ad hoc query
– OLAP
– Hypothesis testing - statistics
 Discovery Driven Data Mining Techniques
– Statistical Modeling (e.g. linear regression)
– Visualization
– Rule-based and inductive learning
– Neural networks
– Genetic algorithms (an optimization technique)
OLAP: On-Line Analytical Processing
 an environment for the analysis of multidimensional data
– dice
– rotate
– drill-down
– rollup
 OLAP provides advanced database support
involving attribute selection, attribute encoding,
row sampling, data cleansing and allows the use
of multiple different search engines
– easy-to-use user interface
– open system architecture using local processing power
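The four operations named on this slide can be sketched on a tiny in-memory fact list. The helper names aggregate and dice, the dimension names and the sample data are assumptions; OLAP tools perform the same operations against a cube or star schema.

```python
from collections import defaultdict

facts = [
    {"product": "P100", "state": "VIC", "month": 1, "sales": 100},
    {"product": "P100", "state": "TAS", "month": 1, "sales": 70},
    {"product": "P100", "state": "VIC", "month": 2, "sales": 50},
    {"product": "P200", "state": "VIC", "month": 1, "sales": 30},
]


def aggregate(rows, dims):
    """Sum sales grouped by the given dimensions."""
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[d] for d in dims)] += row["sales"]
    return dict(cells)


def dice(rows, **allowed):
    """Keep rows whose dimension values fall in the allowed sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


rollup = aggregate(facts, ("product",))             # coarser view
drilldown = aggregate(facts, ("product", "state"))  # finer view
rotated = aggregate(facts, ("state", "product"))    # same cells, axes swapped
diced = dice(facts, product={"P100"}, state={"VIC"})
```

Roll-up and drill-down differ only in how many dimensions are kept in the grouping, while rotate changes the order of the axes without changing any cell value.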
References
 Rob, P. & Coronel, C. Database Systems: Design, Implementation,
and Management, 3rd Ed., Nelson, 1997
 Inmon, W. H. - numerous. See
http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example
 Kimball, R. - numerous
 Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data
Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii
International Conference on System Sciences, 1998
 Lee, A. J. and Rundensteiner, E. A. Data Warehouse Evolution:
Consistent Metadata Management.
 Gray, J. et al. Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and
Knowledge Discovery 1, pp. 29-53, 1997
 Maier, D. et al. Selected Research Issues in Decision Support
Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998