Download Week5 - Information Management and Systems

Document related concepts

Entity–attribute–value model wikipedia , lookup

Relational model wikipedia , lookup

Object-relational impedance mismatch wikipedia , lookup

Functional Database Model wikipedia , lookup

Database model wikipedia , lookup

Transcript
Data-Driven Business
Intelligence Systems:
Part I
Week 5
Dr. Jocelyn San Pedro
School of Information Management &
Systems
Monash University
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Lecture Outline







Data-driven BIS
Data warehouse
Data warehouse architectures
Entity-Relationship Modelling
Multi-dimensional Modelling
Star Schema
An Example: Retail Trading
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
2
Learning Objectives
At the end of this lecture, the students will


Have better understanding of concepts, tools and
technology underlying data-driven business
intelligence systems
Have knowledge of multidimensional modelling and
star schema for data modelling for data warehouses
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
3
Data-Driven Business
Intelligence Systems
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
4
Data-Driven BIS
Data-driven BIS
 information systems that provide BI through access
and manipulation of large databases of structured data
 includes tools for
 “drill down” for more detailed information
 “drill up” for broader, more summarised view
 “slice and dice” for a change in data dimensions
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
5
Data-Driven BIS
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
6
Data-Driven BIS
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
7
Data-Driven BIS
“Slicing” the
$225,764
cube
$201,196
$162,504
$182,500
$123,033
$72,528
$116,963
$75,048
$68,792
P
r
o
d
u
c
t
Peacock
Leverling
Fuller
Davolio
Callahan
Suyama
King
Dodsworth
Buchanan
Time
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
8
Data-Driven BIS
“Dicing” the
cube
$225,764
$201,196
$162,504
$182,500
$123,033
$72,528
$116,963
$75,048
$68,792
Q3
$23,181
Q2
$16,035
Q1
$6,858
Suyaman
King
Dodsworth
Buchanan
$22,719
Peacock
Leverling
Fuller
Davolio
Callahan
Q4
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
9
Data Warehouse
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
10
Data Warehouse
A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management’s decision




making process – Bill Inmon (1995)
Subject-oriented: focus is on subjects related to
business or organisational activity like customers,
employees, suppliers (instead of applications-oriented
(finance, marketing, production)
Integrated: data from various databases is stored in a
consistent format through use of naming conventions,
domain constraints, physical attributes and
measurements
Time-variant: associating data with specific points in
time
Nonvolatile: data does not change once it is in the
data warehouse and stored in data warehouse
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
11
Data Warehouse
Data warehouse is a copy of transaction data
specifically structured for query and analysis
– Ralph Kimball (1996)
Data warehouse is a specific database designed
and populated to provide decision support in
an organisation - Gray and Watson (1998)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
12
Data Warehouse
Data warehousing emerged as result of
 improvements in database technology – relational data
model and relational database management systems
(DBMS)
 advances in computer hardware - emergence of
affordable mass storage and parallel computer
architectures
 emergence of end-user computing, facilitated by
powerful, intuitive computer interfaces and tools
 advances in middleware products that enable
enterprise database connectivity across
heterogeneous platforms
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
13
Data Warehouse
 triggered by recognition of fundamental
differences between operational (or production)
systems and informational (or decision
support) systems
 Operational system – system that is used to run a
business in real time, based on current data – e.g.
sales order processing, reservation systems, patient
registration
 Informational systems – designed to support
decision making based on stable point-in-time or
historical data; for complex read-only queries or
data mining applications – e.g. sales trend analysis,
customer segmentation, human resources planning
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
14
Data Warehouse
Comparison of Operational and informational Systems –
McFadden, Hoffer and Prescott 1999
Characteristic Operational
Informational
Primary purpose Run the business on a
current basis
Support managerial
decision making
Type of data
Current representation
state of the business
Historical or point-in(snapshots)
Primary users
Clerks, salespersons,
administrators
Managers, business
analysts, customers
Scope of usage
Narrow vs. simple
and queries
Broad vs. complex
and analysis
Design goal
performance
Ease of access and use
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
15
Data Warehouse
Architectures
Generic two-level architecture
Source
(file)
Source
database)
Transformation
and
Integration
Data
warehouse
Source
(database)
Source
(database)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
16
Data Warehouse
Architectures
Three-level
architecture
Source
(file)
Source
(database)
Source
Transformation
and Integration
Enterprise
Data
warehouse
Data mart
database)
Source
(database)
Selection and
aggregation
Data mart
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
17
Data Warehouse
Architectures
Data mart
 a data warehouse that is limited in scope
 contains selected and summarised data to support
specific decision support applications of specific enduser group
 e.g., marketing data mart, finance data mart
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
18
Data Warehouse
Architectures
Three-layer data architecture
Derived data
Data mart
Reconciled data
Enterprise data warehouse
Enterprise data model
Operational
data
Data
mart
metada
EDW
metadata
Operational
metadata
Operational systems
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
19
Data Warehouse
Architectures
Enterprise data model
 Presents a total picture explaining the data
required by an organisation
 Must be developed prior to designing a data
warehouse
 Entity-Relationship Models – traditional
approach in relational database design
 Multidimensional Models – are commonly
used in data warehouses and data marts for
faster retrieval for querying and analysis
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
20
Data Warehouse
Architectures
Operational Data




current or transient, not historical
restricted in scope to a particular application
poor quality
not normalised (there are multi-valued attributes or
repeating groups, partial dependencies, transitive
dependencies in data relations)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
21
Data Warehouse
Architectures
Sample Operational Data from Northwind database
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
22
Data Warehouse
Architectures
Reconciled Data
 Detailed - rather than summarised
 Historical – snapshots, periodic
 Comprehensive – should reflect enterprise-wide
perspective; conform to enterprise data model
 Quality controlled
 Normalised – 3NF or higher
 3NF – no multi-valued attributes, no partial
dependencies, no transitive dependencies
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
23
Data Warehouse
Architectures
Steps in Normalisation
Table with
multi-valued
attributes
Remove multivalued attributes
1st Normal
Form
Remove partial
dependencies
2nd Normal
Form
Remove transitive
dependencies
3rd Normal
Form
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
24
Data Warehouse
Architectures
Sales relation with sample data
Relation in 3NF
Salesperson
Cust_ID
Name
Salesperson
Region
Cust_ID
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Name
Region
Salesperson
25
Data Warehouse
Architectures
Derived Data





selected, formatted, aggregated
provides ease of use for decision support applications
provides fast response for user queries
supports ad-hoc queries and data mining applications
data model commonly used is star schema
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
26
Data Warehouse
Architectures
Metadata
 data that describe the properties or characteristics of
other data
 Operational metadata – describe the data in various
operational systems (as well as external data) that feed
the EDW
 EDW metadata – describe the reconciled data layer as
well as the rules for transforming operational data to
reconciled data
 Data mart metadata – describe the data in derived data
layer and rules of transforming reconciled data to
derived data
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
27
Data Warehouse
Architectures
Sample data description
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
28
Data Warehouse
Architectures
Sample data description
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
29
Data Warehouse
Architectures
Data Reconciliation Process
 Stage 1: Initial load, when EDW is first created
 Stage 2: Subsequent updates
Steps in Data Reconciliation Process
 Capture – extract relevant data from source/s
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
30
Data Warehouse
Architectures
 Scrub – clean or upgrade the quality of raw data before
transformation and loading (using pattern recognition, artificial
intelligence techniques)
 Track and correct errors: misspelled names, erroneous
birthdates, missing data; inconsistent data formats
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
31
Data Warehouse
Architectures
 Transform - includes
 converting data format or representation from
source to target system
 partitioning data according to predefined criteria
 aggregating data from detailed to summary level
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
32
Data Warehouse
Architectures
Load and Index
 Refresh mode – filling
the EDW by bulk
rewriting of target data
 Update mode – only
changes in source data
are written to the data
warehouse; at periodic
intervals, data
warehouse is rewritten,
replacing previous
contents without
overwriting or deleting
previous contents
 Create necessary
indexes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
33
Entity-Relationship Modelling
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
34
Entity-Relationship Modelling
Customer
Type
groups
Customer
within
contains
makes
Product
Type
groups
Product
in
Sale
Region
located at
Store
within
Period
(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
35
Entity-Relationship Modelling
 Entities, attributes and relationships
 Rules of normalisation
 3NF is typical
 Protection of integrity of database by avoiding
anomalies
 Every logical thing is represented only once
 Separate consideration of logical and physical aspects
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
36
Entity-Relationship
Modelling
ER Model for the Northwind sample database
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
37
Entity-Relationship Modelling
 Large numbers of tables
 Oracle Financials - 1,800; SAP 7 up to 8,000
 Commonly used
 Feels natural once you get used to it
 Research shows that they are not easily understood
by IT people
 Especially concepts like abstraction,
generalisation, sub-types, etc.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
38
Multi-dimensional Modelling
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
39
Multi-dimensional Modelling
 It is possible to conceptualise data as multidimensional
 Difficult to design
 Easy to use resulting reports
 Advocated by Ralph Kimball (see his manifesto, and
a rebuttal, available on the web site).
 A logical design technique that seeks to present data
in a standard framework that is intuitive and allows
for high-performance access.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
40
Multi-dimensional Modelling
 An approach to database design that provides an easy
to understand and navigate database
 The aim is to encourage understanding, exploration
and learning
 Each number in a database has a set of associated
attributes
 What it measures, what point of time it was created,
what location its from, what product it’s associated
with, what promotion, etc.
 This makes the number meaningful.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
41
Multi-dimensional Modelling
 Each attribute associated with each number
represents a dimension
 Measure, time, location, product, location, etc.
 Resulting views are easy to navigate and move
around
 Slice and dice
 Report template
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
42
Multi-dimensional Modelling
Example Widget Sales ($Million)
One
Dimension
(State):
43.6
53.4
31.4
27.5
28.3
14.7
Vic
NSW
QLD
WA
SA
TAS
State
Two Dimensions
(location x time):
48.2
53.4
31.4
28.4
25.1
15.4
2002
50.1
57.2
33.6
28.1
22.5
16.3
2001
56.3
62.3
35.1
29.4
21.5
13.3
2000
46.2
52.1
29.6
25.1
27.1
18.2
1999
43.6
53.4
31.4
27.5
28.3
14.7
1998
Vic
NSW
QLD
WA
SA
TAS
Year
State
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
43
Multi-dimensional Modelling
Three Dimensions
(location x time x product):
Year
2002
48.2
53.4
31.4
28.4
25.1
15.4
2001
50.1
57.2
33.6
28.1
22.5
16.3
2000
56.3
62.3
35.1
29.4
21.5
13.3
1999
46.2
52.1
29.6
25.1
27.1
18.2
1998
43.6
53.4
31.4
27.5
28.3
14.7
Vic
NSW
QLD
WA
SA
TAS
State
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
44
Multi-dimensional Modelling
 Usually talk about information spaces as cubes, or
hyper-cubes, or n-cubes
 Resulting views of databases are easy to navigate and
move around
 Slicing and dicing
 Report Template
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
45
Multi-dimensional Modelling
Slicing and Dicing
 Select certain dimension values to examine
a set of data:
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
46
Multi-dimensional Modelling
Report Templates
 One template is produced for a set of slices
 Data changes, layout doesn’t
Location Drop Down Box
Product Sales: Victoria, 2001
50
Year Drop Down Box
40
30
20
10
0
Widgets
Sprockets
Flanges
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Gaskets
47
Multi-dimensional Modelling
From Traditional Relational to Multi-dimensional
Typical relational data-base
From Pilot Software OLAP White Paper
Same data displayed in twodimensions
Easy!
(The key is to identify the continuous
and discrete variables in the flat file.)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
48
Star Schema
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
49
Star Schema
 Used to implement dimensional analysis using
standard relational database technology
 Very common in data warehousing
 Many variations
 Two components:
 Fact Table – contains measurements of business,
eg. sales, purchase order, shipment
 Dimension Tables – stores the textual descriptions
of the dimensions of the business, eg. product,
customer, vendor, store.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
50
Star Schema
 Fact tables store the hard data
 Dimension tables store all the information
about our dimensions.
 The fact table has a many-to-one
relationship with each dimension table
 Each dimension table has a primary key
that appears as a foreign key in the fact
table, whose primary key is a
concatenation of all of the foreign keys.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
51
Star Schema
 Dimension tables in star schemas are
denormalised resulting in:
 Fewer tables
 Simpler for users to navigate
 Reduced number of complex multi-join
tables.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
52
Star schema
Customer
Customer key
Name
Customer type
Product
Product key
Product type
weight
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales
Store
Store key
Address
Region
Time
Time key
Day
Month
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Legend:
Primary Key
Foreign key
53
Snowflake schema
“Do not snowflake your
dimensions, even if very
large. If you do snowflake
your dimensions, prepare to
live with poor performance”
Kimball (1996)
Product
Type
Product
Customer
Type
Customer
Sale
Store
Region
Time
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
54
Star Schema
 Dimensions can be shared amongst fact tables.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
55
Star Schema
 ER schemas are useful for data mapping to legacy
systems and for integration of the data warehouse
 Star schemas are useful for the design of warehouse
databases as they are efficient and easy to understand
and use
 Allow relational databases to support multidimensional data cubes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
56
Star Schema
Steps in the design process
1.
2.
3.
4.
5.
Choose a business process
Choose the grain of the fact table
Too fine > Oversized database
Too large > Loss of meaningful information
Choose the dimensions
Choose the measured facts
(usually numeric, additive quantities)
Complete the dimension tables
Kimball (1996)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
57
Extra steps in the design
process
6.
7.
8.
9.
Determine strategy for slowly changing dimensions
Create aggregations and other physical storage
components
Determine the historical duration of the database
Determine the urgency with which the data is to be
extracted and loaded into the data warehouse.
Kimball (1996)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
58
An Example: Retail Trading





A large grocery store with approx. 500 stores
Each store has approx. 60,000 products on shelves
Need to maximise profit and keep shelves stocked
Important decisions concern pricing and promotion
Promotion types are:
 Temporary price reductions
 Newspaper advertisements
 Shelf and end-aisle displays
 Coupons
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
59
An Example: Retail Trading
1. Choose a Business Process
 Daily Item Movement
2. Choose the grain of the fact table
 Stock Keeping Unit (SKU) by store by promotion by
day
3. Choose the Dimensions
 Time, product, store and promotion
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
60
An Example: Retail Trading
Retail Trading Dimensions
Promotion
Promotion key
Other Promotion
attributes
Sale
Product
Product key
Other product
attributes
Time key
Product key
Store Key
Promotion Key
Facts – to be detailed
next
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
61
An Example: Retail Trading
4. Choose the measured facts
Promotion
Promotion key
Other Promotion
attributes
Sale
Product
Product key
Other product
attributes
Time key
Product key
Store Key
Promotion Key
Dollar Sales
Unit Sales
Dollar Costs
Customer Count
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
62
An Example: Retail Trading
5. Complete the dimension tables
Promotion
Promotion key
Other Promotion
attributes
Product
Product key
SKU Description
SKU Number
Package Size
Brand
Sub Category
Department
Package Type
Diet Type
Weight
Weight unit of measure
Units per retail case
Units per ship case
Cases per pallet
Sale
Time key
Product key
Store Key
Promotion Key
Dollar Sales
Unit Sales
Dollar Costs
Customer Count
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
63
References
 Inmon, W. H. (1996) Building the Data
Warehouse (2nd ed), Wiley, NY.
 Kimball, R. (1996) The Data Warehouse
Toolkit, Wiley, NY.
 McFadden, F., Hoffer, J. and Prescott, M.
(1999) Modern Database Management,
Addison-Wesley.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
64
Questions?
[email protected]
School of Information Management and Systems,
Monash University
T1.28, T Block, Caulfield Campus
9903 2735
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
65