Fundamentals of Designing
a Data Warehouse
Sensible techniques for developing a data warehousing
environment which is relevant, agile, and extensible
Melissa Coates
BI Architect, SentryOne
sentryone.com
Presentation content last updated: 2/15/2017
Blog: sqlchick.com
Twitter: @sqlchick
Agenda
1. Overview of the Need for Data Warehousing
2. DW Design Principles
3. Dimension Design
4. Fact Design
5. When to Use Columnstore or Partitioning
6. DW Tips
7. SSDT ‘Database Project’ Tips
8. Planning Future Growth of the DW

All syntax shown is from SQL Server 2016. Screen shots are from SQL Server Data Tools in Visual Studio 2015.
Out of Scope
 ETL patterns and techniques
 Source control
 Deployment practices
 Master data management
 Data quality techniques
 Semantic layer, OLAP, cubes
 Front-end reporting
 Security
 Tuning & monitoring
 Automation techniques
Overview of the
Need for Data Warehousing
First Let’s Get This Straight…
Data Warehousing is not dead!
Data warehousing can seem “uncool,” but it doesn’t have to be if you adopt modern data warehousing concepts & technologies such as:
 Data lake
 Data virtualization
 Hadoop
 Hybrid & cloud
 Real-time
 Automation
 Large data volume
 Bimodal environments
Transaction System vs. Data Warehouse

OLTP
 Goal: operational transactions (“writes”)
 Scope: one database system
 Example objectives: process a customer order; generate an invoice

Data Warehouse
 Goal: informational and analytical (“reads”)
 Scope: integrate data from multiple systems
 Example objectives: identify lowest-selling products; analyze margin per customer
DW+BI Systems Used to Be Fairly Straightforward
[Diagram: organizational data (sales, inventory, etc.), third party data, and master data flow through batch ETL into an enterprise data warehouse with data marts and an OLAP semantic layer; an operational data store supports operational reporting, and the warehouse supports historical/analytical reporting in the reporting tool of choice.]
DW+BI Systems Have Grown in Complexity
[Diagram: the traditional batch ETL flow (organizational data, third party data, demographics data, and master data → operational data store and enterprise data warehouse → data marts, OLAP semantic layer, operational reporting, and historical/analytical reporting) is now joined by streaming data from devices & sensors and social media feeding a data lake (raw data, curated data, analytics sandbox, active archive) and near-real-time monitoring; plus Hadoop, machine learning, data science, advanced analytics, mobile, and self-service reports & models.]
Data Warehouse
Design Principles
3 Primary Architectural Areas
 Data Acquisition
 Data Storage (enterprise data warehouse, OLAP semantic layer)
 Data Delivery (reporting tool of choice)
Integrate Data from Multiple Sources
Objective: data is inherently more valuable once it is integrated from the source systems into the enterprise data warehouse.
Example – full view of a customer:
o Sales activity +
o Delinquent invoices +
o Support/help requests
Use of Staging Environment
[Diagram: source systems → staging → transformations → star schema, all within the enterprise data warehouse.]
Staging objectives:
 Reduce load on source system
 No changes to source format
 A “kitchen area”
 Snapshot of source data for troubleshooting
New trend: use of a data lake as the DW staging environment.
Usage of a Star Schema
Dimension table: provides the descriptive context – attributes with the who, what, when, why, or how.
Fact table: contains the numeric, quantitative data (aka measures).
Benefits of a Star Schema
Optimal for known reporting scenarios.
Denormalized structure, organized around business logic, is good for performance & consistency.
Decoupled from source systems: surrogate keys have no intrinsic meaning.
Usability:
 Stable, predictable environment
 Fewer joins, easier navigation
 Friendly, recognizable names
 History retention
 Integrates multiple systems
Challenges of a Star Schema
Requires up-front analysis (“schema on write”).
Difficult to handle new, unpredictable, or exploratory scenarios.
Increasing volumes of data.
Shrinking windows of time for data loads (near real-time is challenging).
Data quality issues are often surfaced in the reporting layer.
Not practical to contain *all* of the data all the time.
Declare Grain of Each Table
Store the Lowest Level Detail You Have
Drill-down behavior:

Sales Totals
US Customers        $1,000
European Customers  $  750

Sales Totals – US Customers
East Region   $  200
West Region   $  800
Total         $1,000

Sales Totals – East Region
Customer A    $ 25
Customer B    $ 75
Customer C    $100
Total         $200

Sales Detail – Customer C
Invoice 123   $10
Invoice 456   $10
Invoice 789   $ 5
Total         $25

Note: you may be forced to only store aggregated data for extremely high data volumes. Or, you may choose an alternative technology (like a data lake, a NoSQL database, or Hadoop).
Dimension Design
Dimension Tables
Dimension tables provide the descriptive context – attributes with the who, what, when, why, or how. They should always include friendly names & descriptions.

Dimension tables can contain:

Type of Column in a Dim | Example
Attributes | Customer Name
Non-additive numeric value | Customer Value to Acquisition Cost Ratio
Numeric value used *only* for filtering or grouping (usually accompanied by a “band of ranges”) | Customer Satisfaction % → Customer Satisfaction Range (90%–100%; 80%–89%; Less than 80%)

Dimension tables should *not* contain aggregatable numeric values (measures).
Types of Dimension Tables
Most common types of dimensions:

Type of Dim Table | Description
Type 0 | Values cannot change (ex: DimDate).
Type 1 | Any value which changes is overwritten; no history is preserved.
Type 2 (aka Slowly Changing Dimension) | Certain important values which change generate a new row which is effective-dated. (Not all columns should be type 2 – certain columns can be type 1.)
Type 6 | Hybrid of type 1 and 2 which includes a new column for the important values, as well as a new row.

Types 3, 4, 5, and 7 do exist, but are less commonly utilized.
Type 1 Dimension
Original data:
CustomerSK | CustomerNK | CustomerName | AuditRowUpdateDate
1 | ABC | Brian Jones | 6-4-2014
2 | DEF | Sally Baker | 10-1-2015

Change to Customer Name occurs.

Updated data:
CustomerSK | CustomerNK | CustomerName | AuditRowUpdateDate
1 | ABC | Brian Jones | 6-4-2014
2 | DEF | Sally Walsh | 12-2-2016
Type 2 Dimension
Original data:
CustomerSK | CustomerNK | CustomerName | AuditRowEffectiveDate | AuditRowExpiredDate | AuditRowIsCurrent
1 | ABC | Brian Jones | 6-4-2014 | 12-31-9999 | 1
2 | DEF | Sally Baker | 10-1-2015 | 12-31-9999 | 1

Change to Customer Name occurs.

Updated data:
CustomerSK | CustomerNK | CustomerName | AuditRowEffectiveDate | AuditRowExpiredDate | AuditRowIsCurrent
1 | ABC | Brian Jones | 6-4-2014 | 12-31-9999 | 1
2 | DEF | Sally Baker | 10-1-2015 | 12-2-2016 | 0
3 | DEF | Sally Walsh | 12-3-2016 | 12-31-9999 | 1
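When a tracked column changes, the ETL typically expires the current row and inserts a new effective-dated row. A minimal T-SQL sketch of that pattern, using this example's columns (the @-variables are illustrative assumptions, not the deck's actual ETL code):

```sql
-- Expire the current row for the changed natural key
UPDATE dbo.DimCustomer
SET    AuditRowExpiredDate = DATEADD(DAY, -1, @EffectiveDate),
       AuditRowIsCurrent   = 0
WHERE  CustomerNK = @CustomerNK
  AND  AuditRowIsCurrent = 1;

-- Insert the new row, effective-dated and open-ended
INSERT INTO dbo.DimCustomer
    (CustomerNK, CustomerName, AuditRowEffectiveDate, AuditRowExpiredDate, AuditRowIsCurrent)
VALUES
    (@CustomerNK, @NewCustomerName, @EffectiveDate, '9999-12-31', 1);
```

In practice this pair of statements would run inside a transaction per changed natural key, or be expressed as a set-based MERGE.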
Type 6 Dimension
Original data:
CustomerSK | CustomerNK | CustomerName | CustomerNameCurrent | AuditRowEffectiveDate | AuditRowExpiredDate | AuditRowIsCurrent
1 | ABC | Brian Jones | Brian Jones | 6-4-2014 | 12-31-9999 | 1
2 | DEF | Sally Baker | Sally Baker | 10-1-2015 | 12-31-9999 | 1

Change to Customer Name occurs.

Updated data:
CustomerSK | CustomerNK | CustomerName | CustomerNameCurrent | AuditRowEffectiveDate | AuditRowExpiredDate | AuditRowIsCurrent
1 | ABC | Brian Jones | Brian Jones | 6-4-2014 | 12-31-9999 | 1
2 | DEF | Sally Baker | Sally Walsh | 10-1-2015 | 12-2-2016 | 0
3 | DEF | Sally Walsh | Sally Walsh | 12-3-2016 | 12-31-9999 | 1
Conformed Dimension
A conformed dimension reuses the same dimension across numerous fact tables: critical for unifying data from various sources.
Conformed dimensions provide significant value with ‘drill across’ functionality, and provide a consistent user experience.
[Diagram: DimCustomer is shared by FactSalesInvoice, FactAccountsReceivable, and FactCustomerSupportRequest, each of which also joins to its own additional dimensions.]
Role-Playing Dimension
A role-playing dimension utilizes the same conformed dimension. The objective is to avoid creating multiple physical copies of the same dimension table.

FactSalesInvoice: DateSK_InvoiceDate, DateSK_PaymentDueDate, SalesAmount, …
DimDate: DateSK, Date, Month, Quarter, Year, …

SELECT
     FSI.SalesAmount
    ,InvoiceDate = DtInv.Date
    ,PymtDueDate = DtDue.Date
FROM FactSalesInvoice AS FSI
INNER JOIN DimDate AS DtInv
    ON FSI.DateSK_InvoiceDate = DtInv.DateSK
INNER JOIN DimDate AS DtDue
    ON FSI.DateSK_PaymentDueDate = DtDue.DateSK
Hierarchies
Hierarchies are extremely useful for handling rollups, and for drill-down & drill-through behavior.
Date Hierarchy: Year → Quarter → Month → Day
Geography Hierarchy: Country → State or Province → City → Address
Dimension Design
Inline syntax format works in the SSDT
database project which requires
“declarative development.”
No alters beneath the create.
Dimension Design
Golden rule: a
column exists in one
and only one place
in the DW.
Remove the Dim or Fact prefix
from user access layers.
Dimension Design
Use the smallest data types you
can without risk of overflow
Use a naming
convention to easily
identify surrogate
keys & natural keys
Make careful
decisions on the
use of varchar
vs. nvarchar
Dimension Design
Avoid numeric data types for non-aggregatable columns such as
Customer Number.
Also useful for retaining leading 0s
or for international zip codes.
Alternatively,
could be
converted in a
view or semantic
layer. Objective is
to avoid reporting
tools trying to
sum.
Dimension Design
Default constraints are present
for non-nullable columns.
In a DW, defaults are optional
if ETL strictly controls all data
management. *Don’t let SQL
Server auto-name constraints.
Avoid ‘Or Is Null’
issues for attributes
which are commonly
used in predicates.
Dimension Design
When designing a Type 2
(or 6) dimension, only
choose the most important
columns to generate a new
row when it changes.
A ‘Current’ column (which is the same
across all rows in a Type 6 dimension) is
helpful for columns commonly used in
reporting so all history shows the
newest value.
Dimension Design
Could also be derived in views
or semantic layer. Or, computed
columns could be used.
Optionally, can store variations of
concatenated columns such as:
Name (Number)
Number - Name
Description (Code)
Code - Description
Dimension Design
Additional columns if the
Type 2 historical change
tracking is occurring.
Standard audit
columns.
The ‘Audit’ prefix
makes it clear they
are generated in the
DW not the source.
Dimension Design
All key & index suggestions
are merely a starting point. As
your DW grows, you might
have to refine your strategy
depending on ETL.
Primary key based on the surrogate key.
This is also our clustered index.
Dimension Design
The unique constraint
implicitly creates a unique
index as well, which will assist
with lookup operations.
Unique constraint, based on natural keys, defines the
“grain” of the table. It also helps identify data quality
issues & is very helpful to the SQL Server query optimizer.
Dimension Design
Use of non-Primary filegroups.
Ex: Dimensions, Facts,
Staging, Other.
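Pulling these Dimension Design notes together, a dimension table definition might look roughly like this. This is an illustrative sketch only, not the table from the screenshots; the column sizes, constraint names, and the [Dimensions] filegroup are assumptions:

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerSK            int IDENTITY(1,1) NOT NULL,   -- surrogate key, no intrinsic meaning
    CustomerNK            varchar(20)       NOT NULL,   -- natural key from the source system
    CustomerName          varchar(100)      NOT NULL
        CONSTRAINT DF_DimCustomer_CustomerName DEFAULT ('Unknown'),  -- named, not auto-named
    -- 'Audit' prefix makes clear these are generated in the DW, not the source
    AuditRowEffectiveDate date              NOT NULL,
    AuditRowExpiredDate   date              NOT NULL,
    AuditRowIsCurrent     bit               NOT NULL,
    -- Primary key on the surrogate key; also the clustered index
    CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerSK),
    -- Unique constraint on the natural key (plus effective date for Type 2 rows)
    -- defines the grain and implicitly creates a unique index
    CONSTRAINT UC_DimCustomer_NK UNIQUE (CustomerNK, AuditRowEffectiveDate)
) ON [Dimensions];   -- non-Primary filegroup
```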
Fact Design
Fact Tables
Fact tables contain the numeric, quantitative data (aka measures).
Typically one fact table per distinct business process.
Exception: “consolidated” facts (aka “merged” facts) such as actual vs. forecast, which require the same granularity and are frequently analyzed together.

Fact tables can contain:

Type of Column in a Fact | Example
Measures | Sales Amount
Foreign keys to dimension tables | 3392 (meaningless integer surrogate key)
Degenerate dimension | Order Number
Types of Fact Tables
Most common types of facts:

Type of Fact Table | Description | Example
Transaction Fact | An event at a point in time | FactSalesInvoice
Periodic Snapshot Fact | Summary at a point in time | FactARBalanceDaily
Accumulating Snapshot Fact | Summary across the lifetime of an event | FactStudentApplication
Timespan Tracking Fact | Effective-dated rows | FactCapitalAssetBalance

Other facts:

Type of Fact Table | Description | Example
Factless Fact Table | Recording when an event did not occur | FactPromotionNoSales
Aggregate Facts | Rollups, usually to improve reporting speed | FactSalesInvoiceSummary
Fact Design
Even if all of the SKs are the
same, avoid combining fact
tables for unrelated business
processes.
One fact table per distinct
business process.
Fact Design
The combination of SKs
might dictate the grain of the
fact table, but it may not.
Fact Design
Some data modelers prefer the
unknown member row to have its key
assigned randomly.
Default equates to the ‘unknown member’ row.
Fact Design
Optionally can use two types of
Date defaults: one in the past,
one in the future. Helps with
‘Less than’ or ‘Greater than’
predicates.
It’s also fine for a date SK to be an actual date
datatype instead of an integer.
Fact Design
Having a PK in a fact is personal
preference. Usually you don’t want
a clustered index on it though.
Foreign key constraints mitigate
referential integrity issues.
Fact Design
Measures are sparse,
therefore nullable.
0s are not stored except in a
factless fact table.
Fact Design
Natural key in a fact violates Kimball rules.
However, they are helpful for:
(1) Re-assigning SK if a lookup issue occurred
and an unknown member got assigned.
(2) Allows unique constraint on the NKs for
ensuring data integrity.
**Never (ever!) let the NKs be exposed or used
for anything besides ETL. And only create
minimum # of NKs to identify the row.**
Fact Design
The unique constraint
implicitly creates a unique
index as well, which will assist
with lookup operations.
Unique constraint, based on natural keys,
defines the “grain” of the table & helps
identify data quality issues.
Fact Design
The clustered index is
usually on a date.
Compression set on the
clustered index rather
than the table.
Fact Design
Nonclustered index on each
surrogate key. Useful for
smaller fact tables (which
don’t justify a clustered
columnstore index).
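The Fact Design notes above might combine into something like the following. Again, an illustrative sketch under assumptions (column names, sizes, the -1 unknown-member key, and the [Facts] filegroup are not from the deck; it also assumes DimDate and DimCustomer exist with unique keys on DateSK and CustomerSK):

```sql
CREATE TABLE dbo.FactSalesInvoice
(
    DateSK_InvoiceDate int          NOT NULL
        CONSTRAINT DF_FactSalesInvoice_InvoiceDate DEFAULT (-1),  -- unknown member row
    CustomerSK         int          NOT NULL
        CONSTRAINT DF_FactSalesInvoice_CustomerSK DEFAULT (-1),
    InvoiceNumberNK    varchar(20)  NOT NULL,     -- natural key: ETL use only, never exposed
    SalesAmount        decimal(18,2)    NULL,     -- measures are sparse, therefore nullable
    -- Foreign key constraints mitigate referential integrity issues
    CONSTRAINT FK_FactSalesInvoice_DimDate
        FOREIGN KEY (DateSK_InvoiceDate) REFERENCES dbo.DimDate (DateSK),
    CONSTRAINT FK_FactSalesInvoice_DimCustomer
        FOREIGN KEY (CustomerSK) REFERENCES dbo.DimCustomer (CustomerSK),
    -- Unique constraint on the minimum natural keys defines the grain
    CONSTRAINT UC_FactSalesInvoice_NK UNIQUE (InvoiceNumberNK)
) ON [Facts];

-- Clustered index is usually on a date; compression set on the index, not the table
CREATE CLUSTERED INDEX CIX_FactSalesInvoice
    ON dbo.FactSalesInvoice (DateSK_InvoiceDate)
    WITH (DATA_COMPRESSION = PAGE);

-- Nonclustered index on each surrogate key (smaller fact tables)
CREATE NONCLUSTERED INDEX IX_FactSalesInvoice_CustomerSK
    ON dbo.FactSalesInvoice (CustomerSK);
```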
When to Use
Columnstore Indexes or Partitioning
Handling Larger Fact Tables

Clustered Columnstore Index – useful for:
 Reducing data storage due to compression of redundant values
 Improving query times for large datasets
 Improving query times due to reduced I/O (ex: column elimination)

Table Partitioning – useful for:
 Improving data load times due to partition switching
 Flexibility for maintenance on larger tables
 Improving query performance (possibly) due to parallelism & partition elimination behavior
Clustered Columnstore Index
[Diagram (simplified & conceptual): rowstore vs. columnstore storage; columnstore yields reduced storage for low-cardinality columns.]
Clustered Columnstore Index
CCI most suitable for:
 Tables over 1 million rows
 Data structured in a denormalized star schema format (DW not OLTP)
 Support for analytical query workloads which scan a large number of rows and retrieve few columns
 Data which is not frequently updated (‘cold’ data not ‘hot’)
 Can selectively be used on insert-oriented workloads (ex: IoT)
(A nonclustered columnstore index targets analytical queries on an OLTP system rather than a data warehouse.)
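In SQL Server 2016 a clustered columnstore index can be created with a single statement. A sketch (the table name is illustrative, and the table must be a heap or have its rowstore clustered index dropped or converted first):

```sql
-- Converts the table's storage to columnstore format;
-- the index covers all columns, so no column list is given
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSalesInvoice
    ON dbo.FactSalesInvoice;
```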
Partitioned Table
Useful for:
 Speeding up ETL processes
 Large datasets (50GB+)
 Small maintenance windows
 Use of a sliding window
 Older (cold) data on cheaper storage
 Historical data on read-only filegroup
 Speeding up queries (possibly)
 Partition elimination
 Parallelism
 Storage of partitions on separate drives (filegroups)
[Diagram: Table A split into Partition 1 (current data, Filegroup 1, high-end storage), Partition 2 (current-1 data, Filegroup 2), and Partition 3 (current-2 data, Filegroup 3, slower storage).]
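A partitioned table is built on a partition function and a partition scheme. A hedged sketch of the idea (the boundary dates and filegroup names are assumptions, not from the deck):

```sql
-- Partition function: boundary points split rows by year
CREATE PARTITION FUNCTION pfSalesByYear (date)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01', '2017-01-01');

-- Partition scheme: maps the four resulting partitions to filegroups,
-- e.g. older (cold) data on cheaper storage
CREATE PARTITION SCHEME psSalesByYear
    AS PARTITION pfSalesByYear
    TO ([FG_Archive], [FG_Archive], [FG_Current], [FG_Current]);

-- The fact table is then created ON psSalesByYear(InvoiceDate)
-- instead of on a single filegroup
```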
Partitioned View
Useful for:
 Query performance (similar to a partitioned table)
 Sharing of a single table (“partition”) across multiple views
 Displaying info from more than one database or server (via a linked server)
Requires CHECK constraints on the underlying tables (usually on a date column).
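A sketch of the pattern (table and column names are illustrative, and it assumes a matching dbo.Sales2015 table already exists):

```sql
-- Each underlying table carries a CHECK constraint on the date column,
-- which lets the optimizer skip tables that cannot match a predicate
CREATE TABLE dbo.Sales2016
(
    InvoiceDate date NOT NULL
        CONSTRAINT CK_Sales2016_Date
        CHECK (InvoiceDate >= '2016-01-01' AND InvoiceDate < '2017-01-01'),
    SalesAmount decimal(18,2) NULL
);
GO
-- The partitioned view unions the per-year tables
CREATE VIEW dbo.SalesAllYears
AS
SELECT InvoiceDate, SalesAmount FROM dbo.Sales2015
UNION ALL
SELECT InvoiceDate, SalesAmount FROM dbo.Sales2016;
```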
Data Warehouse Tips
Handling Many-to-Many Scenarios
Classic many-to-many scenarios:
 A sales order is for many products, and a product is on many
sales orders
 A customer has multiple bank accounts, and a bank account
belongs to multiple customers
[Diagram: DimCustomer and DimAccount related through a BridgeCustomerAccount table.]
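A bridge table is simply a table of key pairs sitting between the two dimensions. A sketch under assumptions (names are illustrative; it assumes DimCustomer and DimAccount exist with unique surrogate keys):

```sql
-- Bridge table resolving the many-to-many relationship:
-- one row per valid customer/account combination
CREATE TABLE dbo.BridgeCustomerAccount
(
    CustomerSK int NOT NULL
        CONSTRAINT FK_BridgeCustomerAccount_DimCustomer
        REFERENCES dbo.DimCustomer (CustomerSK),
    AccountSK  int NOT NULL
        CONSTRAINT FK_BridgeCustomerAccount_DimAccount
        REFERENCES dbo.DimAccount (AccountSK),
    CONSTRAINT PK_BridgeCustomerAccount PRIMARY KEY CLUSTERED (CustomerSK, AccountSK)
);
```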
Ways to Track History in a DW
Most common options for tracking history:
1. Slowly changing dimension
2. Fact snapshot tables
3. Timespan tracking fact
New option in SQL Server 2016:
4. Temporal tables  Not a full replacement for slowly changing dimensions, but definitely useful for auditing
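For reference, a system-versioned temporal table in SQL Server 2016 looks like this (the table is an illustrative example, not one from the deck):

```sql
-- SQL Server maintains the ValidFrom/ValidTo period columns and
-- writes superseded rows to the history table automatically
CREATE TABLE dbo.LkpProductCategory
(
    ProductCategoryID int         NOT NULL PRIMARY KEY CLUSTERED,
    CategoryName      varchar(50) NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.LkpProductCategoryHistory));
```

Point-in-time queries then use `FOR SYSTEM_TIME AS OF`, which is what makes the feature useful for auditing.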
“Smart Dates” vs. “Dumb Dates” in a DW
A “dumb date” is just an attribute:
DimCustomer: CustomerSK, CustomerNK, CustomerAcquisitionDate, …
A “smart date” relates to a full-fledged Date dimension which allows significant time analysis capabilities:
FactCustomerMetrics: CustomerSK, DateSK_CustomerAcquisition, … (related to both DimCustomer and DimDate)
Handling of Nulls in Dimensions
Rule of thumb is to avoid nulls in attribute columns. Remember the NOT NULL and default constraints.
What happens with this:
SELECT CustomerType WHERE CustomerType <> ‘Retail’
Too easy to forget:
SELECT CustomerType WHERE CustomerType <> ‘Retail’ OR CustomerType IS NULL
Handling of Nulls in Facts
Best practice is to avoid nulls in foreign keys. (However, nulls are ok for
a measure.)
By using an ‘unknown member’ relationship to the dimension, you can:
 Safely do inner joins
 Allow the fact record to be inserted & meet referential integrity
 Allow the fact record to be inserted which avoids understating
measurement amounts
Ex: Just because one key is unknown, such as an EmployeeSK for who
rang up the sale, should the sale not be counted?
Views Customized for Different Purposes
Recap of Important DW Design Principles
Staging as a “kitchen” area
Integrate data from multiple systems to increase its value
Denormalize the data into a star schema
A column exists in one and only one place in the star schema
Avoid snowflake design most of the time
Use surrogate keys which are independent from source systems
Use conformed dimensions
Know the grain of every table
Have a strategy for handling changes, and for storage of history
Store the lowest level of detail that you can
Use an ‘unknown member’ to avoid understating facts
Transform the data, but don’t “fix” it in the DW
Structure your dimensional model around business processes
Recap of Important DW Design Principles
Design facts around a single business event
Always use friendly names & descriptions
Use an explicit date dimension in a “role-playing” way
Utilize bridge tables to handle many-to-many scenarios
Plan for complexities such as:
Header/line data
Semi-additive facts
Multiple currencies
Multiple units of measure
Alternate hierarchies and calculations per business units
Allocation of measures in a snowflake design
Reporting of what didn’t occur (factless facts)
Dimensional only analysis
SSDT “Database Project” Tips
Database Project Format
This project is
organized by:
1 – Schema
(or Category)
2 – Object Type
3 – Object
Building the Database Project
Build frequently
to verify no
errors or
missing
references
Nearly all
objects should
be set to Build
Database Design
 Pre-sized files
 Auto-grow allowed in sizeable increments (just in case)
 Separate data & log drives (data & log located on separate disks)
Unknown Member Row
The SK referenced in a fact table if the real value is unknown or does not exist.
Identity_Insert does require elevated permissions.
Build action = none since this is DML.
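The seed script for an unknown member row might look like this. An illustrative sketch (it assumes a DimCustomer table with an IDENTITY surrogate key, and the -1 key, names, and sentinel dates are conventions, not values from the deck):

```sql
-- Explicitly assign the reserved surrogate key; requires elevated permissions
SET IDENTITY_INSERT dbo.DimCustomer ON;

INSERT INTO dbo.DimCustomer
    (CustomerSK, CustomerNK, CustomerName,
     AuditRowEffectiveDate, AuditRowExpiredDate, AuditRowIsCurrent)
VALUES
    (-1, 'UNKNOWN', 'Unknown',
     '1900-01-01', '9999-12-31', 1);   -- past and future sentinel dates

SET IDENTITY_INSERT dbo.DimCustomer OFF;
```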
Manually Maintained Data
Maintain a DML script in a Lookup (LKP) table instead of hard-coding
in the ETL.
Build action =
none since this is
DML
Schema Compare
Settings to exclude permissions,
users, etc + options to ignore
Saved settings
Schema Compare Options
Project Properties
Option to
generate error
during build
Schema Compare
Usually
don’t
want to
let the
target
update
directly
Generates a script to use
for deployment
Data Compare
Basic functionality to
compare data between
two tables -- schema
must match.
Project Snapshot
Snapshot of the database
schema at a point in time
(ex: major release points).
Store the
.dacpac file in
the project if
desired
Planning Future Growth
of the Data Warehouse
Modern DW/BI/Analytics Systems
[Diagram: the modern multi-platform architecture shown earlier – batch ETL plus streaming data, data lake, Hadoop, machine learning, advanced analytics, and self-service alongside the enterprise data warehouse, data marts, OLAP semantic layer, and reporting.]
Growing your DW/BI/Analytics Environment
 Cloud & hybrid platforms
 Real-time reporting
 Modern DW multi-platform architecture
 Advanced analytics
 Self-service BI
 Agile, nimble solutions
Achieving Extensibility in a DW
Design with change in mind. Ex: Create a lookup table with
code/descriptions, or implement in a view, rather than hard-coding in ETL.
Plan for a hybrid environment with multiple architectures.
Introduce conformed dimensions first whenever possible.
Try to avoid isolated “stovepipe” implementations unless the isolation
is absolutely intended.
Conduct active prototyping sessions with business users to flesh out
requirements. A data modeling tool like Power BI works well for this.
Achieving Extensibility in a DW
Be prepared to do some refactoring along the way. Ex: converting an
attribute to be a conformed dimension.
First implementation: FactSalesInvoice relates to DimCustomer (CustomerName, CustomerRegion, …).
Updated in a later iteration: region becomes its own conformed dimension, DimRegion, shared by FactSalesInvoice and FactWarrantyRequest alongside DimCustomer.
Achieving Extensibility in a DW
Introducing new measures:
• Can be a new column in a fact table as long as it’s the same grain & the
same business process
Introducing new attributes:
• Can be a new column in a dimension, or
• Can be via a new foreign key in a fact table as long as it doesn’t affect
the grain
Agility for the things that usually require the most time investment:
• Data modeling
• ETL processes
• Data quality
Achieving Extensibility in a DW
[Diagram: trade-off between speed of change implemented and reusability downstream across the DW, OLAP, and report layers.]
Consider using an OLAP cube or in-memory model (like Analysis
Services) for:
• Summary data (as opposed to summary tables in your DW)
• Year-to-Date type of calculations
• Year-over-Year type of calculations
• Aggregate level calculations (as opposed to row-by-row calculations)
Modern DW: Important Concepts to Know
Polyglot Persistence: using the most effective data storage technology to handle different data storage needs.
Lambda Architecture: data processing architecture which supports large amounts of data via a speed layer, batch layer, and serving layer.
Schema on Read: data structure is applied at query time rather than when the data is initially stored.
Recommended Resources
[Images of two recommended books: one to read first, one to read second.]
Thank You for Attending
To download a copy of this presentation:
SQLChick.com “Presentations & Downloads” page
Melissa Coates
BI Architect, SentryOne
sentryone.com
Creative Commons License:
Attribution-NonCommercial-NoDerivative Works 3.0
Blog: sqlchick.com
Twitter: @sqlchick