The Data Vault – What is it? - Why do we need it?
Monday, June 26, 2012
By: Dale T. Anderson
Principal Consultant
DB Best Technologies, LLC
As mentioned in my last blog (Big Data & NoSQL Technologies), understanding and using the right
tools for the job is essential. Database systems and applications are no exception. For Big Data or
Small Data, knowing and using the right data modeling and methodology is also critical. Today,
business uses the Internet to make money, yet harvesting that data efficiently often develops into a
considerable IT challenge. The ongoing debate behind the many issues involved is not the focus of
this blog, however; the concepts presented here are.
Data Warehouse and Business Intelligence (DW/BI) Systems have become a solution of choice for
industries to address the many critical business questions ingrained in this vital data. Providing an
essential data integration process supporting a variety of reporting requirements, DW/BI Systems often
involve significant construction effort, ongoing maintenance, and reengineering when data
structures and business rules change. And they do change, presenting us with problem #1.
Building effective and efficient DW/BI Systems can quickly become a daunting and difficult technical
ordeal even for veteran engineering teams. Several integrated technologies are required, from database
systems, data transformation (ETL) tools, various programming languages, and administration and
reporting software, to high-performance networks and computers with very large storage capacities. In
many cases even the sheer volume of data can be overwhelming. Yup, problem #2.
Once reporting is delivered to the diverse user communities that consume the data, simplified usability
is expected, yet today's technological realities often require highly trained end users. The design,
creation, delivery, and support of a robust, effortless DW/BI System for intelligent use are, sadly,
complex. You guessed it: problem #3.
Furthermore, as upstream systems change, as DW/BI technology plows ahead, and as the dynamic
complexities involved prevail, we also find that every so often new data sources need to be added to the
mix. These are usually unpredicted and unplanned for. The integration impact can be enormous, often
requiring complete regeneration of the aggregated data; hence, problem #4.
So how do we solve these problems? Well …
Bill Inmon, widely regarded as the father of data warehousing, defines a data warehouse as "a subject
oriented, nonvolatile, time-variant collection of data in support of management’s decisions"
(http://en.wikipedia.org/wiki/Bill_Inmon). Ralph Kimball, a pioneering data warehousing architect,
developed the "dimensional modeling" methodology, now regarded as the de facto standard in the area
of decision support. The Dimensional Model (called a "star schema") is different from Inmon’s
"normalized modeling" (sometimes called a "snowflake schema") methodology. In Kimball’s Star
Schema, transactional data is partitioned into aggregated "facts" with referential "dimensions"
surrounding and providing descriptors that define the facts. The Normalized Model (3NF, or "third
normal form") stores data in related "tables" following the relational database design rules established
by E. F. Codd and Raymond F. Boyce in the early 1970s that eliminate data redundancy. The two
methodologies foster vigorous debate among DW/BI architects as to which is best, yet both have
weaknesses when dealing with inevitable changes in the systems feeding the data warehouse and in
cleansing data to conform to strict methodology requirements.
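To make the star-schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_date, dim_product, fact_sales, and so on) are my own illustrations, not from the original post: a central fact table holds the measures, and each row points at its surrounding dimensions by surrogate key.

```python
import sqlite3

# A tiny star schema: one fact table surrounded by descriptive
# dimension tables (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key    INTEGER PRIMARY KEY,
    full_date   TEXT,
    year        INTEGER
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
-- The fact table holds the measures; each row references its
-- dimensions by surrogate key rather than repeating their attributes.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2012-06-26', 2012)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 3, 29.97)")

# A typical star-schema query: join facts to a dimension and aggregate.
row = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchone()
print(row)  # ('Hardware', 29.97)
```

A snowflake schema would differ only in that dim_product itself would be further normalized (for example, category broken out into its own table).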
Further, the OLAP cube (for "online analytical processing") is a data structure that allows fast analysis
of data from multiple perspectives. The cube structure is created from either a Star or Snowflake
Schema stored as metadata, from which one can view or "pivot" the data in various ways. Generally,
cubes have one time-based dimension that supports a historical representation of the data. Creating
OLAP cubes can be very expensive and often generates a significant amount of data that is of little or
no use. The 80/20 rule appears in many cases to hold true (only 20% of the OLAP cube data proves
useful), which raises the question: built upon a traditional architecture, does an OLAP cube truly
deliver sufficient ROI? Often, the answer is a resounding NO! Durable DW/BI systems must deliver
real value.
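The "pivot" idea behind a cube can be sketched in a few lines of plain Python: pre-aggregate the measure along every combination of dimension values (with a wildcard member per dimension), then answer slice-and-dice questions by simple lookup. The data and dimension names here are illustrative, not from the post.

```python
from itertools import product
from collections import defaultdict

# Raw facts: (year, region, amount) -- illustrative sample data.
facts = [
    (2011, "EU", 100.0),
    (2011, "US", 250.0),
    (2012, "EU", 120.0),
    (2012, "US", 300.0),
]

# Build a tiny "cube": totals for every combination of dimension
# values, where None plays the role of the "all" member.
cube = defaultdict(float)
for year, region, amount in facts:
    for y, r in product((year, None), (region, None)):
        cube[(y, r)] += amount

# Pivoting is now a dictionary lookup:
print(cube[(2012, None)])  # 2012 total across regions -> 420.0
print(cube[(None, "EU")])  # EU total across years     -> 220.0
print(cube[(None, None)])  # grand total               -> 770.0
```

Note the cost the post is describing: even this toy cube materializes every dimension combination up front, whether or not anyone will ever query it.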
** A Different Approach **
Data Vault Modeling is a hybrid methodology providing historical data storage from multiple sources
designed to be resilient to environmental changes. Dan Linstedt, the creator of this methodology,
describes the resulting database as:
“A detail oriented, historical tracking and uniquely linked set of normalized
tables that support one or more functional areas of business. It is a hybrid approach
encompassing the best of breed between 3NF and Star Schemas. The design is
flexible, scalable, consistent and adaptable to the needs of the enterprise.”
(http://en.wikipedia.org/wiki/Data_Vault_Modeling)
One Data Vault philosophy is that all data is relevant, even if it is wrong. Data being wrong is a
business problem, not a technical one. I agree! The simple premise of the Data Vault is to ingest 100%
of the source data 100% of the time: good, bad, or ugly. Relevant in today’s world, auditability and
traceability of all the data in the data warehouse thus become a standard requirement. This data model
is architected specifically to meet the needs of today’s enterprise DW/BI Systems:
"To understand the Data Vault is to understand the business"
Focused on the business process, the Data Vault, as a data integration architecture, has robust standards
and definitional methods that unite information in order to make sense of it. The Data Vault model
comprises three basic table types:
HUB - contains a list of unique business keys, each having its own surrogate key
LNK - establishes relationships between business keys (typically hubs, but links can link to other links)
SAT - holds descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension)
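The three table types can be sketched with SQLite. This is a minimal illustration under my own naming assumptions (hub_customer, lnk_customer_order, sat_customer, and their columns are not from the post); real Data Vault designs carry more metadata, but the load date and record source on every table are what make the model auditable and traceable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- HUB: one row per unique business key, with its own surrogate key.
CREATE TABLE hub_customer (
    customer_hid  INTEGER PRIMARY KEY,
    customer_no   TEXT UNIQUE,   -- the business key
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_order (
    order_hid     INTEGER PRIMARY KEY,
    order_no      TEXT UNIQUE,
    load_date     TEXT,
    record_source TEXT
);
-- LINK: relates business keys (here, which customer placed which order).
CREATE TABLE lnk_customer_order (
    link_id       INTEGER PRIMARY KEY,
    customer_hid  INTEGER REFERENCES hub_customer(customer_hid),
    order_hid     INTEGER REFERENCES hub_order(order_hid),
    load_date     TEXT,
    record_source TEXT
);
-- SATELLITE: descriptive attributes over time; a new row per change
-- preserves history, like a Kimball Type II slowly changing dimension.
CREATE TABLE sat_customer (
    customer_hid  INTEGER REFERENCES hub_customer(customer_hid),
    name          TEXT,
    city          TEXT,
    load_date     TEXT,
    record_source TEXT
);
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'C-1001', '2012-01-01', 'CRM')")
conn.execute("INSERT INTO hub_order VALUES (1, 'O-5001', '2012-06-01', 'ERP')")
conn.execute("INSERT INTO lnk_customer_order VALUES (1, 1, 1, '2012-06-01', 'ERP')")
# Two satellite rows: the customer's city changed, and both versions
# are kept -- nothing is updated in place, everything is loaded as-is.
conn.execute("INSERT INTO sat_customer VALUES (1, 'Ada', 'Boston', '2012-01-01', 'CRM')")
conn.execute("INSERT INTO sat_customer VALUES (1, 'Ada', 'Seattle', '2012-06-01', 'CRM')")

history = conn.execute(
    "SELECT city, load_date FROM sat_customer WHERE customer_hid = 1 ORDER BY load_date"
).fetchall()
print(history)  # [('Boston', '2012-01-01'), ('Seattle', '2012-06-01')]
```

A new data source would simply add hubs, links, or satellites alongside these tables, without altering the existing ones.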
There are several advantages to the Data Vault approach:
- Simplifies the data ingestion process
- Removes the cleansing requirement of a Star Schema
- Instantly provides auditability for HIPAA and other regulations
- Puts the focus on the real problem instead of programming around it
- Easily allows for the addition of new data sources without disruption to the existing schema
Take a look at this short video showing the power of the Data Vault model. Visit www.danlinstedt.com
and http://learndatavault.com for much more on Data Vault modeling and methodology.
In conclusion, Data Vault modeling and methodology address most elements of the problems we
identified above:
- It adapts to a changing business environment
- It simplifies the DW/BI design complexities
- It increases usability by business users because it is modeled after the business domain
- It allows for new data sources to be added without impacting the existing design
Very Cool!
As the architect of TrinityIDC, a DW/BI cloud-based solution, I can say that the more I use Data Vault
modeling and methodology, the more I love to use it! Ask us at DB Best to help you design yours!