#datavault model and understanding, back to basics

I’ve been considering many of these questions over the years, and would now like to share a few thoughts about the basic nature of the Data Vault Model; I hope to share additional thoughts about the methodology in future posts.  In this post I will return to basics and try to explain what the Data Vault model really is, and what I see as the business value of the DV model itself.  I will even step into a specific modeling case study to pull it apart and examine the differences at a technical level.

What is the DV model?

The DV model is a hybrid approach.  It takes bits and pieces of normalized form models (2nd, 3rd, and 4th normal forms) and bits and pieces of dimensional modeling, and combines the two.

Why?

Because there are great benefits to each modeling architecture, but also drawbacks when it comes to an enterprise level data warehouse view.

The Data Vault model, quite simply put, is really nothing new or spectacular; it is, however, unique in the way the tables are defined and interconnected.  In other words, if you already know normalized form modeling, dimensional modeling, or both, then you can easily learn the DV model.

The Data Vault model is not a silver bullet.  No modeling architecture is, was, or ever will be.  Testament to this fact is the simple notion that we have so many different options today that meet different needs: Key=Value pair modeling, XML document modeling, Normalized Form Modeling, Graph Modeling, Node Modeling, Object Modeling, Dimensional Modeling, and so on.

What was I hoping to accomplish / trying to solve with the DV model?

One thing and one thing only: Enterprise Data Warehouse as a back-office enterprise data store.

Then came the rules or restrictions: reliability, accountability, flexibility, and so on.

How do I define an Enterprise Data Warehouse?

Well, it’s a physical component in an architecture that serves as historical storage for raw data interconnected by business keys.  It is also (in my opinion) good to have it separated from the delivery layers, so that the delivery layers can change without impacting sourcing & storage, and the sourcing and storage layers can change without impacting the delivery layers.

In other words, the Raw EDW is de-coupled from both sides making it highly resilient to cascading damage resulting from changes.

In contrast: Dr. Kimball defines the “Data Warehouse” as a virtual thing, comprised of both the staging area and the delivery layers to the business.  This makes it difficult (if not impossible) to pin down the actual data model to be used as a data warehouse, because one type of modeling fits the staging area and another (dimensional) fits the delivery layers.

In contrast: Bill Inmon defines the “Data Warehouse” as a physical thing, but then doesn’t press you into any specific data model to use.  It just happened that 3rd normal form (or normalized form modeling) was the only thing available when he began discussing “data warehouses”, so that’s what people latched on to and used.

Is there a “perfect data warehouse solution?”

No.  There never will be.  As I stated earlier, there are no silver bullets.  Everything is a progression in our industry. Everything is an evolution – and unless we all innovate we will not move forward.  The Data Vault is only one of many innovations to come along in the data warehousing space in recent years.

So what is it targeted at solving?  Why use the Data Vault?

It is targeted at solving the following (admittedly, some of the items listed here are technical issues):

  • Links specifically are many-to-many tables.  Taken from 3rd normal form, they perform exactly the same task as the association table does in 3rd normal form: isolation and protection from definitional changes to one or more source system relationships.  For example, today’s business defines “1 account manager for many accounts”.  If we model that relationship and enforce it in the EDW, it will break when we try to load history, because 5 years ago the business had auditable data that said: “1 account had 4 account managers”.
    IF we model this into our EDW (with a parent/child relationship or a dimensional hierarchy) THEN we put the model and the ETL at risk!  At risk of having to be re-engineered every time the relationship changes, a new source system is brought in, or we try to load history that says differently.
    By having a Link in place, we can absorb all the data without any impact, without re-engineering our ETL, and without re-building our data model.  This is where some of the agility (the ability to respond to users’ requests for changes) comes to the table.
    So again, both dimensional models (with hierarchies built in) and 3rd normal form-ish models suffer from this fate of re-engineering.

So before I move on, suppose I have this dimension (pseudo-code):

CREATE TABLE ACCOUNT_DIM (
   acct_sequence        INTEGER,       -- surrogate key
   acct_begin_date      DATE,
   acct_end_date        DATE,
   acct_number          VARCHAR(20),   -- account business key
   acct_description     VARCHAR(100),
   acct_customer_name   VARCHAR(100),
   account_manager_id   VARCHAR(20),   -- embedded account manager business key
   account_manager_name VARCHAR(100)
)

Clearly we have modeled the relationship (1 account manager for many accounts) into the model.  When we run into historical data, or future data, where this business rule no longer holds true, what do we have to do to our dimensional model to make the data fit?

Right, we have to re-engineer it.  Change the model, re-cast the history (wait, re-cast the history?) – yes, re-load it in some cases – and change ALL the ETL routines that load the table.
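
As a rough sketch, and purely for illustration (the exact steps depend on the platform and the ETL tool), that re-engineering might look something like this:

-- Illustrative only: pull the embedded manager columns out of the dimension...
ALTER TABLE ACCOUNT_DIM DROP COLUMN account_manager_id;
ALTER TABLE ACCOUNT_DIM DROP COLUMN account_manager_name;
-- ...then build a new structure to hold the account-to-manager relationship over
-- time, re-cast / re-load the history into it, and change every ETL routine that
-- loads (or reads) ACCOUNT_DIM.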

Well then, the next step to mitigate this modeling problem came along: snowflaking.  But does that really solve the problem?  The tables might now look like this:

CREATE TABLE ACCOUNT_DIM (
   acct_sequence         INTEGER,       -- surrogate key
   acct_begin_date       DATE,
   acct_end_date         DATE,
   acct_number           VARCHAR(20),   -- account business key
   acct_description      VARCHAR(100),
   acct_customer_name    VARCHAR(100),
   acct_manager_sequence INTEGER        -- reference to ACCT_MGR_DIM: still only 1 manager per account
)

CREATE TABLE ACCT_MGR_DIM (
   acct_manager_sequence   INTEGER,      -- surrogate key
   acct_manager_begin_date DATE,
   acct_manager_end_date   DATE,
   account_manager_id      VARCHAR(20),  -- account manager business key
   account_manager_name    VARCHAR(100)
)

No.  If you look closely, the problem still exists; the re-engineering will happen regardless of the snowflaking.  Why?  Because the model still enforces a parent-child (1-to-many) relationship, where there can still be only 1 account manager for every account.

Ok, IS there a dimensional model that solves this problem?  Yes, there is, and it might look like this:

CREATE TABLE ACCOUNT_DIM (
   acct_sequence      INTEGER,       -- surrogate key
   acct_begin_date    DATE,
   acct_end_date      DATE,
   acct_number        VARCHAR(20),   -- account business key
   acct_description   VARCHAR(100),
   acct_customer_name VARCHAR(100)
)

CREATE TABLE ACCT_MGR_DIM (
   acct_manager_sequence   INTEGER,      -- surrogate key
   acct_manager_begin_date DATE,
   acct_manager_end_date   DATE,
   account_manager_id      VARCHAR(20),  -- account manager business key
   account_manager_name    VARCHAR(100)
)

CREATE TABLE ACCT_MGR_FACT (
   acct_mgr_fact_sequence   INTEGER,   -- surrogate key
   acct_sequence            INTEGER,   -- reference to ACCOUNT_DIM
   acct_manager_sequence    INTEGER,   -- reference to ACCT_MGR_DIM
   acct_mgr_fact_begin_date DATE,
   acct_mgr_fact_end_date   DATE
)

Now, what does this look like?  Well, quite frankly, this starts to look like a Data Vault model.  It really isn’t rocket science, and it’s not hard to spot why we’d do this.  Dr. Kimball calls this a “factless fact”, and in some cases others call it a “helper table”, because it links two dimensions together.

In the Data Vault Model, this IS a Link Table.  It really is that simple.
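
To make that concrete, here is a minimal sketch of what those same structures might look like as Data Vault Hubs and a Link.  The table and column names here are illustrative only; they are not part of the case study above:

CREATE TABLE HUB_ACCOUNT (
   acct_sequence   INTEGER,       -- surrogate key, added for join performance
   acct_number     VARCHAR(20),   -- the account business key: the only field with business value
   load_date       DATE,          -- system field for traceability
   record_source   VARCHAR(50)    -- system field for traceability
)

CREATE TABLE HUB_ACCT_MGR (
   acct_manager_sequence INTEGER,      -- surrogate key
   account_manager_id    VARCHAR(20),  -- the account manager business key
   load_date             DATE,
   record_source         VARCHAR(50)
)

CREATE TABLE LNK_ACCOUNT_MGR (
   link_account_mgr_sequence INTEGER,   -- surrogate key
   acct_sequence             INTEGER,   -- reference to HUB_ACCOUNT
   acct_manager_sequence     INTEGER,   -- reference to HUB_ACCT_MGR
   load_date                 DATE,
   record_source             VARCHAR(50)
)

Because the Link is a pure many-to-many table, it can hold today’s “1 account manager for many accounts” and the historical “1 account had 4 account managers” side by side, without any structural change.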

And yes, when modeled at the granular level of detail, they both (Data Vault & Dimensional Models) solve the specific problem I have put forward.

Ok, so why continue on with the Data Vault?  Why go further? Why not just stop here?

Back to the discussion of the pain we are solving, and the benefits to be had.

  • Business Keys – Hubs in the Data Vault are specifically built to address change, consolidation, and master key identification of business keys.  Hubs in their purest form, without the system fields, are in fact 6th normal form.  In other words, the only field needed in a Hub for it to have business value is the business key.  However, due to relational database join performance, the system demands that we add a surrogate sequence for speed.  Due to tracking, traceability, and other requirements, the business demands that we add record sources and load dates.  The same argument exists in dimensional models for sequences and begin & end dates, so both modeling techniques agree on this point.
    However, that’s where it changes again.  There are some discrepancies in the dimensional model when it comes to defining what the business key means, particularly if a) there are multiple business keys in the dimension, b) there are historical fields (like _1, _2, _3), or c) there are embedded hierarchies where the hierarchical data has dropped its key before loading.
    In the Data Vault, separation & identification of the business key ensures consistent granularity, which of course leads to questions about the grain of the fact table when dimensions carry so much rolled-up data.
  • Satellites – descriptors in the Data Vault are specifically built to address changing descriptive content.  The Satellite provides what’s known as an inverted dimension, meaning it is a dependent child of the parent Hub or Link table: inverted because, rather than produce its own sequence key, it inherits it.  The Satellites allow us to vertically partition our descriptive data when we see the need.
    Columnar databases, NoSQL datastores, KV data stores, and graph data stores all take this approach of vertical partitioning; the Data Vault model simply makes it a viable alternative in a relational database world.  You would be hard-pressed to vertically partition a single dimension without a) adding more mini-dimensions, or b) adding snowflaking to some degree.  (A sketch of a Satellite appears just after this list.)
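
As a rough illustration (again, the names are mine, not prescriptive), a Satellite hanging off the account Hub sketched earlier might look like this:

CREATE TABLE SAT_ACCOUNT (
   acct_sequence      INTEGER,       -- inherited from HUB_ACCOUNT, not generated here
   load_date          DATE,          -- together with the inherited key, tracks change over time
   acct_description   VARCHAR(100),  -- descriptive content lives here, not in the Hub
   acct_customer_name VARCHAR(100),
   record_source      VARCHAR(50)
)

If the need arises, the descriptive columns can be split across two or more Satellites on the same Hub key (for example, by rate of change), which is the vertical partitioning described above.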

So, we separate the business key.  Consider the case above with the first dimension listed:

CREATE TABLE ACCOUNT_DIM (
   acct_sequence        INTEGER,       -- surrogate key
   acct_begin_date      DATE,
   acct_end_date        DATE,
   acct_number          VARCHAR(20),   -- account business key
   acct_description     VARCHAR(100),
   acct_customer_name   VARCHAR(100),
   account_manager_id   VARCHAR(20),   -- embedded account manager business key
   account_manager_name VARCHAR(100)
)

My question to you is: What is the grain of the acct_sequence column?

Really.  What does it represent?  What happens to the acct_sequence field when there is a change to acct_description?  How about acct_customer_name?  Well, that’s all fine and well – the acct_sequence increments in this case, indicating a change to the descriptors.  BUT what happens when there is a change to account_manager_name?  Yes, therein lies the conundrum.  The grain shift!  You’ve changed the data in the hierarchy, one level up.  NOTHING about the account has changed at all, yet you are still forced to increment the acct_sequence.

Once you’ve done this, you now need to re-assign all existing facts, or re-load all facts, to represent the new dimensional record.  Or you simply leave the “old facts” tied to the old record (which is pretty much the normal thing to do), and just overcome the changes by “nesting your query”, or using a hierarchical query, to “go back in time for all facts for customer X regardless of account manager”.
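
A sketch of that kind of workaround query, assuming a fact table (called SALES_FACT here, which is not part of the case study) keyed by acct_sequence:

-- Fetch all facts for a given customer account, regardless of which
-- account-manager version of the dimension row each fact points to.
SELECT f.*
  FROM SALES_FACT  f
  JOIN ACCOUNT_DIM d ON f.acct_sequence = d.acct_sequence
 WHERE d.acct_number = :acct_number   -- match on the business key, not on one surrogate row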

BUT… what happens if there are records assigned to that OLD account with the OLD manager that arrive today (after the manager name has already been changed)?  Where do you put those facts?  Do you use the ETL to match BOTH business keys?  If you do not match both business keys, then you are assigning the raw factual data to the wrong dimensional record.  What I am saying is: the more hierarchies you have in a dimension, the more compounded this problem becomes, and the more complexity and IF conditions you introduce to the ETL to solve it – and I’m still talking about RAW data here, never mind the business rules.

What happens when you load history to this dimension where the history doesn’t match the “begin-end-dates” that you’ve assigned to the customer account?  How do you deal with these things?

You see, complexity snowballs.  But back to my point – the business keys in dimensions become diluted, and eventually nearly impossible to define properly.  The granularity of type 2 dimensions also becomes diluted with time.

To simplify things in the Data Vault, we separate the descriptors (Satellites) from the relationships and hierarchies (Links) and from the keys (Hubs).  So the Hub key has 1 and only 1 meaning / grain forevermore.
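
Using the illustrative Hub, Link, and Satellite sketches from earlier, putting a dimensional-style picture back together for delivery is just a set of joins (point-in-time filtering on load_date is omitted here for brevity):

-- Which manager is related to which account, together with the account's descriptors.
SELECT ha.acct_number,
       sa.acct_description,
       sa.acct_customer_name,
       hm.account_manager_id
  FROM LNK_ACCOUNT_MGR l
  JOIN HUB_ACCOUNT     ha ON ha.acct_sequence         = l.acct_sequence
  JOIN HUB_ACCT_MGR    hm ON hm.acct_manager_sequence = l.acct_manager_sequence
  JOIN SAT_ACCOUNT     sa ON sa.acct_sequence         = ha.acct_sequence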

So to summarize:

Let me just sum up some things I put forward here (some are value-add, others are just simple summary points)

  • Flexibility – the ability to store all relationships, across any time frame and any source system, stems from using many-to-many Link structures.
  • Scalability – near-linear scalability, using discrete mathematics and set logic, results from the ability to vertically partition the descriptor tables (Satellites).  If you go all the way to 6th normal form, you end up with a Data Vault that looks like an Anchor model and functions like a columnar database.  That’s the easy way to say it.
  • Removal of Ambiguity – the problems with defining the grain of business keys, the grain of facts, and the grain of historical data go away with the Data Vault, again due to the placement of the tables, the structure of the tables, and the overall architecture of the Data Vault model.
  • Removal of the need for re-engineering – changes can easily be absorbed without re-engineering your model to suit the current needs of the business, changes to the source systems, or the nature of the historical and future data.
  • Master Business Key Lists – easily and purposely built in the Data Vault.  This is not so much a function of the model as a result of the implementation methodology, so I’ll leave this one for later.  But yes, we end up with distinct and unique business key lists in master format in the Data Vault.

In the end, I am not here to sell you on Data Vault, nor am I here to convince you that you should use it.  All I am doing is presenting one of many alternatives, should you wish to try it.  I hope you found this post enlightening or at least interesting enough to give it some further thought.

There are reasons, values, and justifications for using the Data Vault – but only when you have the specific pain points outlined here.  Also remember: the Data Vault model is a hybrid of normalization techniques and dimensional modeling techniques.  It is not new in any sense of the word; it just has new rules built around how to structure your data model.

I am looking forward to hearing your comments below.

Thanks,
Dan Linstedt

 


2 Responses to “#datavault model and understanding, back to basics”

  1. Jonathan 2012/06/07 at 1:07 pm #

    This is a great post that I am pointing others to read. You have done a great job in outlining the basics with examples that bring it to life. It also keeps both feet on the ground because implementing EDW’s is always challenging, and the data vault model and methodology is innovation at work to allay the challenges and risks in doing so.

  2. dlinstedt 2012/06/07 at 2:33 pm #

    Thanks Jonathan, I appreciate the feedback. I will try to post a few more of these things as I go along. If you come up with any ideas that you think I should discuss, please feel free to let me know.
