Overview of Data Vault Loading Patterns

the data vault model is full of patterns: deterministic, finite patterns.  one of the benefits and objectives of this is to allow the automatic generation of loading routines (etl/elt, sql, stored procedures, etc.).  in this entry i dive a little deeper into the pattern side of loading the data vault: what drives it, why it is architected this way, how it works, and how easy (or hard) it might be to set up your own "generation of code" for loading patterns.

what drives the patterns?  why is it architected this way?

it really comes down to efficiency.  when building the dv architecture and rules, it seemed like a good idea to apply design patterns.  i was knee-deep in sei/cmmi level 5 at the time, and we were all about business process optimization.  i figured that one way to optimize the it side of the business was to ensure that the data warehouse data model could be built in a repetitive fashion.  so, the data vault model follows strict rules which allow it to adhere to common and consistent patterns.  the other driver for the model was quite simply:

i was tired of "building" mundane data flows, over and over and over again.  you see, i'm a reductionist at heart (when it comes to implementation; long-winded when it comes to blogging).  quite simply put, i had no time or money to "hire" new resources to do the grunt work for me, and we had a deadline looming.  i placed my hope in the "patterns" and rule-sets that govern the data vault model, and built my first etl code generator.

what do i need to understand in order to write my own generator?

first, be able to "classify" or categorize data models.  for instance: are they 3nf (normalized, sort of), staging (mostly separate/distinct tables), data vault (hub, link, satellite), or star schema (dimension, degenerate dimension, fact)?  once the model classification has been done, table classification needs to be done (as i just mentioned, determining the table type is important).  why?  because you will need one kind of code to load dimensions, and a different kind of code to load facts.  but in the end, all dimensions should load the same way, and all facts should load the same way.
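the table-classification step above can be sketched in a few lines.  this is a minimal sketch, assuming a naming convention (hub_/lnk_/sat_ prefixes) that is purely illustrative; a real generator would read table metadata from a model repository or the database catalog instead of guessing from names.

```python
# minimal sketch: classify data vault tables so the generator knows
# which loading template applies. the prefixes and TableType names are
# assumptions for illustration -- substitute your own shop's convention.
from enum import Enum


class TableType(Enum):
    HUB = "hub"
    LINK = "link"
    SATELLITE = "satellite"
    UNKNOWN = "unknown"


# assumed naming convention: prefix determines table type
PREFIX_MAP = {
    "hub_": TableType.HUB,
    "lnk_": TableType.LINK,
    "sat_": TableType.SATELLITE,
}


def classify_table(table_name: str) -> TableType:
    """Classify a table by its naming-convention prefix (case-insensitive)."""
    name = table_name.lower()
    for prefix, table_type in PREFIX_MAP.items():
        if name.startswith(prefix):
            return table_type
    return TableType.UNKNOWN


print(classify_table("hub_customer").value)          # hub
print(classify_table("sat_customer_address").value)  # satellite
```

each classified type then maps 1:1 to a loading template, which is exactly what makes the generation repetitive and automatable.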

in other words, pattern-based designs – or template-driven code generation.

the data vault loading patterns are grouped as follows:

  • real-time
  • batch
    • database to database
    • flat file to database

then, they are sub-divided into additional patterns:

  • source to stage
  • source to hub
  • source to link
  • source to satellite
  • stage to hub
  • stage to link
  • stage to satellite

and so on.  understanding these patterns gives you a good chance of designing an appropriate template to match.  from there, it's also a matter of understanding or knowing the properties of the target table.
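to make the "stage to hub" pattern concrete, here is a sketch of template-driven generation.  the table and column names (stg_customer, hub_customer, customer_nbr, load_dts, record_src) and the SQL dialect are illustrative assumptions, not a prescribed standard; the point is that one template serves every hub once the target table's properties (name, business key) are known.

```python
# sketch: render a "stage to hub" loading statement from a template.
# names and dialect are assumptions for illustration; real generators
# pull this metadata from a repository or the database catalog.
from string import Template

HUB_LOAD_TEMPLATE = Template("""\
INSERT INTO $hub (${bk}, load_dts, record_src)
SELECT DISTINCT s.${bk}, s.load_dts, s.record_src
FROM $stage s
WHERE NOT EXISTS (
    SELECT 1 FROM $hub h WHERE h.${bk} = s.${bk}
)""")


def generate_hub_load(stage: str, hub: str, business_key: str) -> str:
    """Render the hub-loading SQL for one stage/hub pair."""
    return HUB_LOAD_TEMPLATE.substitute(stage=stage, hub=hub, bk=business_key)


print(generate_hub_load("stg_customer", "hub_customer", "customer_nbr"))
```

run the same function over every (stage, hub, key) triple in the model and you've generated the entire hub-loading layer – that's the mundane work the patterns take off your plate.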

so what’s the secret?

there really isn't one, but if you need it: it's a rule of thumb i learned long ago doing performance and tuning across many different etl tools (running 1.1 billion rows through a single process at high speed, etc.).  the secret, as it were, is: design your etl with 1 target per data flow per action.  an action may be an insert, update, or delete – and the 1 target, well, that speaks for itself.
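applied to a satellite, the rule above means the generator emits one insert-only statement and one end-dating update statement instead of a single merge – each with exactly one target and one action.  this is a sketch under assumed names (hash_diff for change detection, load_end_dts for end-dating); they are illustrative, not prescribed.

```python
# sketch of "1 target per data flow per action" for a satellite load:
# one insert-only flow plus one end-dating update flow, each owning a
# single target table and a single action. names are assumptions.

def satellite_flows(sat: str, stage: str, key: str) -> dict:
    """Return one generated statement per (target, action) pair."""
    insert_sql = (
        f"INSERT INTO {sat} ({key}, load_dts, hash_diff)\n"
        f"SELECT s.{key}, s.load_dts, s.hash_diff\n"
        f"FROM {stage} s\n"
        f"WHERE NOT EXISTS (SELECT 1 FROM {sat} t\n"
        f"  WHERE t.{key} = s.{key} AND t.hash_diff = s.hash_diff)"
    )
    update_sql = (
        f"UPDATE {sat} t SET load_end_dts = s.load_dts\n"
        f"FROM {stage} s\n"
        f"WHERE t.{key} = s.{key} AND t.load_end_dts IS NULL\n"
        f"  AND t.hash_diff <> s.hash_diff"
    )
    return {"insert": insert_sql, "update": update_sql}


flows = satellite_flows("sat_customer", "stg_customer", "customer_hk")
```

because each statement owns a single target and a single action, it can be restarted, tuned, or parallelized on its own – which is where the benefits below come from.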

not only will this simplify your data loading paradigms, it will reveal the easy-to-pattern loading routines.  it has the added benefits of being a) restartable – when designed properly, b) highly performant out of the gate, and c) easily parallelized or shared over a grid (partitioned).

so, discover the patterns for loading the data vault (i've shared some here), follow the basic rules along with the "secret" mentioned above, and you should have major success.

please feel free to comment on what’s here – offer your insights as it were.

hope this helps,
dan linstedt


3 Responses to “Overview of Data Vault Loading Patterns”

  1. M. Streutker 2010/08/07 at 3:09 am #

    Hi Dan,

    interesting read.
    At this moment we are looking into ways to define a ‘generic’ pattern to feed our DV with data.
    We’re currently thinking about creating a loading process for every supplying source system, i.e. first load all hubs/links/sats for System A, then load all hubs/links/sats for System B, etc.
    If there is a HUB ‘client’ which is supplied by both System A and System B, then this HUB can be filled in loading process A or loading process B (whichever one is first, of course).
    In this scenario, if there are ‘shared’ hubs across systems, then these systems cannot run in parallel (to avoid collisions).
    So basically, this pattern is ‘load the data from the system’s perspective’.

    As an alternative we could choose to first load all data in the staging (both System A and System B), then load all hubs/links/sats.
    In this case HUB ‘client’ would be, e.g., one ETL package with 2 dataflows (one for System A and one for System B).
    This pattern could be defined as ‘load the data from the datavault’s perspective’.

    Now the second option should be more scalable (more options to run systems in parallel), but you’re also dependent on all source systems while loading, i.e. you can only start to fill the DV once the staging area is ready (and the SA will be ready when ALL systems are processed).

    What I like about the first option is the clear separation of systems while loading the datavault, but unfortunately I’ve been unable to find any documentation on whether this is a recommended pattern.
    Could you give your opinion on this case (loading source->staging->DV per system vs. loading source->staging->DV for all systems simultaneously)?
    Any feedback is greatly appreciated.

    Kind regards,

    M. Streutker

  2. dlinstedt 2010/08/07 at 4:50 am #


    Yes, I will blog on this concept of loading the Data Vault to give you some ideas as to how all this works. There is so much material on this concept that I’ve put many of the how-to’s, the steps, the concepts and the videos (including tool use and demonstrations) inside the coaching area. By having it in the coaching area, I can include downloads, templates, best practices – and I can guarantee all your questions are answered appropriately.

    I will give you some insight though by blogging on the topics here. Next year I am starting my next book: How to implement a Data Vault in Two Weeks or Less. I should have my technical modeling book available before December of this year.

    Dan Linstedt

  3. Apex 2010/08/14 at 7:16 am #

    >I should have my technical modeling book available before December of this year.

    It’s just great news! I’m looking forward to reading it :)
