the data vault model is full of patterns; deterministic, finite patterns. one of the benefits and objectives of this is to allow the automatic generation of loading routines (etl/elt, sql, stored procedures, etc.). in this entry i dive a little deeper into the pattern side of loading the data vault: what drives it, why it is architected this way, how it works, and how easy (or hard) it might be to set up your own code generation for loading patterns.
what drives the patterns? why is it architected this way?
it really comes down to efficiency. when building the dv architecture and rules, it seemed like a good idea to apply design patterns. i was knee-deep in sei/cmmi level 5 at the time, and we were all about business process optimization. i figured that one way to optimize the it business was to ensure that the data warehouse data model could be built in a repeatable fashion. so, the data vault model follows strict rules which allow it to adhere to common and consistent patterns. the other driver for the model was quite simply:
i was tired of “building” mundane data flows, over and over and over again. you see, i’m a reductionist at heart (when it comes to implementation, long-winded when it comes to blogging)… quite simply put, i had no time or money to “hire” new resources to do the grunt work for me, and we had a deadline looming. i placed my hope in the “patterns” and rule-sets that govern the data vault model, and built my first etl code generator.
what do i need to understand in order to write my own generator?
first, be able to "classify" or categorize data models. for instance: is the model 3nf (normalized, more or less), staging (mostly separate/distinct tables), data vault (hubs, links, satellites), or star schema (dimensions, degenerate dimensions, facts)? once the model classification is done, table classification needs to be done (as i just mentioned, determining table type is important). why? because you need one kind of code to load "dimensions" and a different kind of code to load "facts". but in the end, all dimensions should load the same way, and all facts should load the same way.
in other words, pattern-based designs, or template-driven code generation.
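to make that a bit more concrete, here is a minimal python sketch of the classification step driving template selection. the naming convention (hub_/link_/sat_ prefixes) and the template file names are purely hypothetical stand-ins for whatever your own modeling standard and template library actually look like:

```python
# a minimal sketch of table classification driving template selection.
# the hub_/link_/sat_ prefixes and template names are hypothetical --
# substitute whatever convention and template library you actually use.

TEMPLATES = {
    "hub": "hub_load.sql.tpl",
    "link": "link_load.sql.tpl",
    "satellite": "sat_load.sql.tpl",
}

def classify_table(table_name: str) -> str:
    """return the data vault table type for a given table name."""
    name = table_name.lower()
    if name.startswith("hub_"):
        return "hub"
    if name.startswith("link_"):
        return "link"
    if name.startswith("sat_"):
        return "satellite"
    return "unknown"

def pick_template(table_name: str) -> str:
    """one table type, one template -- the heart of pattern-based loading."""
    table_type = classify_table(table_name)
    return TEMPLATES.get(table_type, "manual_review_required")

print(pick_template("hub_customer"))       # hub_load.sql.tpl
print(pick_template("sat_customer_addr"))  # sat_load.sql.tpl
```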
the data vault loading patterns are grouped as follows:
- database to database
- flat file to database
then, they are sub-divided into additional patterns:
- source to stage
- source to hub
- source to link
- source to satellite
- stage to hub
- stage to link
- stage to satellite
and so on. understanding these patterns gives you a good chance of designing an appropriate template to match; from there it's a matter of understanding or knowing the properties of the target table.
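as an illustration of what such a template might look like, here is a hedged python sketch that renders a "stage to hub" insert from the target table's properties. the metadata dict, table names, business key, and hash expression are all hypothetical placeholders, not a fixed standard:

```python
# a minimal sketch of template-driven generation for a "stage to hub" load.
# the metadata dict stands in for whatever repository holds your target
# table properties; every name below is a hypothetical placeholder.

HUB_TEMPLATE = """
insert into {hub_table} ({hub_key}, {business_key}, load_date, record_source)
select distinct
    {key_expr},
    stg.{business_key},
    stg.load_date,
    stg.record_source
from {stage_table} stg
where not exists (
    select 1
    from {hub_table} h
    where h.{business_key} = stg.{business_key}
);
""".strip()

def generate_hub_load(meta: dict) -> str:
    """fill the pattern with one target table's properties."""
    return HUB_TEMPLATE.format(**meta)

print(generate_hub_load({
    "hub_table": "hub_customer",
    "hub_key": "hub_customer_key",
    "business_key": "customer_number",
    "key_expr": "md5(stg.customer_number)",  # or a sequence, per your own standard
    "stage_table": "stg_customer",
}))
```

the same idea repeats for links and satellites: one template per table type, filled in from metadata, which is what makes the generation automatic.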
so what’s the secret?
there really isn't one, but if you need it: it's a rule of thumb i learned long ago doing performance and tuning across many different etl tools (running 1.1 billion rows through a single process at high speed, etc.). the "secret", as it were, is: design your etl with 1 target per data flow per action, where the action may be an insert, update, or delete, and the 1 target speaks for itself.
not only will this simplify your data loading paradigms, it will reveal the easy-to-pattern loading routines. it also has the added benefits of a) being restartable (when designed properly), b) being highly performant out of the gate, and c) being easily parallelized or spread over a grid (partitioned).
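to show what "1 target per data flow per action" can look like in practice, here is a hedged sketch (hypothetical table and column names, postgres-flavored sql) of a satellite load split into two flows: one insert-only flow for new or changed rows, and one update-only flow that end-dates the superseded rows, instead of a single mixed merge:

```python
# two separate flows against the same satellite target: one does only
# inserts, the other does only updates (end-dating). each flow can be
# restarted or parallelized on its own. all names are hypothetical.

SAT_INSERT = """
insert into sat_customer (hub_customer_key, load_date, load_end_date, hash_diff, customer_name)
select stg.hub_customer_key, stg.load_date, null, stg.hash_diff, stg.customer_name
from stg_customer stg
left join sat_customer s
  on  s.hub_customer_key = stg.hub_customer_key
  and s.load_end_date is null
where s.hash_diff is null
   or s.hash_diff <> stg.hash_diff;
""".strip()

SAT_END_DATE = """
update sat_customer s
set load_end_date = stg.load_date
from stg_customer stg
where s.hub_customer_key = stg.hub_customer_key
  and s.load_end_date is null
  and s.hash_diff <> stg.hash_diff;
""".strip()

# one target, one action per statement -- run them as independent flows
for statement in (SAT_INSERT, SAT_END_DATE):
    print(statement, end="\n\n")
```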
so, discover the patterns for loading the data vault (i've shared some here), follow the basic rules along with the "secret" mentioned above, and you should have major success.
please feel free to comment on what’s here – offer your insights as it were.
hope this helps,