ETL and the Data Vault

sneak peek at implementation components (something i’m starting a 3rd book on – yes i know, i have to get the 2nd book about the modeling published – it’s coming i promise!)   nearly every etl or elt tool works well with the data vault.  why is this?  it is because of the etl/elt tools’ ability to run based on patterns.  the data vault model is based on patterns of design (hub, link, satellite) along with very specific rules for loading.  these rules + patterns make it very easy to construct etl/elt load routines.

idea #1: patterns make for good optimization

having patterns allows a minimum of variance (in design of the load) for each table type.  for instance, there is really only 1 way you can load a hub – there may be 3 ultimate mechanisms, but 1 most efficient way – and depending on the tool that may vary.  especially if you have push-down (or elt – in database sql).  so, in light of this pattern – it is easy to spot the flaws in the architecture early on, correct them, and offer solid solutions for high-performance through parallelism.  this goes for satellite, and link loads too.

idea #2: patterns make for good automation

having patterns provides great foundations for automating the workflow.  particularly when each “server” running the etl/elt engine has different configuration specifications.  the workflow must remain flexible while the loading routines remain strong and balanced.  it is _very_ easy to automate the loading processes to the staging area, and to the data vault.  it is because of idea #1, and the base rules and patterns of specific structures that make it this way.  automation is simple, and yet dynamic.  increase the server’s hardware, and voila – you can easily change the automated loading pattern without touching the underlying code.

idea #3: patterns drive repeatability and good documentation

with the patterns established, there are only so many ways you can/should load a hub, load a link, and load a satellite.  because these are design patterns, it makes the task of creating loading cycles redundant and mundain.  but at the same time – the load processes should follow a single document, meaning that every hub load follows one “process diagram” and is documented one time, every link load follows a single process diagram and is documented one time, and nearly every satellite load follows a single process diagram and is documented one time.  this allows you to focus on what you need to do to tweak and customize 20% of the loading processes (should be closer to 10%).  this means that 80% or 90% of your loading processes are fully documented, easily understood, and easily optimized because they are repeatable in their design pattern.  it also means ramp-up-time for a new resource is very very short.  they say people learn by example, well – with 100 tables or 1,000 tables in the data vault, you have only 3 loading patterns to teach, and hundreds of examples for them to play with.

idea #4: repeatable design patterns are easily made restartable

what is built in to the loading patterns (which is one of the concepts we teach in class), is restartability.  the idea that once the problem is fixed, it’s a simple matter of “restarting” the loading process from where it left off.  all loading processes going in to the data vault are designed in such a way that they *do not* load duplicates, can pick up where they left off, and *do not* need code changes to make them restartable in nature.

idea #5: design patterns drive high performance

the design patterns used in loading the data vault have already been tuned to provide the maximum performing architecture.  sure, you have to add physical components like partitioning, parallelism, cache management, and database tuning – but those always need to happen.  in this case, the pattern is already created for *optimal* performance, so that when you partition – you don’t have to change the code to get it to perform.  when you add parallelism, or database tuning, the “data mapping” automatically inherits the performance benefits.   too often, when we think of data warehouse performance and tuning in the etl world, we end up “changing” the code in the etl layers, and what happens is: we gain performance for one load, but when the data changes – we lose performance later.  with the design patterns of the data vault, we gain optimized performance out of the gate.

idea #6: design patterns make for easy “generation” of loading processes

as a part of sei/cmmi level 5, automation and optimization bubble to the top.  why should etl be any different?  why should we spend/waste our time hand-coding 100% customized loading processes all the time?  we shouldn’t!  with the data vault design patterns we don’t have to!   we can actually construct a load-process generator for our current etl tool, and generate 80% to 90% of the loading processes – as long as we have a data lineage mapping from the staging area to the data vault, this is what makes this work.  i have many customers generating loading code for owb, ssis, informatica, pentaho, talend, and so-on.  this leaves us time to build the real-work/real-customization where we have the most impact, and focus on the business rules (moving data from the data vault to the star schemas) – which by the way can be base-line generated so we have a starting point.

there are more, but i hope this helps get you started.  please let me know what you want to hear about.

dan l

Tags: , , , , , , , ,

No comments yet.

Leave a Reply