Data Science

Generating #datavault models & Issues to address

hello everyone, i’ve been around the world and back, and i’ve seen and heard about many different kinds of automation tools for data vault systems out there.  in fact, i partner with and endorse one of them: mapping manager.  but something has been consistently bothering me, and in this blog i will uncover the issues (as i see them) with data vault model generators.


there is a need for automation and generation, and from the perspective of auto-generating processing code, no one does it better than my friends with their tool mapping manager and their specific catfx (code automation templates).  there are plenty of other tools out there that also generate code, including wherescape and quipu, just to name a few.

anyhow, over the past two years, i’ve heard about “tool x and tool y” that supposedly generate data vault models.  well, let me tell you: not many tool vendors have stopped to understand just what they are really claiming.  or, in the other case, they claim to generate data vault models when in reality all they truly do is use primary and foreign keys to re-structure source system data models (making them look like a data vault model).

don’t get me wrong: with a bit of mathematical skill, an understanding of circular relationships, and multi-tenant pk and fk attributes, the end result *might* be ok, but it is nowhere near what i would call a raw data vault model, fully vetted with even its purest of intentions.  what these tools typically produce is what ronald damhof and i wrote about years ago, called a source system vault.
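to make the re-structuring claim concrete, here is a minimal sketch of what such a naive “model generator” actually does.  all table names, keys, and the metadata layout are hypothetical examples, not any vendor’s real implementation:

```python
# source-system metadata, as a tool might read it from a database catalog
# (table names, keys, and columns here are invented for illustration)
tables = {
    "customer": {"pk": ["customer_id"], "fks": {}, "cols": ["name", "region"]},
    "order":    {"pk": ["order_id"],
                 "fks": {"customer_id": "customer"},
                 "cols": ["order_date", "total"]},
}

def naive_vault(tables):
    """Mechanically map PKs to hubs, FKs to links, other columns to sats."""
    hubs = {t: meta["pk"] for t, meta in tables.items()}
    links = [(t, parent) for t, meta in tables.items()
             for fk, parent in meta["fks"].items()]
    sats = {t: meta["cols"] for t, meta in tables.items()}
    return hubs, links, sats

hubs, links, sats = naive_vault(tables)
print(hubs)   # {'customer': ['customer_id'], 'order': ['order_id']}
print(links)  # [('order', 'customer')]
```

notice the problem: the “hub keys” here are surrogate primary keys from one source system, not consolidated business keys — which is exactly why the output is a source system vault, not a data vault.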

what’s the difference?

there are quite a few differences, but at the end of the day it’s the lack of focus these “generated models” have on business key consolidation that has me worried.  i’ve worked on this problem in my own software for years (i’ve written model generation and consolidation software for over 15 years), and even i still don’t have it completely solved.  that said, here is a short list of issues that these tools (usually) don’t even begin to address.

if or when they do address them, they will need ph.d.-level knowledge in mathematics and semantics to make it work right:

  • multi-source model consolidation
  • business key focus
  • relationship folding (where appropriate)
  • ontology and taxonomy aware decisions
  • semantic lookups and references
  • incorporation of business terms & glossaries
  • cyclical relationship resolution

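to show what even the simplest of these items involves, here is a toy sketch of business key consolidation using a semantic lookup against a business glossary.  the glossary entries, system names, and field names are all hypothetical, and a real solution would need fuzzy matching, ontologies, and human review on top of this:

```python
# hypothetical glossary mapping source-system field names to one business term
glossary = {
    "cust_no": "customer_number",
    "customer_id": "customer_number",
    "client_ref": "customer_number",
}

def consolidate_keys(source_keys, glossary):
    """Group candidate business keys from many systems under one business term."""
    consolidated = {}
    for system, key in source_keys:
        term = glossary.get(key)          # semantic lookup; None = unresolved
        consolidated.setdefault(term or key, []).append((system, key))
    return consolidated

keys = [("crm", "cust_no"), ("erp", "customer_id"), ("web", "client_ref")]
print(consolidate_keys(keys, glossary))
# all three fields land under 'customer_number' -> one hub, not three
```

without the glossary lookup, a pk/fk-driven generator would emit three separate hubs for what the business considers a single customer concept.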
that doesn’t even begin to address the mathematics side of the house, where we need to apply things like:

  • dijkstra’s algorithm
  • shortest path & longest path algorithms
  • heaviest weight & most meaningful node decisions
  • k-means clustering
  • probability distribution
  • confidence and strength ratings
  • association rules
  • backpropagation

and more!
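to give a flavor of the first two items, here is a small sketch of dijkstra’s algorithm run over a schema relationship graph.  the tables and the edge weights (a hypothetical “relationship strength” score) are invented for illustration — the point is only that a real generator would need to reason over paths like these when deciding how to fold structures:

```python
import heapq

# schema relationship graph: nodes are tables, weights are a hypothetical
# "relationship strength" a generator would have to reason over
graph = {
    "customer":   {"order": 1, "address": 2},
    "order":      {"order_line": 1},
    "address":    {},
    "order_line": {},
}

def dijkstra(graph, start):
    """Shortest relationship paths from one entity to all reachable others."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

print(dijkstra(graph, "customer"))
# {'customer': 0, 'order': 1, 'address': 2, 'order_line': 2}
```

and that is just one algorithm from the list, applied to a four-table toy graph; combining it with clustering, confidence ratings, and semantic decisions is where the real difficulty lives.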

now, all that to say: i don’t have a mathematics degree, or a ph.d. in any of these things, but after 15 years of study, writing code, and applying different algorithms, i know these are not easy problems to solve.  i have some of the answers and quite a bit of vision in this area, but i can say without a shadow of a doubt that nearly every tool i’ve reviewed in the past 10 years that claims to generate data vault models doesn’t do it right.

please don’t take this to mean that the automation / generation tools aren’t good.  the ones that generate elt / etl and big data integration code are awesome, and spot on!  what i’m talking about here are the tools that claim to generate data vault models.

what should i do with this knowledge?

well, good question.  today the absolute best way to construct your data vault model is to do it with your team.  engage in discussions about the business, and focus on the business keys and their consolidation horizontally across lines of business.  you and your team will come up with a far better result than anything that simply “generates” based on primary keys (pk) and foreign keys (fk), much less a tool that uses only one source model.

what’s the danger of using a tool to generate a data vault model?

you end up with:

  1. too many data vault tables,
  2. a source system data vault model that fails to account for multi-system integration,
  3. disparate, duplicated, and unnecessary hubs, links, and satellites,
  4. a lack of integration across multiple systems,
  5. siloed solutions and unmatched answers out of the marts at the end of the system.


of course, there are conclusions to draw.  for generating etl, elt, and big data solutions, stick with my friends and their tool mapping manager.  for data vault model generation: go your own way and do it by hand.  learn how to properly consolidate business keys, business ontologies, and taxonomies; study the business metadata, and learn how to fold structures together with passive integration.  today, the human mind is far more powerful than the machine when it comes to contextual decisions about metadata.

as always, you can contact me for data model reviews, guidance, assessments and the like.  you can see a description of the packages we offer here:

hope this helps,

dan linstedt

