
Want to change or add a #DataVault Standard?

for many years, i have built, authored, and maintained the #datavault standards.  this includes data vault 1.0 and data vault 2.0.  there are others in the community who believe that “these standards should evolve and be changed by consensus of the general public”.

i have a number of issues with this approach.  in this article i will describe what it takes to author a standard for the data vault 2.0 system of business intelligence.  you are certainly more than welcome to contribute to the standards body of knowledge around data vault, i simply want contributions to be held to the highest level of integrity.


why people insist on “breaking the rules and standards” i set forth is beyond me.  would you trust a heart surgeon who has never been to school for proper training (standard methods and procedures) to operate on your heart?  how about a brain surgeon?  well, of course it goes without saying that when your life depends on it (whatever it is, from a car functioning properly in a crash, to an airplane flying according to the laws of physics) all of a sudden: good standards matter.

now, with any set of standards there are the purist standards (those that i document), and the pragmatic standards (those that contain minor alterations or deviations).  the bigger the gap between the purist standards and the pragmatic standards, the more likely the project / process / design will fail under stress.

the issue isn’t necessarily the alteration itself, it’s the lack of rigor applied to testing the pragmatic approach and alteration proposed that eventually results in failure.

there are some cases, on specific projects, where i have vetted and approved minor alterations for a pragmatic approach to implementing data vault.  one such case is for teradata.  because of the way the relational engine works, a hash key is not necessary – and by the way, neither is a surrogate sequence identifier!  teradata can and does hash its primary key / business key under the covers.  this is an optimization not made by most other platforms (except sap hana).
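on platforms that do not hash under the covers, the hash key is computed in the load process instead.  a minimal sketch in python – the md5 function and the trim / upper-case hardening rule shown here are common dv 2.0 conventions, and the function name is illustrative, not part of the standard:

```python
import hashlib

def hash_key(*business_keys, delimiter=";"):
    """Compute a hash key from one or more business key parts.

    Each part is trimmed and upper-cased before hashing, so that
    'cust123 ' and 'CUST123' resolve to the same hub record.
    A delimiter keeps multi-part keys from colliding ('A','B' != 'AB').
    """
    normalized = delimiter.join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# deterministic: the same business key always yields the same hash key
print(hash_key("cust123 ") == hash_key("CUST123"))  # True
```

the key property being exercised is determinism: any process, on any platform, can recompute the same key from the same business key, which is what makes hash keys loadable in parallel without a central sequence generator.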

most of the time, however, the standards as i have defined them must stay in place – or some part of the architecture, methodology, model, or implementation will suffer (in some cases, multiple parts will suffer).  then come my competitors, whom i originally taught data vault 1.0 to.  they make claims as they see fit.  i’ve made a list of some of their false claims below:

poor judgement claims made in the market:

  • a link can be a hub
  • a hub can have a foreign key to another hub
  • a satellite can have its own sequence id / primary key identifier
  • sequences are fine to continue utilizing, you don’t need hash keys
  • standards for data vault should be managed by consensus, and by the community at large
  • satellites can have more than one foreign key to more than one parent structure
  • you don’t need change data capture
  • data vault 2.0 is nothing more than a change from sequences to hash keys at the modeling level

and more!  some are far too outlandish to list here, they would simply provide a good laugh.

want to suggest a change to the standards?

i am not saying you cannot suggest changes.  i have always kept my door open (and continue to do so).  i welcome suggestions and thoughts about how the standards can evolve to better suit the needs of the marketplace, automation, big data, and so on.  in fact, i collaborated with a team of individuals to innovate data vault 2.0 in the first place.  this team included: kent graziano, michael olschimke, sanjay pande, bill inmon, gabor gollnhofer, and a few others…

i didn’t make sweeping changes by myself, or just because i thought it would be a good idea, no – i tested (and tested and tested), and vetted the ideas with my colleagues before announcing (about 2.5 years later) the data vault 2.0 system of business intelligence.

i am more than happy to have you suggest changes, or to hear your ideas.  standards do need to evolve, change, and adapt (hopefully without causing re-engineering efforts). that said, i expect you to apply proper rigor before making suggestions.  below is a list of conditions i expect you to run your changes through – and bring documented results of – before i can consider the change to the greater standard.

  1. test against large volumes of data (these days it must be > 500tb of data).  this number will continue to increase as systems are capable of handling larger data sets.
  2. test against real-time feeds (burst rates of up to 400k transactions per second).  this number too, will continue to increase as systems are capable of handling larger data sets.
  3. test against change data capture and restartability
  4. test against multiple platforms, including (but not limited to) oracle, sqlserver, db2, teradata, mysql, hadoop (hdfs and hive and spark), cloudera, mapr, hortonworks, snowflakedb.
  5. test in multiple coding languages: python, ruby, rails, java, c, c#, c++, perl, sql, php (to name a few)
  6. test in recovery situations: restore, and backup
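condition 3 (change data capture and restartability) in practice means a load pattern must be idempotent: re-running it after a failure must not create duplicates.  a rough sketch of an insert-only, restart-safe hub load – using a python dict as a stand-in for the hub table, with illustrative names:

```python
def load_hub(hub, staged_keys):
    """Insert-only hub load: skip keys already present, so the job
    can be killed and restarted without producing duplicate rows."""
    for key in staged_keys:
        if key not in hub:                     # existence check = the restart guard
            hub[key] = {"load_dts": "2018-06-01"}
    return hub

hub = {}
load_hub(hub, ["K1", "K2"])
load_hub(hub, ["K2", "K3"])    # a "restart" that overlaps the first run
print(sorted(hub))             # ['K1', 'K2', 'K3'] -- no duplicates
```

the same guard (anti-join on the key) is what makes the pattern safe to re-run against any overlap of already-loaded data.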

below is a sample list of questions i typically ask of the change (i track and record metrics around these items):

  1. does it negatively impact the agility or productivity of the team?
  2. can it be automated for 98% or better of all cases put forward?
  3. is it repeatable?
  4. is it consistent?
  5. is it restartable without massive impact? (when it comes to workflow processes)
  6. is it cross-platform?  does it work regardless of platform implementation?
  7. can it be defined once and used many times? (goes back to repeatability)
  8. is it easy to understand and document?  (if not, it will never be maintainable, repeatable, or even automatable)
  9. does it scale without re-engineering? (for example: can the same pattern work for 10 records, as well as 100 billion records without change?)
  10. does it handle alterations / iterations with little to no re-engineering?
  11. can this “model” be found in nature?  (model might be process, data, design, method, or otherwise, nature – means reality, beyond the digital realm)
  12. is it partitionable?  shardable?
  13. does it adhere to mpp mathematics and data distribution?
  14. does it adhere to set logic mathematics?
  15. can it be measured by kpi’s?
  16. is the process / data / method auditable?  if not, what’s required to make it auditable?
  17. does it promote / provide a basis for parallel independent teams?
  18. can it be deployed globally?
  19. can it work on hybrid platforms seamlessly?

and quite a few more.  there are those out there who say: volume and velocity don’t matter…  well, i beg to differ.  volume and velocity (data moving within a fixed time window from point to point) cause architectures, models, and processes to fail – and to be re-engineered at the end of the day.  unless you’ve had this level of exposure (today at the 400tb+ level) you would never have this experience.

if volume and velocity did not matter, we never would have seen the creation of hadoop and nosql in the first place.
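as a concrete illustration of questions 3 and 7 above (repeatable; define once, use many times), pattern-based loading is typically metadata-driven: one generic routine, parameterized per target.  a rough, hypothetical sketch – the metadata shape, names, and generated sql are illustrative only:

```python
# hypothetical metadata: one entry per hub; nothing else changes per target
HUB_METADATA = {
    "HUB_CUSTOMER": {"business_key": "customer_number"},
    "HUB_PRODUCT":  {"business_key": "product_code"},
}

def generate_hub_load(hub_name, meta):
    """Emit the same insert-only load statement for any hub,
    driven purely by metadata (define once, use many times)."""
    bk = meta["business_key"]
    return (
        f"INSERT INTO {hub_name} (hash_key, {bk}, load_dts, record_source)\n"
        f"SELECT s.hash_key, s.{bk}, s.load_dts, s.record_source\n"
        f"FROM stage_{bk} s\n"
        f"LEFT JOIN {hub_name} h ON h.hash_key = s.hash_key\n"
        f"WHERE h.hash_key IS NULL"
    )

for name, meta in HUB_METADATA.items():
    print(generate_hub_load(name, meta))
```

because every hub is loaded by the identical pattern, adding a new hub is a metadata change rather than new code – which is what makes 98%+ automation (question 2) achievable.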

in conclusion…

i welcome suggestions to changing the standards – all i ask is that you put the proper rigor and testing behind the changes first.  one-off cases or one-time changes do not work and will never be accepted as changes to the core standards.  just a refresher: i put in 30,000 test cases between 1990 and 2001, and another 10,000 test cases since then, in order to build common standards that everyone can use to create success in their organizations.

with the advent of data vault 2.0 i have (finally) included the necessary documentation for the methodology, architecture and implementation.  i’ve enhanced the modeling components to meet the needs of big data, hybrid solutions, geographically split solutions, privacy and country regulations.  the changes to the data modeling paradigm (while subtle) are important.

i did not build these standards by myself in a closet somewhere.  i had a team of 5 people at lockheed martin every step of the way, and no – my current competitors were not part of that team.  in fact, they didn’t even know data vault existed at that time, because it was still under development between 1990 and 2001.  that team consisted of: myself, jack, arlen, jackie, and john, all of whom worked for lockheed martin.  i have withheld their last names to protect their privacy.

please note: i have just released the new data vault data modeling standard v2.0.1 free for you.  you can get a copy of it by registering for

coming soon: data vault implementation standard v2.0.0, and a few more!!

have something negative or positive to say?

post a comment below, happy to hear from you directly
(c) copyright dan linstedt, 2018 all rights reserved.


10 Responses to “Want to change or add a #DataVault Standard?”

  1. James Snape 2018/06/04 at 4:50 am #

    “How about a brain surgeon?”

    Interesting analogy. The key point for medical standards is they are researched, as you mention, but also published so others can try to replicate their outcomes. It’s all part of the scientific process. I don’t see a lot of data vault research being published.


    p.s. Medical policies are also decided by the medical community.

  2. Michal 2018/06/04 at 6:05 am #

    Hi Dan,

    The figures and requirements seem big, but probably valid.
    If one can prove things work at that scale it is probably worth considering.

    But what would add value to the community – and give us a way to back our claim that it is as great as we all think – would be to disclose some of these test cases that you have made for DV.
    Are these just textual cases, or something that could be scripted and shown to technical and business people?


  3. Dan Linstedt 2018/06/05 at 4:07 pm #

    True, however it’s a viable medical community with training on the standard operating procedures. Yes you are right, research is published. Part of my new initiative for DataVaultAlliance is to provide the platform to publish the research (and not just from my position) but from everyone who builds.

    Every year at my conference we have independent customers, consultants, and more “publishing” the results of their research and implementations. I ensure that the presenters bring their “warts and all” – in other words: their successes, their current struggles, and failures if they have them. We share, network, and all work together to propose new thoughts, ideas, and so on. This has been going on for the past 6 years, and will only grow. Next year we are bringing WWDVC to Germany, as well as the US (again). I record all the presentations, and every year I offer them on-line for folks to watch and learn from.

    Medical “policies” may be decided by the medical community, however “Standard Operating Procedures” are fairly stable, and are decided only by a small group of officials at the top. Why? because peoples’ lives are at stake.

    It’s also very difficult to take all of the 10 years of research and design that was done at Lockheed Martin for the NSA and Dept of Defense, and get it published. Mostly because Lockheed owns that intellectual property, and after 10 years, I had binders and binders and binders full of paperwork. That said, I am doing my best to get permission to release new research, as well as old research – in the form of standards and so on. It’s coming, it is part of the value proposition of

    Thank you for your feedback, it’s very good to hear.

  4. Dan Linstedt 2018/06/05 at 4:11 pm #

    Yes, there are a lot of interesting developments in automation around test cases, test processes, test procedures (standardized, repeatable, methodological) for Data Vault. They are based on the patterns that exist (always have and always will be). I am working on getting that intellectual property ready for release, but that is going to take quite a bit of effort. As far as publishing the test cases I documented at Lockheed Martin, sadly I am not allowed – as that is the intellectual property of their company. What I can do is publish the testing methodology, the processes, and the standards for developing test cases.

    It’s coming. By the way, I’ve been working with Intact Financial in Canada for the past 5 years, and helped them build their testing harnesses, standards, and procedures over that time. They have presented this year at WWDVC (the video will be available shortly) and discussed their current metrics about time savings, procedures, and best practices.

    Thank you for your valuable feedback, Dan

  5. Maja Ferle 2018/06/14 at 2:30 pm #

    Dear Dan,
    In the new Data Vault Data Modeling Standard v2.0.1 I don’t see any mention of the Dependent Child Key in links, although this topic is covered in the book (section …). Is this no longer part of the standard, and if so, what is the alternative to ensure that a link represents a unique list of relationships?
    Thanks & regards,

  6. Dimitri 2018/07/18 at 8:36 am #

    Hi Dan,

    Do you think using temporal tables (in SQL Server 2017) for a satellite table is a good idea?

    This new SQL Server feature allows you to keep a full history of data changes.

    Please find below a link to a full description of temporal tables in SQL Server:

    Thanks and regards,


  7. Dan Linstedt 2018/07/18 at 1:38 pm #

    Hi Dimitri,

    Thank you for the question. There are pros and cons to letting the database manage temporality and history for you.

    Pros:
    * you no longer have to worry about “end-dating” a record, IF you really think you need end-dates in your queries
    * you don’t have to use Load-Date-Time Stamps or Load-End-Dates (ie: load end dates are dead anyway)

    Cons:
    * Each platform manages temporality differently, and some platforms don’t support it at all, so if you ever end up switching platforms, you will have to “re-engineer” the entire process to make sure it still works on the new platform.
    * There is no mention of the following: what if I issue an UPDATE and NOTHING (not a single column value) in the record has changed? Does it still write a COPY of the exact same row to history? If it does, then there is no delta checking – which means you still have to manage your own delta processing (which is what I expect to do after reading the documentation).
    * OVERLOADING the meaning of the “END DATE”. If you read carefully, it states that DELETING a record will move the record to history and set the end-date. This causes a problem: NOW you have to query for records that exist ONLY in history, but not in the current table, in order to find out which records were deleted. I would much rather have the engine literally capture a third system-driven temporality attribute: DELETED_DATE. Why vendors “don’t get this” is beyond me….
    * No mention of performance over volume. Can this table be loaded in parallel? Will it work with 500 million record loads in a reasonable time? What happens with a real-time stream and two queues wanting to “update” the same record? Currently there are no apparent answers to these questions on the forum. I would encourage you to go back to Microsoft and ask them to demonstrate these features.
    * No mention of what happens when, at T+2, you DELETE a record, then 6 months later you INSERT a new one for the same key. What does the engine do? Does it correctly “change” the end-date of the old record in history? Does it insert a new record into history to show how long the old record was “deleted” for? There is no clear documentation of the workflow for this.
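    The delta-checking gap called out above (an UPDATE where nothing changed) is typically closed in the load process itself: compare a hash of the incoming row’s descriptive columns against the latest stored hash, and only insert when they differ. A rough sketch, using a Python dict as a stand-in for the satellite table (names are illustrative):

```python
import hashlib

def hash_diff(columns):
    """Hash of all descriptive columns, used only for change detection."""
    payload = ";".join(str(c) for c in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def apply_delta(satellite, key, columns):
    """Insert a new satellite row only when something actually changed."""
    incoming = hash_diff(columns)
    history = satellite.setdefault(key, [])
    if not history or history[-1]["hash_diff"] != incoming:
        history.append({"hash_diff": incoming, "columns": list(columns)})
    return satellite

sat = {}
apply_delta(sat, "CUST1", ["Alice", "VT"])
apply_delta(sat, "CUST1", ["Alice", "VT"])   # identical update: no new row
apply_delta(sat, "CUST1", ["Alice", "NY"])   # real change: new row
print(len(sat["CUST1"]))  # 2
```

    With engine-managed temporal tables you would have to verify whether the engine performs this comparison for you; with a hand-built load pattern, the behavior is explicit and testable.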

    It seems to me that the vendor (Microsoft) may not have thought this entire thing through… I’ve seen this before – with Teradata and Oracle, both having implemented “temporal options” in their database without fully defining their best practices, methodology, and without fully testing at scale, and demonstrating exactly how to use these features effectively.

    So all of that said: Feel free to take a look and try to leverage it, I would be curious what needs to be defined to answer my questions above. Let us know how it turns out.

    On the whole – yes, it CAN be used for Satellites. However – IS IT A GOOD IDEA??? That remains to be seen.

    Hope this helps,

  8. Angad Singh 2018/07/24 at 9:52 am #

    Hello Dan,
    First of all, thanks a lot for your new DV 2.0.2 Standard. We already had DV 2.0, but I think this insert-only DV is much faster and more agile than before.

    After reading the spec for DV 2.0.2, I was really puzzled by the loading of the SATs. The first SAT rule (6.1) says “A Satellite can never maintain or create its own surrogate sequence key or its own Hash key”.

    Now as per your book (Building a scalable DV 2.0), you do it using “All Column Matching” approach or “Hash Diff” approach.

    So isn’t this Hash-Diff the same as a Satellite-Hash-Key by definition? Should we have it in 2.0.2 or not?
    It would really help if you could specify a Load Pattern for the 2.0.2 SAT.

    Thanks and Regards


  9. Dan Linstedt 2018/07/30 at 9:05 am #

    The hash difference should never, ever be used as the Satellite primary key. It should only be utilized for difference checking. Furthermore, the Hash Difference is optional, in case you choose instead to compare all columns.

    Hope this helps,

  10. Angad 2019/01/29 at 5:14 am #

    Hi Dan,

    How about if we have an OPS table where we can store the following fields:
    1. BK_HASH
    2. HASH_NAME
    3. RSRC
    4. LDTS
    5. LAST_SEEN

    This should be a generic table that can be used to store all the hashes with every load. It should be INSERT only, storing when the HASH was last seen. We can attach this table to all HUBs and filter based on the respective HASH_NAME.
    select * from HUB_CUSTOMER h
    inner join LKP_OPS_TBL l on (h.CUST_HSH = l.BK_HSH)
    where l.HASH_NAME = 'CUST_HSH'

    This will ensure that we only see those records which are delivered today. Maybe it should be a REF table or LOOKUP table.

    What do you think?
