for many years, i have built, authored and maintained the #datavault standards. this includes data vault 1.0, and data vault 2.0. there are others in the community who believe that “these standards should evolve and be changed by consensus of the general public”.
i have a number of issues with this approach. in this article i will describe what it takes to author a standard for the data vault 2.0 system of business intelligence. you are certainly more than welcome to contribute to the standards body of knowledge around data vault, i simply want contributions to be held to the highest level of integrity.
why people insist on “breaking the rules and standards” i set forth is beyond me. would you trust a heart surgeon who has never been to school for proper training (standard methods and procedures) to operate on your heart? how about a brain surgeon? of course, it goes without saying that when your life depends on it (whatever it is, from a car functioning properly in a crash to an airplane flying according to the laws of physics), all of a sudden good standards matter.
with any set of standards there are the purist standards (those that i document) and the pragmatic standards (those that contain minor alterations or deviations). the bigger the gap between purist standards and pragmatic standards, the more likely the project / process / design will fail under stress.
the issue isn’t necessarily the alteration itself; it’s the lack of rigor applied to testing the proposed alteration that eventually results in failure.
there are some cases, on specific projects, where i have vetted and approved minor alterations for a pragmatic approach to implementing data vault. one such case is teradata. because of the way its relational engine works, a hash key is not necessary and, by the way, neither is a surrogate sequence identifier. teradata can and does hash its primary key / business key under the covers. this is an optimization not made by most other platforms (except sap hana).
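on platforms that do not hash the business key under the covers, the hash key has to be derived explicitly. the following is only a minimal sketch of that idea, not the official standard: the function name `hash_key`, the choice of md5, the `||` delimiter, and the trim-and-uppercase normalization are all assumptions made for illustration.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Derive a deterministic hash key from one or more business key parts.

    Hypothetical helper: md5, the delimiter, and the normalization rules
    are illustrative assumptions, not taken from the published standard.
    """
    # Normalize so that ' cust-1001 ' and 'CUST-1001' map to the same key.
    normalized = [part.strip().upper() for part in business_key_parts]
    joined = delimiter.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# The same business key yields the same hash key on every run and platform,
# without asking a sequence generator for the next value.
customer_key = hash_key("cust-1001")
same_customer_key = hash_key("  CUST-1001 ")

# A composite business key goes through the same function; the delimiter
# keeps ("a", "b") distinct from ("ab",).
composite_key = hash_key("cust-1001", "ord-42")
```

because the key is computed from the business key alone, two independent loads can arrive at the same key without coordinating with each other, which a sequence cannot offer.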
most of the time, however, the standards as i have defined them must stay in place, or some part of the architecture, methodology, model, or implementation will suffer (in some cases, multiple parts will suffer). then come my competitors, whom i originally taught data vault 1.0. they make claims as they see fit. i’ve made a list of some of their false claims below:
poor judgement claims made in the market:
- a link can be a hub
- a hub can have a foreign key to another hub
- a satellite can have its own sequence id / primary key identifier
- sequences are fine to continue utilizing, you don’t need hash keys
- standards for data vault should be managed by consensus, and by the community at large
- satellites can have more than one foreign key to more than one parent structure
- you don’t need change data capture
- data vault 2.0 is nothing more than a change from sequences to hash keys in the modeling level
and more! some are far too outlandish to list here, they would simply provide a good laugh.
want to suggest a change to the standards?
i am not saying you cannot suggest changes. i have always kept my door open (and continue to do so). i welcome suggestions and thoughts around how the standards can evolve to better suit the needs of the marketplace, automation, big data, and so on. in fact, i collaborated with a team of individuals to innovate data vault 2.0 in the first place. this team included: kent graziano, michael olschimke, sanjay pande, bill inmon, gabor gollnhofer, and a few others…
i didn’t make sweeping changes by myself, or just because i thought it would be a good idea, no – i tested (and tested and tested), and vetted the ideas with my colleagues before announcing (about 2.5 years later) the data vault 2.0 system of business intelligence.
i am more than happy to have you suggest changes, or to hear your ideas. standards do need to evolve, change, adapt (hopefully without causing re-engineering efforts). that said, i expect you to apply proper rigor before making suggestions. below is a list of conditions i expect you to run your changes through, and to bring documented results of, before i can consider the change for the greater standard.
- test against large volumes of data (these days it must be > 500tb of data). this number will continue to increase as systems become capable of handling larger data sets.
- test against real-time feeds (burst rates of up to 400k transactions per second). this number, too, will continue to increase as systems become capable of handling larger data sets.
- test against change data capture and restartability
- test against multiple platforms, including (but not limited to) oracle, sql server, db2, teradata, mysql, hadoop (hdfs, hive, and spark), cloudera, mapr, hortonworks, snowflakedb.
- test in multiple coding languages: python, ruby, rails, java, c, c#, c++, perl, sql, php (to name a few)
- test in recovery situations: restore, and backup
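as a rough illustration of what a restartability test from the list above might look like at toy scale, here is a minimal sketch. everything in it is hypothetical: the in-memory `dict` standing in for a hub table, the `load_hub` function, the simulated failure, and the md5-based `hash_key` helper are assumptions for illustration only. a real test would run against the volumes and platforms named in the list.

```python
import hashlib


def hash_key(business_key: str) -> str:
    # Hypothetical helper: md5 over the trimmed, upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


def load_hub(hub: dict, batch: list, fail_after=None) -> None:
    """Insert-if-absent load of business keys into a toy in-memory hub."""
    for position, business_key in enumerate(batch):
        if fail_after is not None and position == fail_after:
            raise RuntimeError("simulated mid-batch failure")
        # setdefault leaves existing rows untouched, so replays are harmless.
        hub.setdefault(hash_key(business_key), business_key.strip().upper())


batch = ["cust-1", "cust-2", "cust-3", "cust-2"]

# Reference run: the load completes without interruption.
clean = {}
load_hub(clean, batch)

# Interrupted run: fail part way through, then replay the whole batch.
restarted = {}
try:
    load_hub(restarted, batch, fail_after=2)
except RuntimeError:
    pass
load_hub(restarted, batch)

# Restartable means the replayed load ends in the same state as the clean one.
assert restarted == clean
```

the point of the sketch is the final comparison: a pattern passes only if a failed-and-replayed load is indistinguishable from a load that never failed.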
below is a sample list of questions i typically ask of the change (i track and record metrics around these items):
- does it negatively impact the agility or productivity of the team?
- can it be automated for 98% or better of all cases put forward?
- is it repeatable?
- is it consistent?
- is it restartable without massive impact? (when it comes to workflow processes)
- is it cross-platform? does it work regardless of platform implementation?
- can it be defined once and used many times? (goes back to repeatability)
- is it easy to understand and document? (if not, it will never be maintainable, repeatable, or even automatable)
- does it scale without re-engineering? (for example: can the same pattern work for 10 records, as well as 100 billion records without change?)
- does it handle alterations / iterations with little to no re-engineering?
- can this “model” be found in nature? (model might be process, data, design, method, or otherwise, nature – means reality, beyond the digital realm)
- is it partitionable? shardable?
- does it adhere to mpp mathematics and data distribution?
- does it adhere to set logic mathematics?
- can it be measured by kpis?
- is the process / data / method auditable? if not, what’s required to make it auditable?
- does it promote / provide a basis for parallel independent teams?
- can it be deployed globally?
- can it work on hybrid platforms seamlessly?
and quite a few more. there are those out there who say: volume and velocity don’t matter… well i beg to differ. volume and velocity (data moving within a fixed time window from point to point) cause architectures, models, and processes to fail – having to be re-engineered at the end of the day. unless you’ve had this level of exposure (today at the 400tb + level) you would never have this experience.
if volume and velocity did not matter, we never would have seen the creation of hadoop and nosql in the first place.
i welcome suggestions for changing the standards; all i ask is that you put the proper rigor and testing behind the changes first. one-off cases or one-time changes do not work and will never be accepted as changes to the core standards. just a refresher: i put in 30,000 test cases between 1990 and 2001, and another 10,000 test cases since then, in order to build common standards that everyone can use to create successes in their organization.
with the advent of data vault 2.0 i have (finally) included the necessary documentation for the methodology, architecture and implementation. i’ve enhanced the modeling components to meet the needs of big data, hybrid solutions, geographically split solutions, privacy and country regulations. the changes to the data modeling paradigm (while subtle) are important.
i did not build these standards by myself in a closet somewhere. i had a team of 5 people at lockheed martin every step of the way, and no, my current competitors were not part of that team. in fact, they didn’t even know data vault existed at that time, because it was still under development between 1990 and 2001. that team consisted of myself, jack, arlen, jackie, and john, all of whom worked for lockheed martin. i have withheld their last names to protect their privacy.
please note: i have just released the new data vault data modeling standard v2.0.1, free for you. you can get a copy of it by registering at http://datavaultalliance.com
coming soon: data vault implementation standard v2.0.0, and a few more!!
have something negative or positive to say?
post a comment below, happy to hear from you directly
(c) copyright dan linstedt, 2018 all rights reserved.