In recent times there have been several discussions around the standards of the Data Vault modeling components. In fact, this isn't the first time (nor, I expect, the last) that the standards will be challenged. That said, I feel it necessary to discuss just what I put the standards through before publishing, in the hope that if you feel the urge to suggest changes, you can apply the same rigor I do before telling the user group that "I'm wrong, the standards don't work, and here's a change." This post covers the reasoning behind, the insight into, and the rigor applied to the standards (all standards) for the Data Vault 1.0 model and the Data Vault 2.0 model.
Scope of This Post
This post is highly focused on the process for building and accepting standards around the Data Vault model. It does not cover the standards processes for methodology, implementation, or architecture; those are separate topics.
The standards of the Data Vault model (both 1.0 and 2.0) are tested in rigorous situations. Many questions are asked, and many challenges are put to the structures, the design, the attributes, and their definitions, all in the name of producing a standard that achieves business value. The hope is that the standards play a pivotal role in getting Data Vault models built consistently, appropriately, and repeatably (following pattern-based design), and steer projects safely around the potholes that so often force re-design and re-engineering in the marketplace known as data warehousing.
The Background of the Current Standards
Before launching into a short discussion of how the standards are constructed, I want to share with you why the current standards are the way they are today.
I spent the ten years from 1990 to 2000 (long before publishing in 2001) researching, testing, designing, and re-designing the base standards around the model, the methodology, the architecture, and the implementation. However, since the focus of this post is the Data Vault model, I will stick to those portions of the discussion.
In those ten years, I tested what was then "big data" (volume, velocity, and variety): internal and external feeds, web-service-provided data sets, FTP, COBOL and mainframe extracts, relational extracts, and yes, even document extracts. By "tested" I mean: I put these types of data into the design and tried intentionally to break the system, i.e., break the ETL, break the ELT, break the bulk load, break the database, break the staging area, break the data model, break the relationships, break the keys, break the user-defined data set, break the temporality (i.e., load dates), and so on.
You see, I spent part of my previous life (prior to data warehousing) in quality assurance, in a lead QA role for a compiler company. If I failed at my testing job, or the tests I wrote didn't provide enough coverage, chances were the compiler would fail in the field. Not a pretty situation, especially since compilers by nature are meant to have a very low tolerance for failure. In other words, not only was I trained to write test cases, I was responsible, as lead in the QA department, for sign-off before the compiler could be approved for release to end customers. All this to say: I had learned a thing or two about regression testing, and about white-box (clear-box) and black-box testing.
Now, the Data Vault model needed testing, as did the load routines and everything else I mentioned a moment ago. Unfortunately, I would have to write a full book to explain all the tests I put the model through, and I don't have time for that. So, in short, I will attempt to explain the base-level questions that the components and the attributes must answer in order to qualify as part of the modeling paradigm.
If you can put your own suggested changes to the Data Vault modeling standards through these questions, test appropriately, and prove the results, then I would be happy to examine those results, and a change to the standards may well be warranted. That is how I arrived at the Data Vault 2.0 modeling changes in the first place.
Questions to Ask of Proposed Standards Changes
Soon, I will write a white paper detailing the load date and the load end date: their purpose and their necessity. From that perspective, I will dive into the details of the specific element known as the load date: why it is, what it is, what it is not, and why it should never, ever be changed. For now, if you have a desire or a need to suggest changes to the entire user community around Data Vault modeling, please be aware of the following questions, as sooner or later I will ask you for the results, or I will challenge your assertions as you challenge mine.
I expect nothing less of those of you wanting to change the standards. At the end of the day, if the new proposal you make to the entire user group fails any one of these challenges, then you are putting the future of the entire Data Vault model at risk of failure (i.e., re-design or re-engineering), and putting the entire effort of those who listen to your suggestion at risk of failure, due to a break in the consistency and applicability of your suggested change.
Here are the questions you will want to test and ask. Above all, everything should work without "if" rules:

- Does the new proposed standard work with external data?
- Does the new proposed standard work with real-time data?
- Will the new proposed standard work with unstructured data?
- Will the new proposed standard work with structured, multi-structured, or semi-structured data?
- Will the new proposed standard cause cascading change impacts in the model?
- Does the new proposed standard introduce temporal (single or multi) confusion?
- Will the new proposed standard introduce more foreign keys?
- Does the new proposed standard fit in the Business Vault or the Raw Data Vault?
- Is the benefit of the new proposed standard business or technically driven (or both)?
- Will the new standard work at 1 row and at 100 billion rows (in the same table structure) without forcing re-design?
- Does the new proposed standard inhibit high-speed loading?
- Does the new proposed standard introduce dependencies in the loading process, or remove them?
- Is the new proposed standard a physical improvement for performance, a logical change for business value, or does it fit both physical and logical designs?
- Does the new proposed standard improve query speed over 100 million rows (in a single table), or does it cause performance problems in this case?
- Does the new proposed standard work for any feed from any source system at any time, or does it *require* certain technology to be in place upstream of the Data Vault in order to work?
- Does the new proposed standard introduce Cartesian products in queries? If so, how and why?
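The "no if rules" criterion at the top of the list can be sketched in code. The following is a minimal, hypothetical illustration (the table and column names are my own invention, not part of the standard): a single hub-load pattern that is applied unchanged to every feed, with no per-source conditional logic and no dependency on load order.

```python
import sqlite3

# A hypothetical hub table: business key, first-seen load date, record source.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_bk   TEXT PRIMARY KEY,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    )
""")

def load_hub(conn, business_keys, load_date, record_source):
    """One pattern for every feed: insert only business keys not yet
    present. No "if source == X" branches anywhere in the routine."""
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?)",
        [(bk, load_date, record_source) for bk in business_keys],
    )

# Two different feeds, same routine; the overlap (C2) is simply ignored.
load_hub(conn, ["C1", "C2"], "2024-01-01", "crm")
load_hub(conn, ["C2", "C3"], "2024-01-02", "web")
count = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
print(count)  # 3
```

The point of the sketch is repeatability: the routine neither knows nor cares which source supplied the keys, which is what makes the load pattern testable at 1 row or 100 billion rows without re-design.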
In reality, there are about 100 more testing questions that get you into the weeds. Very laborious indeed. These are just some of the fundamental questions I put the modeling techniques through before announcing them as part of the standard. It is also why Data Vault 2.0 "stayed in the labs" for at least two years before being fully released to the public. Testing, testing, and more testing.
Remember: any standard you suggest to the worldwide user group can cause grave damage if it doesn't work at volume, velocity, and variety, or if it requires specific software or hardware in order to execute properly.
I put the original Data Vault 1.0 model through these tests and many more, in practice and in production. I don't come at Data Vault from a purely theoretical approach. Everything I teach, everything I espouse, everything I build, everything I talk about, write about, or standardize, I have put into practice in production systems (usually at large scale, with lots of active data and big teams).
I encourage you, if you feel the need to "tell users to change the defined standard," to prove to the user group that you've answered these questions: explain where, why, and how you performed these tests, and what the results truly are. Because if you don't, I will challenge you to bring answers; you run the risk of suggesting breaks or "workarounds" that ultimately fail in the consumers' eyes.
Now, that said, I am constantly challenged by new customers and new cases. I am very happy to say that DV2.0 marks the first time in 14 years of production use that the modeling standards have changed, due to a performance break at high volume, and to dependencies that are not sustainable at extreme volume across multi-platform loads. The break? Sequence numbers. Yep, the very mechanism that "improves query performance" at smaller volumes has a re-design consequence in massive-volume systems (something we could not foresee, and could not test until recently, with the advent of Hadoop).
Hence the one and only change to the Data Vault 2.0 model: the replacement of sequence numbers with hash keys. Mind you, there is a specific set of rules and standards around how this works, and I've written an eight-page white paper that all my students receive when they go through my courses (in person or online).
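To see why hash keys remove the loading dependency that sequence numbers created, consider a sketch. This is only an illustration of the general idea, not the rule set from the white paper; the normalization steps shown (trim, uppercase, delimiter) are my own assumptions.

```python
import hashlib

def hash_key(*business_key_parts, delimiter="||"):
    """Derive a hub/link key deterministically from the business key.

    Normalization here (trim + uppercase + delimiter between parts) is a
    hypothetical choice; the authoritative rules live in the DV2.0
    standards, not in this sketch.
    """
    normalized = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same key, on any platform,
# with no central sequence generator to wait on -- so hub, link, and
# satellite loads can run in parallel across systems.
print(hash_key("CUST-1001") == hash_key("  cust-1001 "))  # True
```

Because the key is computed from the data itself, two platforms loading the same feed independently arrive at the same key without coordinating, which is exactly the dependency a shared sequence number cannot avoid.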
So, in closing, all I ask is this: if you are going to suggest changes to the modeling paradigm and standards, please test your suggestions before making blanket statements in public about how the standards I propose are wrong or don't work.
And remember: load dates will be covered in a white paper, due out soon. There you will get an in-depth look at how to test a single element for use within the standards of the Data Vault modeling paradigm.
With best regards,