even after my last post about standards, rules, and procedures, there are still a number of people in the industry who believe it's ok to “modify the structure” of the data vault, break the rules and standards, and still call the result a data vault. in this post, i will cover why the fundamental standards of the data vault are important. if you believe you have a modification that will make the data vault better, i will explain the process i expect you to go through (publicly) in order to demonstrate to me that it is worth taking a look at.
where did it all begin?
it all began in 1990, when i noticed there were no documented best practices, no rules, and no standards for building, architecting, and implementing enterprise data warehouses. i knew from experience that bending and changing the existing rules for data modeling simply wouldn't fit the bill.
you see, back then i was doing to 3rd normal form modeling what some people want to do now to the data vault. i was breaking the rules defined by codd & date, and modifying the architecture to force-fit it to the needs of the data warehouse. i was breaking more rules by applying common sdlc project management to data warehousing, and changing sdlc to meet the needs of the project.
3nf was designed for oltp. all of its rule sets were made for oltp, and none of them met the adaptations needed for data warehousing. i found this out the hard way! after breaking the rules and changing the architecture and the design, i ended up (3 months later) with a data warehouse that already required changes. along the way, i began to realize that the more rules and standards i broke (to make it fit with data warehousing), the harder the data model and the project became to maintain.
i then tried star schema and, much to my chagrin, got the same result… well, if you don't change where you're going, you will end up where you are headed!
so what happened next?
i decided that there were some really good parts of the architecture from both 3nf and star schema that worked within the bounds of a data warehouse. there were also some really good parts of sdlc that worked. so i took the best-of-breed approaches and put them in a pot to stir up a new batch of architecture and methodology.
in reality, i realized that in order to “come up” with the right rules and standards, i had to follow a scientific approach. in other words, i needed to develop a rule, then test it against a set of criteria that all data warehouses face! without some form of scientific test and control experiment, i would not get any plausible results, nor be able to test for failures.
if you wish to change, alter, or modify the data vault rules and standards – and still call it a data vault – then you need to follow the same process! only after you submit your results to me can i possibly consider a change to the architecture and/or the rule sets.
so what are the criteria you tested with?
the following are the criteria that i tested with, and that you must test with in order to submit for architectural changes or rule changes to the data vault model and/or methodology:
- scalability –
- model: please arrange 3 test cases to ensure your solution will scale. today's scalability test should take the architecture/model to 300 terabytes without breaking, and a theory should be formed about what happens at 500 terabytes, 1 petabyte, 2 petabytes, and 3 petabytes.
- methodology: please arrange scale-up and scale-down of your team resources, or your project approach. monitor and measure the results of estimated hours versus actual hours, and also watch the complexity rating (use a scoring mechanism for this; if you don't have one, look at function point analysis).
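the measurement described above can be sketched in a few lines. this is a hypothetical illustration only: the function names and the scoring weights are mine, not part of the data vault standard, and a real complexity rating would come from a full function point analysis.

```python
# hypothetical sketch: tracking estimated vs. actual hours, plus a crude
# function-point-style complexity rating for a methodology test.
# the weight on exceptions is illustrative, not a published standard.

def estimate_variance(estimated_hours: float, actual_hours: float) -> float:
    """percentage deviation of actual effort from the estimate."""
    return (actual_hours - estimated_hours) / estimated_hours * 100.0

def complexity_rating(inputs: int, outputs: int, exceptions: int) -> int:
    """very rough score; every special-case 'only do this when...'
    exception is weighted heavily, because exceptions are what
    make a method hard to measure and automate."""
    return inputs + outputs + 5 * exceptions

print(estimate_variance(40, 50))    # 25.0 -> ran 25% over estimate
print(complexity_rating(3, 2, 4))   # 25   -> exception-heavy score
```

tracked across several scale-up and scale-down runs, numbers like these make the methodology measurable instead of anecdotal.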
- flexibility –
- model: please arrange 3 test cases for change to different parts of the data model. the most important outcome is zero re-engineering of up-stream and down-stream processes (including other parts of the data model, querying processes, loading processes, real-time loads, re-indexing, database activities, etc.). there is a lot to consider here! be thorough and complete.
- methodology: please ensure that the changes you make can be accomplished within 2 hours on the data model, and that any new loading processes and any new queries can be written and unit-tested within 8 hours (a standard working day).
- any deviation from the numbers above, and your proposed change fails the test!
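the flexibility gate above is strictly pass/fail, so it can be expressed as a simple check. a minimal sketch, assuming the 2-hour and 8-hour limits from the post; the function and constant names are hypothetical:

```python
# hypothetical sketch of the flexibility pass/fail gate.
# the thresholds come from the criteria above; any deviation fails.

MODEL_CHANGE_LIMIT_HOURS = 2.0
LOAD_AND_QUERY_LIMIT_HOURS = 8.0

def flexibility_test_passes(model_change_hours: float,
                            load_and_query_hours: float) -> bool:
    """both limits must hold -- there is no partial credit."""
    return (model_change_hours <= MODEL_CHANGE_LIMIT_HOURS
            and load_and_query_hours <= LOAD_AND_QUERY_LIMIT_HOURS)

print(flexibility_test_passes(1.5, 7.0))   # True  -> within both limits
print(flexibility_test_passes(2.5, 7.0))   # False -> model change too slow
```

the point of making the gate binary is that there is no room to argue a "close enough" result into the standard.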
- repeatability / redundancy
- model: please arrange 3 test cases to apply the change to the entire model. any change to the fundamental structure must be applied to the entire model. what's done to one link must be done to all links – and still work! the same goes for hubs and satellites. the test case must prove that it can work for all data vault models for all time, otherwise it is not a change worth discussing. if there is a “condition” setup, like a transaction link, then the condition must be weighed against the other standards. (remember, even with a transaction link there are no core modifications to the fundamental link structure – especially the primary key, business keys, and driving keys). too many conditions or exceptions will raise the complexity rating, and it will fail the flexibility test!!
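the "what's done to one link must be done to all links" rule can itself be automated. this is a hypothetical sketch, assuming a toy in-memory representation of link metadata; the entity names and the structure check are illustrative, not a real data vault toolset:

```python
# hypothetical sketch of the repeatability check: a proposed structural
# rule must hold for EVERY entity of its type (every link, every hub,
# every satellite), not just one hand-picked special case.

from typing import Callable, Dict

links: Dict[str, dict] = {
    "link_customer_order": {"pk": "link_key",
                            "business_keys": ["customer_id", "order_id"]},
    "link_order_product":  {"pk": "link_key",
                            "business_keys": ["order_id", "product_id"]},
}

def rule_holds_everywhere(entities: Dict[str, dict],
                          rule: Callable[[dict], bool]) -> bool:
    """pass only if the rule works for ALL entities of the type."""
    return all(rule(entity) for entity in entities.values())

# example rule: the core link structure (surrogate pk plus at least
# two business keys) must remain intact after the proposed change.
def core_structure_intact(link: dict) -> bool:
    return "pk" in link and len(link["business_keys"]) >= 2

print(rule_holds_everywhere(links, core_structure_intact))  # True
```

a change that passes on one link but fails on another is exactly the kind of special case that inflates the complexity rating.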
- methodology: please ensure that the changes to the rules can be monitored, measured, and optimized. again, if you put too many “exceptions” around the methodology for implementation (for instance: only do this when…), then the method becomes unruly and impossible to measure. it goes without saying: if you can't measure it, you can't monitor it. if you can't monitor it, you can't optimize it. if you can't optimize it, you can't accomplish the task repeatably – you can't generate or automate it!!
- business value
- model: if you cannot see, clearly explain, or measure the business value of the new attribute or change to the model, then it fails the case – and has no place in the data vault at all. you must clearly document the business value – in general, no matter what the business – everyone using the data vault must benefit in order for the change to pass the test.
- methodology: the business value is all in the measurement, optimization, and transparency. if your new process is secretive, then it cannot be open to review from the business. we in it are business people and it’s high time we started acting like it. we need to run it like a business (because that’s what it is), and in doing so our methodology must be transparent to our customers. if you cannot justify the business value of the process you are adding to the methodology, then it has no value and does not belong in the data vault methodology or project plan.
ultimately, all of these tests need to be applied across the board. they need to work in batch loads, real-time loads, real-time queries, data mining, alerts, and mixed workload (real-time & batch at the same time, while querying). they need to pass physical performance tests in order to be allowed into the standards. they also need to be technology independent. it shouldn't matter (to the new standard) whether the model is built on ssd, in ram, on a columnar db, a relational db, or even a nosql db.
i welcome thoughts, comments, and ideas about how and when to change the data vault model – one of the current items under heavy review and scrutiny is, in fact, end-dates in links. however, i would politely request, and encourage you, to go through the scientific process – as i did. that's what i spent 1990 to 2000 doing: research and design of both the methodology and the data model standards.
it's because i spent the time to test it that it is currently stable, strong, and proving itself as a viable success in businesses around the world today.
this is also why i say to you: if the data model or methodology does not follow the standards, then it cannot be called a data vault.
i hope this clears the air, and i welcome discussion (both here and on the forums) about all these topics.