this is an entry i posted on kimball university in 2009, about data vault and dimensional modeling. i challenged the posters to respond to my questions about their claims, but no one took the challenge. on the other hand, it’s an interesting discussion, and it begs to be reposted here. everything from system of record to use of the data warehouse is addressed, so here goes…
i appreciate the thoughts – and yes, you are right to mention that there is not enough published on the subject. please remember that dr. kimball has published more information than i have to date, and he also has had many more years of marketing, speaking, and sharing than i have. i published the first article/paper on data vault in 2001 on www.tdan.com – it’s a technique that’s only been known to the edw / bi world for 9/10 years now. however, it is a strange phenomenon that people are not discussing the data vault model as freely as the kimball star schema.
anyhow, in case you were wondering: i agree with nick g in some areas – any good enterprise data warehouse should be accountable and auditable, able to pass audits – regardless of the data modeling technique used. yes, the data vault modeling techniques are a hub and spoke design – a mix of normalized modeling components with type 2 dimensional properties in the satellites. it is in fact a hybrid data model – spanning some of the flexibility of 3nf and combining it with some of the techniques of dimensional modeling. but that’s where the similarities stop. too often people confuse the data model implementation aspects with the “approach” of building a data warehouse, or the framework as it were. the framework that the data vault modeling components rely on includes a staging area, an enterprise data warehouse (physical not logical), followed by star schemas, cubes, and other data release mechanisms to properly push the data to the end-users or business. this framework is different than that of a “logical edw” made up of loosely affiliated star schemas combined with a staging area.
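to make the hub-and-spoke shape concrete, here’s a minimal sketch (python + sqlite) of a hub (business keys only), a link (relationships only), and a satellite (type 2 style descriptive history). the customer/order example and every table and column name in it are mine, purely for illustration – not a prescribed standard:

```python
import sqlite3

# hub = business keys only; link = relationships only;
# satellite = descriptive history, one row per change (type 2 style).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   INTEGER PRIMARY KEY,   -- surrogate key
    customer_bk   TEXT NOT NULL UNIQUE,  -- business key
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      INTEGER PRIMARY KEY,
    order_bk      TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- link: a pure many-to-many relationship between hubs
CREATE TABLE link_customer_order (
    link_hk       INTEGER PRIMARY KEY,
    customer_hk   INTEGER NOT NULL REFERENCES hub_customer,
    order_hk      INTEGER NOT NULL REFERENCES hub_order,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- satellite: descriptive attributes hang off the hub, insert-only
CREATE TABLE sat_customer (
    customer_hk   INTEGER NOT NULL REFERENCES hub_customer,
    load_dts      TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2009-01-01', 'crm')")
conn.execute("INSERT INTO sat_customer VALUES (1, '2009-01-01', 'acme', 'denver', 'crm')")
conn.execute("INSERT INTO sat_customer VALUES (1, '2009-06-01', 'acme', 'amsterdam', 'crm')")

# "current" view of the customer = latest satellite row per hub key;
# the older row stays behind as the audit trail.
row = conn.execute("""
    SELECT h.customer_bk, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE s.load_dts = (SELECT MAX(load_dts) FROM sat_customer
                        WHERE customer_hk = h.customer_hk)
""").fetchone()
print(row)  # ('CUST-001', 'amsterdam')
```

notice how changes never overwrite anything – a new satellite row is added, and “current” is just a query over history.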
another difference is where the business rules occur. nick g hit it on the head, again – the data warehouse should be accountable and auditable – which by default means pushing business rules that change/alter data downstream, to “between the data warehouse and the data marts”. this is definitely a principle i agree with.
i can’t speak for why people are not writing about it, but let me weigh in for a minute: dod, us navy, us army, edmonton police, sns bank, world bank, abn-amro, ubs, central bank of indonesia, monash university, central bureau of statistics (netherlands), denver university, city & county of denver, usda, daimler motors, air france, nutreco, microsoft, jp morgan chase, lockheed martin, and the list goes on – these companies are using/building or have used the data vault successfully over the past several years. there must be something new / different that the data model and standards bring to the table, or it wouldn’t be accepted / used by so many companies.
ok, that said: let’s talk about the definition of system of record. here is where i disagree with nick g. i would argue that the definition of what a data warehouse is, and its purpose, has changed over the years – in particular, when it comes to being the only place where real-time information is stored historically, or integrated with the other data. in this case, it becomes the only place where an auditor can actually perform an audit of what data was collected and how it changed along the way; hence it becomes a system of record.
it’s changing again – with a concept called “operational data warehousing”, where the principles of a data warehouse are being pushed and re-defined by the business. the business actually builds an operational application right on top of the data warehouse, edits the data set directly, and creates new records directly – writing the history back to the warehouse in real time. in this case, too, it becomes a system of record. we have three different companies that have built systems like this and are functioning today: one is ing real estate hq, one is a congressional effort for medical records management of the armed forces, one is cendant timeshare resource group. like it or not, the introduction of “real-time data” flowing into and being captured by the data warehouse makes it (by default) a system of record. more often than not the source data (after delivery to the edw) is destroyed on the machine that performed capture… especially if it’s a capture machine outside the edw.
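a rough sketch of the insert-only idea behind that kind of write-back: the operational app never updates in place; every edit lands as a new timestamped version, so the full history – the audit trail – stays in the warehouse. the function names and record shape here are made up for illustration, not a prescribed api:

```python
from datetime import datetime, timezone

# the "warehouse" here is just an append-only list of versions.
history = []

def write_record(business_key, attributes, source):
    """record an edit as a new version; prior versions stay auditable."""
    history.append({
        "business_key": business_key,
        "attributes": dict(attributes),
        "load_dts": datetime.now(timezone.utc).isoformat(),
        "record_source": source,
    })

def current(business_key):
    """latest version wins; the history remains the system of record."""
    versions = [h for h in history if h["business_key"] == business_key]
    return versions[-1]["attributes"] if versions else None

# an operational app "edits" a record by appending a new version
write_record("CUST-001", {"city": "denver"}, "ops-app")
write_record("CUST-001", {"city": "amsterdam"}, "ops-app")
print(current("CUST-001"))  # {'city': 'amsterdam'}
print(len(history))         # 2 -- both versions remain auditable
```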
anyhow, the point being that data warehouses also frequently house data from “old/retired” often dismantled legacy systems. even if the auditor wanted to, they couldn’t audit the source system because it’s been scrapped or junk-heaped somewhere. so in this case, the auditor treats the edw as a system of record, like it or not. i’ve actually been in this situation several times, and passed audits because of the manner in which i treat the data warehouse.
now, all of that is separate from the data modeling constructs used to build your edw. i’ve simply chosen to use the data vault for flexibility and scalability of the systems design. let’s talk about that for a minute…
would you agree that the most scalable architecture (between numa, mpp, smp, etc…) is mpp – massively parallel processing? or would you tend to say that some other architecture is more suitable to economies of scale? it seems to me that mpp is the clear winner over all these others. actually, a hybrid is used these days to support cloud based systems: massive smp (symmetrical multi-processing, i.e. clustered machines) combined with mpp shared-nothing architecture, to create a scale-out, scale-free cloud. anyhow, if the mathematics behind mpp have proven to be true (for example, divide and conquer), then why not utilize this type of architecture within database engines? why not take mpp mathematics and apply them to the basics of data modeling by dividing and conquering, or parallelizing, the database operations?
a first shot at this is called partitioning of the tables, but it has to go deeper – to where the joins live. there are age-old arguments about more joins/less joins and which is better, but at the end of the day it comes down to a balanced approach: how many joins can your hardware/rdbms software execute in parallel effectively? how much data can your machine access in parallel and independent processes? can you effectively make use of both types of partitioning (vertical and horizontal)? can you apply compression across the board and enhance performance? vertical partitioning is what the column-based appliances/databases have been doing for massive performance increases on the query side.
the data vault model makes use of some of the basic ideas buried within the mpp realm – divide and conquer, apply compression to columns where redundancy lives, and apply unique, high-speed indexes with 100% coverage where possible. hence the hub & spoke design. the larger the system (or the larger the number of tables), the more i can split the tables/system up across multiple independent machines. it’s a pure fit for virtualized computing/cloud computing resources. it’s also a scale-free architecture, because i don’t need to store all my data in a single geographical place to get at it quickly. anyhow, the hub and link structures enable the model to join in 100% parallel across mpp hardware/rdbms that is configured properly, and at economies of scale the mpp query will always win over a serialized join (once the machine and database are tuned appropriately).
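a toy illustration of the divide-and-conquer join, with threads standing in for mpp nodes: hash-partition both sides of the join on the key, join each partition independently, and union the results – the answer is identical to a serial join of the full tables. all the data and names here are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4

# toy "hub" and "link" tables: (hub_key, business_key) / (hub_key, order_id)
hub_rows  = [(k, f"BK-{k}") for k in range(100)]
link_rows = [(k % 100, f"order-{k}") for k in range(300)]

def partition(rows, n):
    """hash-partition rows on the join key: same key -> same bucket."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[0]) % n].append(row)
    return buckets

def join_partition(hub_part, link_part):
    """join one partition pair independently of all the others."""
    by_key = {hk: bk for hk, bk in hub_part}
    return [(by_key[hk], oid) for hk, oid in link_part if hk in by_key]

hub_parts = partition(hub_rows, N_PARTITIONS)
link_parts = partition(link_rows, N_PARTITIONS)

# each partition pair could run on its own node; threads stand in here
with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
    results = pool.map(join_partition, hub_parts, link_parts)

joined = [row for part in results for row in part]
print(len(joined))  # 300 -- same result set as an unpartitioned join
```

the key property is that hashing both sides on the join key guarantees matching rows land in the same bucket, so no partition ever needs to see another partition’s data – which is exactly what lets a shared-nothing mpp engine run the join in parallel.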
anyhow, there’s a lot more to this than meets the eye. it may look like a simple data model – but it’s the reasons why we do what we do that make it powerful. it’s the purpose behind the architecture that is currently driving companies like the netherlands tax authority and the dod to select the data vault for use in some of their largest (data-set-wise) projects.
now folks might ask: well, isn’t it overkill for small data warehouses? maybe, maybe not – it depends on the cost of modifying a “stage/star schema” architecture down the road, weighed against the cost of modifications to a data vault. hence a company like philips electronics is also undertaking a data vault effort to consolidate or centralize raw data feeds from all across their divisions.
what i would say is this: if you’re happy building star schemas as your data warehouse, great – if it’s working for you, wonderful – don’t change. but if you run into situations that require heavy re-engineering, cause your project to “stop & restart” or be completely rebuilt every 2 to 3 years, then you might want to look at the data vault as an alternative. again, if you have little to no pain in the current architecture you are using, then don’t change – don’t fix what’s not broken….
i hope this helps, as always – take what you like, use the good things that are there. take the time to investigate new ideas, throw away the rest. after all, this is just more information. and remember: the world needs star schemas, cubes, and alternative release mechanisms – we couldn’t feed the business needs without them! i’m merely proposing the data vault model as an enterprise data warehouse layer, not as a data delivery layer… leave that to the expert: dr. ralph kimball.
please feel free to contact me directly if you wish.
i appreciate your thoughts on these ideas, and hope that you will comment on these posts with your experiences. i really want you to succeed, and i’d love to help you succeed, but i can’t do that if i don’t hear from you! so if you’re stuck, or feeling your implementation of the data vault hasn’t gone as well as you’d hoped or expected, then please – contact me today, let’s see if we can’t get it back on track and working for you instead of against you.
hope this helps,