i just received an email with a question about data vault versus dimensional modeling. there have been many questions like this along the way, from all parts of the world. in this entry, i’ll post the question – and i invite you to comment with your thoughts about how you might answer. i too will chime in and give my opinion on the matter.
and now, for the questions:
i’m working as a project manager/maintenance leader for the dw of a large retailer in . with this mail i hope to get your recommendations/thoughts on a project we are going to implement in the near future.
we have a dw where (new) parts are built on the data vault principle (with an etl engine) and others on the dimensional modeling principle. we have just finished a pilot where we aimed to rebuild one flow from the “old school” (dimensional and stored procs) to the data vault (and etl). the source data seemed rather simple, and almost no users or applications used the data.
the findings were:
– the “new school” gave a lot more tables, which meant the user applications had to be rebuilt as well (more risk)
– the rebuild itself was a lot more complex than first estimated. and yes, that is not the fault of the data vault principles, but the new school still seems more time consuming than the dimensional model as a first investment (yes, probably the winnings will come later).
and now it seems to me that we should keep the old school principle, but rebuild it in etl to give the solution better monitoring and alarm functionality, and also so it is no longer a single-point-of-failure (spof) solution – since i do not really see the business profit in rebuilding and changing principles. but what will happen to the dw if we have some parts that are built on data vault and others on dimensional principles? is that a future problem?
let’s address the findings:
– new school (data vault) gave a lot more tables, which meant the user applications had to be rebuilt as well…
#1: the data vault model is for use with a data warehouse. the data warehouse should be a back-end system of historical storage and passive integration of business keys. the data vault model is *not* geared to be used by operational applications directly, unless a message bus (eai or queuing mechanisms) is used to pass messages back out in operational format to the application.
#2: the data vault model should *not* be accessed directly by business intelligence applications. the data vault model is built to be a back-end data warehouse, and should serve as the enterprise memory store that provides you and your team with the ability to quickly build new data marts (star schemas). the bi applications should still source the star schemas, and should not source the data vault.
#3: the data vault is built for flexibility. because all relationships are broken out into link tables, you are bound to end up with “more tables” than in star schema modeling. this, however, is one of the flexibility components of the modeling technique. more joins aren’t necessarily a bad thing, especially if you focus on the nature of mpp: parallelism, massively parallel loads, and massively parallel queries – which gives rise to scalability.
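to make the “more tables” point concrete, here’s a minimal sketch in python with sqlite. all of the table and column names here are invented for illustration (not taken from the reader’s project) – the point is only how keys, relationships, and context split across hubs, links, and satellites:

```python
import sqlite3

# a minimal hub/link sketch: business keys live in hubs, and every
# relationship is its own link table. the names are made up for this example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer  (customer_hk TEXT PRIMARY KEY, customer_bk TEXT);
CREATE TABLE hub_product   (product_hk  TEXT PRIMARY KEY, product_bk  TEXT);
-- each relationship gets its own link table; a new relationship later
-- becomes a new link table, with no change to the existing structures
CREATE TABLE link_purchase (customer_hk TEXT, product_hk TEXT, load_dts TEXT);
INSERT INTO hub_customer  VALUES ('c1', 'CUST-001');
INSERT INTO hub_product   VALUES ('p1', 'SKU-9');
INSERT INTO link_purchase VALUES ('c1', 'p1', '2024-01-01');
""")

# queries are join-heavy, but each join is on a single key column,
# which is exactly the pattern mpp engines parallelize well
pairs = con.execute("""
    SELECT h1.customer_bk, h2.product_bk
    FROM link_purchase l
    JOIN hub_customer h1 ON h1.customer_hk = l.customer_hk
    JOIN hub_product  h2 ON h2.product_hk  = l.product_hk
""").fetchall()
print(pairs)  # -> [('CUST-001', 'SKU-9')]
```

the payoff of the split is that adding a new relationship later means adding a new link table, not altering the hubs and satellites you already have.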
i address these concepts in depth in my classes that i teach, along with the coaching area that i have on-line. you can take direct on-line classes from me at: http://datavaultalliance.com
– the rebuild itself was a lot more complex than first estimated.
#4: more tables (in the case of the data vault) should make things less complicated and much easier, faster, and simpler to build – both in the loading cycle and the querying cycle. it’s a division of work. in fact, if the standards have been followed correctly for the data vault model, then you should be in a position to generate 90% of your etl routines for loading into the data vault, as well as the staging area. you should also be able to generate the baseline loads into raw star schemas and, from there, modify only the routines that need extra business logic. this should account for easier etl, not more complex etl.
if you have a more complex system, then something has gone wrong – either the standards are not implemented properly, or the model isn’t built correctly, or the data vault is exposed to the business intelligence applications.
in terms of generating etl, there are several tools on the market today that do this for the data vault, all you need to do is search the web.
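as a sketch of what “generating etl” means in practice, here’s a toy metadata-driven generator in python. the metadata shape and all table/column names are assumptions for illustration – real tools work from richer metadata, but the idea is the same: every hub load follows one repeatable pattern, so the sql can be emitted rather than hand-written:

```python
# toy metadata: which hash key, business key, and staging table feed each hub.
# these names are invented for the sketch, not a prescribed layout.
HUB_META = {
    "hub_customer": {"hk": "customer_hk", "bk": "customer_bk", "stage": "stg_orders"},
    "hub_product":  {"hk": "product_hk",  "bk": "product_bk",  "stage": "stg_orders"},
}

def hub_load_sql(hub: str, m: dict) -> str:
    # every hub load is the same pattern: insert only the business keys
    # that are not already present in the hub
    return (
        f"INSERT INTO {hub} ({m['hk']}, {m['bk']}, load_dts, rec_src)\n"
        f"SELECT DISTINCT s.{m['hk']}, s.{m['bk']}, s.load_dts, s.rec_src\n"
        f"FROM {m['stage']} s\n"
        f"LEFT JOIN {hub} h ON h.{m['hk']} = s.{m['hk']}\n"
        f"WHERE h.{m['hk']} IS NULL;"
    )

# one generated routine per hub, with no hand-written etl
routines = {hub: hub_load_sql(hub, m) for hub, m in HUB_META.items()}
print(routines["hub_customer"].splitlines()[0])
# -> INSERT INTO hub_customer (customer_hk, customer_bk, load_dts, rec_src)
```

the same trick extends to links and satellites: one template per table type, driven by metadata, which is where the 90% generation figure comes from.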
and now it seems to me that we shall keep the old school principle
fair enough, if that’s truly what you want to do… but i will suggest to you that you might run into the following problems:
- scalability issues. star schemas don’t scale very well into the hundred-plus terabyte ranges. if you have massive amounts of data for a star schema, you might want to think about netezza, a columnar db, or a no-sql hadoop backend. the data vault, however, allows you (due to the architecture) to scale into the petabyte ranges if so desired.
- flexibility issues. the larger the star becomes, the more conformed you try to make it – which results in hugely complex etl and extremely slow loading times. it also results in incredibly long turn-around cycles for your business users. i’ve seen it over and over again: eventually the weight of this turn-around time (for building in new requirements) will force your business to stop and restart the entire project, build it themselves, or worse yet: outsource the whole thing, or basically kill the entire project.
- design-time problems. adding new systems, and new systems’ data, becomes a huge nightmare, especially the longer the “old-style” edw is in play. continual conformity of data drives the complexity ratings sky high. when the system becomes too complex, it becomes too costly, and when it’s too costly, all of those dreadful things happen (that i just mentioned under flexibility issues). eventually this leads to re-engineering, and full and complete re-architecture.
the data vault model solves these problems once and for all! again, the data vault is meant to be a back-end system, with zero accessibility by direct business users – and certainly not for your business intelligence engines.
but what will happen to the dw if we have some parts that are built on data vault and others on dimensional principles? is that a future problem?
no, this isn’t a problem – as long as the data vault is used appropriately, and there are data marts put between the bi applications and the true data warehouse (data vault). the data vault only causes problems (as you’ve seen) when it is directly available to end-users, queries, and reporting tools.
the data vault model was not built for user accessibility, that’s the job of star schemas – and as you’ve discovered, the star schemas do a better job of it. no, the data vault was built for scalability, flexibility, and accountability on the back-end, for the life of the data warehouse.
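to illustrate that layering, here’s a small sketch (python with sqlite; the names are invented for the example) of a data mart view sitting between the vault and the bi tools – the bi side queries a simple dimension, while the vault keeps the full history behind it:

```python
import sqlite3

# a tiny vault (hub + satellite) with two historized satellite rows.
# all names are made up for this illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_bk TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, name TEXT, load_dts TEXT);
INSERT INTO hub_customer VALUES ('c1', 'CUST-001');
INSERT INTO sat_customer VALUES ('c1', 'old name', '2024-01-01');
INSERT INTO sat_customer VALUES ('c1', 'new name', '2024-06-01');
-- the mart layer: a dimension view exposing only the latest satellite
-- row per key; bi tools query this, never the vault tables themselves
CREATE VIEW dim_customer AS
SELECT h.customer_bk, s.name
FROM hub_customer h
JOIN sat_customer s ON s.customer_hk = h.customer_hk
WHERE s.load_dts = (SELECT MAX(load_dts) FROM sat_customer s2
                    WHERE s2.customer_hk = s.customer_hk);
""")

row = con.execute("SELECT customer_bk, name FROM dim_customer").fetchone()
print(row)  # -> ('CUST-001', 'new name')
```

the vault keeps every historical row; the mart presents the business-friendly shape. swapping or adding marts later requires no change to the vault underneath.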
if and when you have the proper data vault architecture in place, you can get good at building star schemas. i once managed a team of 3 people, and in 1997 we built new star schemas within 45 minutes of receiving the 2-page requirements document – and we didn’t even have an etl application in place! we had to code our business rules into sql for sybase, and run stored procedures around it. bottom line: our turn-around time was so fast that many of the business units would come to us for answers; they refused to work with their own business units, who took weeks and sometimes months to get them results!
we turned our it department into a profit center for our lead business user, and without rapid turn-around time like this (which you won’t see with star schemas as a data warehouse), you’re bound to end up getting the project shut down at some future point.
i hope this answers your questions. i would love to hear more about what’s happening with your project, and perhaps i can assist you through the coaching area.