i’ve been reflecting on recent events, and having some great discussions with many of you in the community. for that, i thank you. i wanted to take a minute to return to the roots of the data vault, and why it was created in the first place. i will also do my best to provide an unbiased (even though you may think otherwise) minor comparison of the kimball / data vault statements going around the industry.
the data vault is made up of three basic parts:
- architecture (a three-tier systems architecture)
- methodology (rules around how, and why)
- model (standardized data model with strictly defined entities)
the data vault is nothing new; it is a hybrid design made up of “best of breed” elements from multiple normalized forms (1st, 2nd, 3rd, 4th, etc.) and dimensional modeling (dimensions and facts). as such it is not a be-all, end-all solution, nor should everyone implement a data vault. there are times when a data vault is not necessary, nor even warranted. those times might include some of the following:
- you are building a single-business-unit focused answer set
- you do not need an enterprise view
- you do not need an auditable historical data store (as you have all the data in a single source system, backed up forever)
- you do not have a number of external systems to integrate (3 or more?)
- you are running a columnar data store, key=value store, nosql store, or denormalized store (like netezza)
i’ve always maintained (and still do) that i am not here to sell you on data vault. if you want to use data vault, and you would like knowledge or training around it, then that is one of the services i can provide (among others in the industry who also provide excellent services). i am not here to tell you that using data vault will somehow be the answer to solving all your data warehousing problems; it won’t. the data vault model (being a hybrid data model design) will solve some of the problems you might see in enterprise data warehousing, but plenty of other issues will still exist.
the data vault systems architecture is based on a three-tier architecture. this is a classic architectural design, whereas the kimball bus architecture is based on a two-tier architecture. the debate rages on – do you need three tiers? are two tiers sufficient to isolate the business and the structures from the impact of changes? this is all for you to decide. some would argue today that a one-tier data warehouse architecture is enough, and i would have to say that in certain situations (like 100% real-time data feeds with 100% real-time analytics and alerts) this statement may in fact be true.
why does the dv recommend a three-tier architecture? well – in a batch-oriented situation (even mini or micro batch) it is still (today) helpful to replicate data from the source to some form of staging area in preparation for loading the enterprise data warehouse. the middle tier (the data vault model layer) is introduced to isolate the business presentation layer from the impact of adding source systems, adding new data, and changing the structures of historical storage, and to isolate the historical storage layer from the impact of changing the presentation layers to the business.
the dv architecture does, however, revert to a two-tier architecture as you approach a near-real-time solution. the staging area “disappears”, with the data sets now arriving on a message bus or message queue and delivered passively to the data vault model – in other words, the dv-based edw becomes a listener on the message queue. then you can decide what to do with the transaction further downstream on the analytical and release sides of the house.
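to make the “listener” idea concrete, here is a toy sketch of that two-tier real-time pattern in python. the in-process queue, the message shape, and the list standing in for a link table are all my own illustrative assumptions, not a prescribed implementation:

```python
# toy sketch: the dv-based edw acts as a passive listener on a message queue,
# landing each transaction straight into the vault with no staging area and
# no business rules applied on the way in.
import queue

message_bus = queue.Queue()   # stands in for an enterprise message bus
link_transactions = []        # stands in for a link table in the data vault

def listen_once(bus: queue.Queue) -> None:
    """pull one message off the bus and land it in the vault, unmodified."""
    msg = bus.get()
    link_transactions.append(msg)

# a source system publishes a transaction; the edw simply listens and stores.
message_bus.put({"customer_key": "C-42", "order_key": "O-1001", "amount": 25.00})
listen_once(message_bus)
```

the point of the sketch is what is *absent*: no staging write, no conforming logic – the transaction arrives in the vault exactly as published, and anything smarter happens downstream.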
at the end of the day, you need to decide whether you want a two-tier architecture (batch oriented), a two-tier architecture (100% real-time oriented), or a three-tier architecture (batch & real-time oriented). this is where your decision lies.
the dv methodology is nothing new either – it is based on tried and true best practices for data warehousing, project management, etl management, risk analysis, risk reduction, risk mitigation, cost measurement and containment, and so on. all of these things usually go into creating software – in other words, we are software engineers to a degree when we build etl in and around data warehouses; hence the ties to best practices of software engineering, like being agile, implementing parts of scrum, or using jad/rad sessions in conjunction with use cases to get your requirements. all of this needs to be applied from the methodology side of the house. the only thing interesting about the methodology of the data vault is that most of the artifacts can (and should) be auto-generated. why? because the data model is pattern based with strict rule sets.
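here is a minimal sketch of what “auto-generated artifacts” can mean in practice: because every hub follows the same strict pattern (business key, surrogate hash key, load date, record source), its ddl can be produced from a template. the table and column names here are my own illustrative assumptions, not a published standard:

```python
# generate create-table ddl for a data vault hub from the standard hub
# pattern. naming conventions and column types are illustrative assumptions.

def generate_hub_ddl(entity: str, business_key: str) -> str:
    """build hub ddl: surrogate hash key, business key, load date, source."""
    return (
        f"CREATE TABLE hub_{entity} (\n"
        f"  {entity}_hkey  CHAR(32)    NOT NULL,\n"
        f"  {business_key} VARCHAR(50) NOT NULL,\n"
        f"  load_dts       TIMESTAMP   NOT NULL,\n"
        f"  record_source  VARCHAR(50) NOT NULL,\n"
        f"  PRIMARY KEY ({entity}_hkey)\n"
        f")"
    )

print(generate_hub_ddl("customer", "customer_number"))
```

the same idea extends to links and satellites: each pattern is a template, and the only inputs are the entity names and business keys.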
does the dimensional model have patterns? can it be generated too? sure. the simple patterns are dimension and fact. but where it gets hairy or difficult to auto-generate is where the business rules are involved in loading these conformed entities. that’s where the data vault has a minor advantage – why? because the business rules have been moved downstream – meaning they are applied going from the dv model into dimensional models. now, can you auto-generate a raw-data-based dimensional model? sure, but you might lose the ability to conform data (maybe, maybe not).
can you auto-generate the etl for both of these architectures? sure, you can base-line it, and if you are loading raw data to your dimensional warehouse, then voila – you are set. can you do this with anchor models? key=value pair models? hadoop & nosql models? yes, yes, yes – for every model that has a pattern and that you opt to store raw data in, the etl/loading processes can be generated.
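in the same spirit, here is a hedged sketch of generating a raw hub-loading statement from the pattern – no business rules, just new-key detection and insertion. the staging table layout and the md5-based hash key are assumptions made for illustration:

```python
# because a raw data vault load applies no business rules, the loading sql
# for every hub can come from one template. staging layout and the md5
# hash-key choice are illustrative assumptions.

def generate_hub_load_sql(entity: str, business_key: str, staging: str) -> str:
    """emit an insert-select that lands only new business keys into the hub."""
    return (
        f"INSERT INTO hub_{entity} "
        f"({entity}_hkey, {business_key}, load_dts, record_source)\n"
        f"SELECT MD5(s.{business_key}), s.{business_key}, "
        f"CURRENT_TIMESTAMP, s.record_source\n"
        f"FROM {staging} s\n"
        f"WHERE NOT EXISTS (SELECT 1 FROM hub_{entity} h\n"
        f"                  WHERE h.{business_key} = s.{business_key})"
    )

print(generate_hub_load_sql("customer", "customer_number", "stg_crm_customer"))
```

notice there is nothing entity-specific in the logic itself – which is exactly why these processes can be generated for any pattern-based model holding raw data.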
ok, down to brass tacks. what are some of the issues or sticking points for the data vault model? let me see if i can list a few here:
- there are more joins – because all relationships, events and transactions are stored as many-to-many tables. you can think of these as factless facts. they would be defined exactly the same way, so if you are creating factless facts in a dimensional model at the lowest level of grain, then you are building almost as many tables in your dimensional model as you would build in the data vault.
- the tables are narrower – packing more rows per block, requiring more physical cpu power and parallelism (parallel query/parallel etl) from the infrastructure components. but if you partition your tables in a data mart vertically (splitting the dimensions into narrow dimensions, for instance) you are doing exactly the same thing the data vault model proposes.
- a main focus of this model is finding business keys, attempting to tie the physical model to the places in the business processes where the business keys are passed. this requires the modeler to better understand the business. but after all – isn’t that supposed to be a part of data warehousing in the first place?
- more etl processes – due to standards, rules and best practices, we (in data vault land) end up with far more processes than etl teams are used to. this can cause a ruckus, until the team understands that each of the processes runs at optimal speed (generally), and is much simpler because there is no business logic to contend with – no “conforming” of the data sets.
ok, so here’s the point: if you look at columnar databases, you get the same thing: joins (i.e., vertical partitioning) covered by specialized hardware for extreme parallelism. if you look at a raw data dimensional model, you get similar drawbacks – especially if you do not conform the data sets (resulting in vertical partitioning or splitting of the dimension tables).
at the end of the day, there is really only one big difference between data vault models and the alternative data modeling methods (2nd, 3rd, 4th, 5th normal forms, and dimensional models). that difference is:
* the way parent-child relationships are handled
yep – that’s it. that’s the whole enchilada. that’s the “big secret,” if there really is one (and i personally don’t think it’s a secret). but anyway… if you look at it like this:
dimensional models:
* dimension to fact = parent to child
* fact table = stores relationships, transactions, events for a point in time
* dimension table = stores descriptors, hierarchies, temporality, business keys, and sometimes history (depending on dimensional type)
* master key table = stores key lookups, key hierarchies, key resolution
* helper table = stores dimension to dimension joins
* junk table = stores “undefined / uncategorized” information
* snowflaked dimension table = relates hierarchies in a fixed one-to-many architectural definition
data vault models:
* hub to link = parent to child
* link table = stores relationships, transactions and events for a point in time; links also store hierarchies, key lookups, key hierarchies, and key resolution; a link stores business-key-to-business-key joins (covering what a helper table covers) and the relationships between hubs (what the snowflaked dimension referential integrity would hold)
* hub table = stores business keys and business keys only
* satellite table = stores descriptors, temporality, and history; it can store undefined or uncategorized information, but generally forces it to be attached to a business key of some sort
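one way to make the contrast concrete is to sketch the three data vault entity patterns as python structures. the field names here are my own illustrative assumptions, not a formal specification:

```python
# the hub / link / satellite split, sketched as python dataclasses.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Hub:
    """stores business keys, and business keys only."""
    business_key: str
    load_dts: datetime
    record_source: str

@dataclass
class Link:
    """many-to-many parent-child: relationships, transactions, events."""
    hub_keys: tuple          # the parent hub business keys being related
    load_dts: datetime
    record_source: str

@dataclass
class Satellite:
    """descriptors, temporality, history – always attached to a key."""
    parent_key: str          # the hub (or link) key this hangs off of
    attributes: dict
    load_dts: datetime

customer = Hub("C-42", datetime(2024, 1, 1), "crm")
order_link = Link(("C-42", "O-1001"), datetime(2024, 1, 1), "orders")
detail = Satellite("C-42", {"name": "acme co"}, datetime(2024, 1, 1))
```

note how the parent-child relationship lives in its own structure (the link) rather than being enforced by the shape of the parent and child tables themselves – which is the one big difference called out above.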
at the end of the day, it comes down to preference. what it is you like, what it is you are comfortable with.
if you like the parent-child relationships split out into their own structures, then you will like the data vault model. if you would rather represent these parent-child relationships in a fixed architectural design (meaning data-model enforced), then you would (mostly) prefer the dimensional model.
in closing i just want to say:
there are pros and cons to adding more tables and adding more parallel resource requirements to the infrastructure. it’s not so much the data model that needs to be weighed here as it is where you want to place your business rules, the choice of two-tier or three-tier architectures, and the requirements of parallelism or the ability to divide and conquer.
* it is my personal belief that a) you can do all these things with a raw-data-based dimensional model, but b) if it is a raw-data-based dimensional model, then you can no longer conform the data sets, because having raw data means: no coalescing, no changing of the data sets, no standardization, no conforming at all. at that point, you end up with roughly the same number of tables in a raw dimensional model as you would end up with in a data vault model.
anyhow, i hope this helps clear the air. data vault is not for everyone, nor will everyone like the data vault. if you end up wanting to try it, please feel free to do so. there are plenty of knowledgeable people out there willing to help.
all the best,