As you might imagine, the Data Vault (just like any other data modeling technique) has its issues, pitfalls, and limitations. Some of the limitations of the original modeling technique have been overcome, but not all. Note: update posted here….
This is an evolutionary step in the process of information modeling, and as such it will see additional refinements going forward that improve it. As the inventor, I feel it necessary to share these pitfalls with the community, providing the true picture and not just a “colorized” version of it that is all rosy and warm. If you have comments, please sign up for the forums and offer critical review, or post your statements there.
On this page we take a look at some of the risks of using a Data Vault model, along with some of the successes. We also look at some of the risks of other data modeling concepts, from both a business and a technical standpoint. The first white paper in the series explains some of the evolution, timelines, and pros and cons, so we will not reiterate those issues here. What we will discuss are some of the business battles, the performance concerns, and the data-versus-information debate. We will also shed some light on what we’ve done in the past to make the Data Vault a success in different businesses.
Dave Wells of TDWI wrote an interesting article about the Data Vault and its position in the warehouse. In that article he also discusses the concepts of master dimensions and master fact tables. We’ll explore some of those issues below.
Data Vault Issues
The Data Vault is just that: a “data”-based architecture. While it is purposely architected around the business, it is driven by the data that we keep in our source systems. This is juxtaposed with the knowledge proposition, which is driven by information. To make sense of the data beyond pattern recognition, integration, and flexibility, we must 1) improve the data quality, 2) merge, mix, and match different data sets, and 3) multi-dimensionalize the data.
We have proposed an architecture that moves the business-processing or information-processing rules downstream, beyond the Data Vault: in other words, moving data out of the Data Vault and back into the presentation layers for usability by the business. In doing so, we have defined a storage format (form) of information into which we load raw data. The only rules that apply to data coming into the Data Vault produce an integrated and consolidated view at the lowest level of grain. This data is separated by rate of change, type of information, and semantic meaning.
The architecture doesn’t care or pay attention to how the information can or should be used, nor does it discern which information is “right versus wrong” (mostly because right and wrong are matters of perspective, depending on who is examining the data). Thus, the data in the Data Vault is not for direct end-user access; it is for power-user access and for data mining or discovery operations. This leaves us to figure out what to do downstream from the Data Vault.
- How do we turn our data into information?
- Does it make sense to “stop” at different information stores for certain data?
- How can real time be realized if the data has to stop in different places?
But before we get into answering some of these questions, let’s take a look at some of the business and technical issues that face us if we build a Data Vault.
- Data in the Data Vault is not end-user accessible.
- Data in the DV is not cleansed or quality checked.
- Benefits of the DV are indirect, but very real.
- More up-front work is required for the long-term payoff.
- Business users believe (in the beginning) that they don’t need an “extra copy of the data.”
- Elegant architecture is secondary to business churn.
- Using a DV forces examination of source data processes and source business processes; some business users don’t want to be accountable and will fight this notion.
- Businesses believe their existing operational reports are “right”; the DV architecture proves this is not always the case.
- Business users from different units must agree on the elements (scope) they need in the Data Vault before parts of it can be built.
- Currently there is only one source of information exchange; there are no books on the Data Vault (yet).
- Some businesses fight the idea of implementing a new architecture, claiming it is as yet unproven.
- Modelers struggle to grasp the reasons behind not enforcing relationships at the data model level.
- The Data Vault model introduces many, many joins.
- The Data Vault model is based on MPP computing, not SMP computing, and is not necessarily a clustered architecture.
- The Data Vault contains all deltas; it houses deletes and updates only as status flags on the data itself.
- Data must be turned into information before it is delivered to the business.
- Modelers must accept that there is no “snowflaking” in the Data Vault.
- Stand-alone tables for calendar, geography, and sometimes codes and descriptions are acceptable.
- Typically 60% to 80% of source data is not tracked by change, forcing a re-load and delta comparison on the way into the DV.
- Tracking queries becomes paramount for charging different user groups for “data utilization rates” and for funding new projects.
- Businesses must define the metadata at a column level in order to make sense of the Data Vault storage paradigm.
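Two of the points above, untracked source changes and deletes stored only as status flags, can be made concrete with a small sketch. The following is a minimal illustration in Python, not production loader code: the table layout, column names, and the use of an MD5 hash over descriptive attributes are all assumptions for the example.

```python
import hashlib

def row_hash(row, descriptive_cols):
    """Hash the descriptive attributes so deltas can be detected cheaply."""
    payload = "|".join(str(row[c]) for c in descriptive_cols)
    return hashlib.md5(payload.encode()).hexdigest()

def detect_deltas(current, incoming, key, descriptive_cols):
    """Compare a full source extract against the current vault image.

    Returns (changed_rows, deleted_keys). Changed rows are loaded as new
    delta entries; deleted keys are recorded as status flags on the data,
    never as physical deletes, so the full history is preserved.
    """
    current_hashes = {r[key]: row_hash(r, descriptive_cols) for r in current}
    incoming_keys = {r[key] for r in incoming}

    changed = [r for r in incoming
               if current_hashes.get(r[key]) != row_hash(r, descriptive_cols)]
    deleted = [k for k in current_hashes if k not in incoming_keys]
    return changed, deleted

# Hypothetical usage: a full re-load of a customer feed with no change tracking.
current = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]
incoming = [{"cust_id": 1, "name": "a2"}, {"cust_id": 3, "name": "c"}]
changed, deleted = detect_deltas(current, incoming, "cust_id", ["name"])
# changed holds cust_ids 1 and 3; deleted holds [2], to be flagged, not removed.
```

The cost this sketch makes visible is exactly the issue in the list: when the source doesn’t track its own changes, every load must touch the full extract to find the deltas.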
Just because we’ve listed the issues doesn’t mean there aren’t mitigation strategies. In fact, About the Data Vault and the forums describe some of the mitigation strategies that will help you overcome these issues. On the other hand, as with any evolving or changing architecture, there will come a time when this one will be changed too, to meet future needs that aren’t yet seen.
There are issues with understanding the application of the Data Vault, and they come in different forms. Understanding how the hubs and links are utilized, built, and represented is the most difficult part. Once these techniques are mastered, they can be applied anywhere in the model, in a repeatable and redundant fashion. Take the hub, for example. The question I commonly get is: why store only one copy of the business key, especially if the key arrives from multiple sources? The answer is quite simple: integration. Raw-level integration of the business information is of the utmost importance. In this manner, the same key may represent different descriptive data, and it ties together the currently dis-integrated source data. If, on the other hand, the two keys actually represent physically different semantic meanings, then they must be separated into two hubs. If the two keys represent two different “customers,” for example, but are at the same semantic layer, then a record source must be added to the business key (unique index) of the hub. This is actually a business problem that must be recorded and fixed.
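The “one copy of the business key” rule can be sketched as a tiny lookup-or-insert routine. This is a minimal illustration, assuming an in-memory dictionary stands in for the hub table and that the hypothetical `ensure_hub_key` helper records the source of first arrival; a real loader would, of course, work against the database.

```python
def ensure_hub_key(hub, business_key, record_source):
    """Store exactly one copy of each business key in the hub.

    The same key arriving from several source systems resolves to the same
    hub entry, integrating otherwise dis-integrated source data. The record
    source of first arrival is kept for auditability.
    """
    if business_key not in hub:
        hub[business_key] = {
            "hub_sequence": len(hub) + 1,    # surrogate key for link/satellite rows
            "record_source": record_source,  # where the key first arrived
        }
    return hub[business_key]["hub_sequence"]

# Hypothetical usage: the same customer key arrives from two systems.
hub = {}
seq_crm = ensure_hub_key(hub, "ACME-001", "CRM")
seq_erp = ensure_hub_key(hub, "ACME-001", "ERP")
# seq_crm == seq_erp: both sources now hang their descriptive data off one hub entry.
```

If the two arriving keys turn out to carry different semantic meanings, this routine is the wrong tool: as the text says, they belong in two separate hubs, or the record source must become part of the hub’s unique index.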
Dave Wells and I sat down and discussed the nature of the Data Vault, and perhaps it’s a very fitting name. We both agree that there is something more, something bigger to be had, and that’s a common or master set of “information.” So the next question we face is: how do we turn data into information?
Data Vault Versus Information
The Data Vault houses data, as we’ve discussed above: unedited, unchanged, flat-out data. Information emerges when the data and structures are processed with specific functionality, say, data mining, or further integration and aggregation of the data to build and load a star schema. Dave has come up with something interesting he calls “master dimensions, master facts, and master cubes” as an information platform. In the past, this type of processing has happened on the way out of the Data Vault and on the way to “report collections” (flat, wide, highly denormalized reporting tables), which is just another name for a data mart. It has typically been built into a second layer of staging tables (that was in 1997, when the hardware and software hadn’t advanced to what they are today).
These days, turning the data into information certainly requires additional stomping grounds: additional processing and complex business rules. After all, a “version of the truth” is hard to come by, and it can change as often as the business management team, or the business itself, changes. If we look at this from a different angle, batch versus real time, a whole host of issues crops up, ranging from the refresh rates of the data to the size and volume of data going through those complex processing rules. We also end up with n dimensions along which the business users want to examine the data. In other words: sales by finance by contracts, crossed with executives vs. decision makers vs. management vs. line workers, crossed with campaign management vs. CRM vs. ERP, and so on. There are n dimensions by which to examine this information (and that’s not even the tricky part).
The tricky part is getting the answers to agree across all the dimensions while sourcing off the same data. It requires a great deal of patience on the part of the business users, to allow the proper time to identify the grain in between. It also requires an audit phase prior to releasing the data into regular business operations, and the ability to aggregate by x dimensions in the same data store in order to provide the queries with consistent answers. Finally, it requires the ability to strip out bad, old, or unwanted data before processing, and to separate that information into separate stores. Oh yes, we nearly forgot: business user training is necessary (by the business, for the business) in order to understand which collections or data marts hold which data, and at what grain it’s stored.
These are the business users who will write the new reports against the information stores housed within the warehouse. Sometimes the warehouse team is required to manage accountability through this process as well, and in doing so must meet at least “two versions of the truth with the same answer at the same time.” We had a case where we had to match sales revenue to finance revenue and provide both answers at the same time. Finance wanted to see all the adjustments in the time period in which they applied; sales wanted to see an aggregate of the adjustments and the total at the end of each month.
We solved this problem with a dual-entry report collection. By dual-entry, we mean dual entry dates. We had two dates: the applied date and the changed date (or loaded date). Accounting rolled up the grain by applied date; sales rolled up the numbers by changed date. Each had their version of the truth, and the two “agreed with each other.”
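The dual-entry idea can be shown in a few lines. This is a minimal sketch, not the actual report collection we built: the column names, the monthly grain, and the sample adjustment row are all assumptions for illustration. The point is that both roll-ups read the same rows, so the grand totals necessarily agree.

```python
from collections import defaultdict

def roll_up(rows, date_field):
    """Aggregate revenue by month of the chosen date column."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[date_field][:7]] += r["amount"]  # 'YYYY-MM' grain
    return dict(totals)

rows = [
    # an adjustment recorded in February that applies to January revenue
    {"applied_date": "2024-01-31", "changed_date": "2024-02-05", "amount": -100.0},
    {"applied_date": "2024-01-15", "changed_date": "2024-01-15", "amount": 1000.0},
]
finance_view = roll_up(rows, "applied_date")  # adjustments in the period they apply
sales_view = roll_up(rows, "changed_date")    # adjustments when they were recorded
assert sum(finance_view.values()) == sum(sales_view.values())  # grand totals agree
```

Finance sees January at 900 (adjustment folded back into the month it applied); sales sees 1000 in January and the -100 adjustment in February. Different monthly pictures, same total, which is exactly what “agreed with each other” meant.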
In terms of turning data into information, the specifics are different for each business; this is where the business rules certainly do apply. However, there are a few things to note about scalability and complexity:
- As the complexity of the business rules increases, scalability decreases. They are in inverse proportion to one another.
- As scalability decreases, either the same amount of data takes far longer to pass through the process (an exponential rise), or the data set must be cut to half as much in order to process in the same time frame as previous run times.
- Data that is in error must be separated from data that is not; this is where the colored lenses of “truth” change the view of the data/information. Data in error must be sent to “error marts” if it is bad enough to cause mistakes in the aggregate calculations downstream. Otherwise, if it doesn’t cause mistakes in the aggregates, it can be marked with default values and doesn’t need to be re-routed to error marts.
- Complexity limits the possibilities for parallelism, and therefore scalability; sending 100% of the data through all the business rules on every load becomes an impossible task (given the exponential rise in complexity). In other words, analyzing the entire data set over and over again eventually carries too high a cost for the business to bear.
- Information is useful when it holds summaries of the data in bite-sized, understandable chunks. This is when the business can make better decisions based on aggregate information. That’s not to say that lots of small aggregates presented in the right way can’t help the business; it is to say that 1) too much data can easily overwhelm the decision maker, and 2) too much data aggregated into averages can easily lead the user astray. A fine balance must be reached between the two.
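The error-mart routing rule above is simple enough to sketch. This is an illustrative toy, assuming a hypothetical `route_row` helper and made-up column names: a row is diverted to the error mart only when a column needed by the downstream aggregate calculations is missing; otherwise the gap is patched with a default and the row flows on.

```python
def route_row(row, required_for_aggregates, defaults):
    """Route a row either onward (possibly with defaults) or to an error mart.

    A row goes to the error mart only when a column needed by downstream
    aggregate calculations is missing; otherwise missing values are patched
    with defaults and the row continues through the pipeline.
    """
    missing = [c for c, v in row.items() if v is None]
    if any(c in required_for_aggregates for c in missing):
        return "error_mart", row          # would corrupt the aggregates: divert it
    patched = {c: (defaults.get(c) if v is None else v) for c, v in row.items()}
    return "information_mart", patched    # harmless gap: default it and move on

# Hypothetical usage: a missing amount breaks revenue sums, a missing region doesn't.
target, _ = route_row({"sale_id": 7, "amount": None}, {"amount"}, {"amount": 0.0})
# target == "error_mart"
target, patched = route_row({"sale_id": 8, "amount": 50.0, "region": None},
                            {"amount"}, {"region": "UNKNOWN"})
# target == "information_mart", patched["region"] == "UNKNOWN"
```

Keeping this decision cheap matters because of the scalability points above: the more rows that must take the expensive error path, the less of the load can run in parallel.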
There are other issues and knowledge points that will be discussed in the forums going forward. Please sign up for the forums today and offer your opinion on how to handle such information. As we all know, there are pros and cons to every technique on the market, and this one is no different. What is different about this approach is that it is an evolutionary step to the next level of information modeling.