when you have a raw data vault and a business data vault (bdv), they typically exist as separate components with separate data models. if you are looking into operational data vaults (odvs), you may soon realize that, because real-time data is constantly moving in and out of the raw data vault, you now have to combine the raw dv and the bdv into a single instance. this is what i am doing in the odvs that i am building.
i know, i know – in class i taught you to separate the two, which is great in concept, and in most installs that are not odvs it is a wonderful idea. however, in the case of an odv, the timing requirement (real-time operational data available to on-demand operational reports) means that the two have to be pushed together.
does this mean we mix raw and business computed data together?
i should say generally: no – not in the same tables. generally what you should do is create another “layer” of data vault model on top of the raw data vault model. for example: suppose i am dealing with medical data, and i have a raw table called people. the people table absorbs (for a variety of reasons) multiple duplicate definitions from different systems. i have a raw satellite from system a, a raw satellite from system b, and a raw satellite from system c. what the consumer wants is a single master patient record (duplicate people records rolled together through some light-weight business rules). to handle this situation, i added a patient hub, a patient demographics satellite, and a people-to-patient link structure.
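to make the layering concrete, here is a minimal sketch of that model using python's built-in sqlite3 module. every table name, column name, and key type below is my own assumption for illustration (the example above only names the people table, the three raw satellites, the patient hub, the patient demographics satellite, and the people-to-patient link), so treat it as a shape, not as the actual schema.

```python
import sqlite3

RAW_SYSTEMS = ("a", "b", "c")  # the three source systems from the example


def create_odv_schema(conn):
    """create raw and business layer tables side by side in one database."""
    # raw layer: the people hub (hypothetical columns)
    conn.execute(
        "CREATE TABLE hub_people ("
        " people_key TEXT PRIMARY KEY,"
        " load_date TEXT NOT NULL,"
        " record_source TEXT NOT NULL)"
    )
    # raw layer: one satellite per source system, history kept by load_date
    for system in RAW_SYSTEMS:
        conn.execute(
            f"CREATE TABLE sat_people_system_{system} ("
            " people_key TEXT NOT NULL REFERENCES hub_people(people_key),"
            " load_date TEXT NOT NULL,"
            " full_name TEXT,"
            " birth_date TEXT,"
            " PRIMARY KEY (people_key, load_date))"
        )
    # business layer: the master patient hub and its demographics satellite
    conn.execute(
        "CREATE TABLE hub_patient ("
        " patient_key TEXT PRIMARY KEY,"
        " load_date TEXT NOT NULL,"
        " record_source TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE sat_patient_demographics ("
        " patient_key TEXT NOT NULL REFERENCES hub_patient(patient_key),"
        " load_date TEXT NOT NULL,"
        " full_name TEXT,"
        " birth_date TEXT,"
        " PRIMARY KEY (patient_key, load_date))"
    )
    # the link is the only place where the raw and business layers touch
    conn.execute(
        "CREATE TABLE link_people_patient ("
        " people_key TEXT NOT NULL REFERENCES hub_people(people_key),"
        " patient_key TEXT NOT NULL REFERENCES hub_patient(patient_key),"
        " load_date TEXT NOT NULL,"
        " PRIMARY KEY (people_key, patient_key))"
    )


def table_names(conn):
    """return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_odv_schema(conn)
print(table_names(conn))
```

the point of the sketch is that raw and business tables live in the same instance but never share a table; the link table carries the association between them.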
in class you talked about the fact that “latency means no time to cleanse/merge data” so how does this work?
yep, in class i did talk about this, and the rules still apply. however, in this case the latency of the “patient data” is interesting, because the operational application is actually editing the rolled-up patient data directly! external source feeds arrive once a day and are placed into the people table. this means that an operational transaction received by our odv is inserted directly into the patient data and is available on the operational report immediately. voila – we have a working odv without cleansing in the middle.
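the two write paths can be sketched in a few lines of python. this is only an illustration under my own assumptions: the class, method, and field names are hypothetical, and in-memory lists stand in for the real satellite tables.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class OperationalVault:
    # in-memory stand-ins for the raw people satellites and the
    # business-layer patient demographics satellite (hypothetical names)
    raw_people_sat: list = field(default_factory=list)
    patient_demographics_sat: list = field(default_factory=list)

    def apply_operational_edit(self, patient_key, attributes, now=None):
        """real-time path: insert a new row straight into the business layer."""
        row = {
            "patient_key": patient_key,
            "load_date": now or datetime.now(timezone.utc),
            **attributes,
        }
        self.patient_demographics_sat.append(row)  # insert only, never update
        return row

    def load_nightly_feed(self, source_system, records, now=None):
        """batch path: land external records, unchanged, in the raw layer."""
        stamp = now or datetime.now(timezone.utc)
        for record in records:
            self.raw_people_sat.append(
                {"record_source": source_system, "load_date": stamp, **record}
            )

    def current_patient(self, patient_key):
        """what the operational report reads: the newest row for a patient."""
        rows = [
            row for row in self.patient_demographics_sat
            if row["patient_key"] == patient_key
        ]
        return max(rows, key=lambda row: row["load_date"]) if rows else None
```

an edit made through apply_operational_edit is visible to current_patient on the very next read, while load_nightly_feed only touches the raw side; rolling the raw people records into patients stays a separate, scheduled step.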
but what if the external source system (system a or b) sends transaction data? what happens to timing then?
in this case, these systems send massive batch feeds, and the patient record is synchronized once a night with all the updates being combined. however, to answer the question – you need to go back to the business user. why? because immediate response requires faster hardware, and faster hardware requires more money (usually a lot more money), so the business needs to justify spending that money (on a recurring basis) in order to achieve super low latency with high computing power.
why do they need high computing power?
well, if you are dealing with 1 to 2 second latency and you have a heavy routine full of logic to “merge/mix/match/consolidate” people records into patient records, then you need the full power of parallelism to finish within those 1 to 2 seconds. this is especially true if the person identified by the current transaction has many historical records, or many different person records that need to be munged together. you might even need the resources of cloud computing to scale the parallelism with super fast access.
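as a rough sketch of what that parallel consolidation might look like, the python below fans the merge out across patients with a thread pool. the merge rule (the newest non-null value wins) and all function and field names are assumptions of mine, standing in for whatever light-weight business rules you actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_people_records(records):
    """roll duplicate people records into one patient record.

    assumed rule: walk the records oldest to newest and let each
    non-null attribute value overwrite the older one.
    """
    merged = {}
    for record in sorted(records, key=lambda r: r["load_date"]):
        for name, value in record["attributes"].items():
            if value is not None:
                merged[name] = value
    return merged


def build_patients(people_by_patient, max_workers=4):
    """merge each patient's people records, one task per patient."""
    keys = list(people_by_patient)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        merged = pool.map(
            merge_people_records,
            (people_by_patient[key] for key in keys),
        )
        # pool.map yields results in input order, so zip lines them up
        return dict(zip(keys, merged))


people = {
    "patient-1": [
        {"load_date": "2024-01-01", "attributes": {"name": "Ann Lee", "phone": None}},
        {"load_date": "2024-01-02", "attributes": {"name": None, "phone": "555-0100"}},
    ],
}
print(build_patients(people))
```

threads are used here only to keep the sketch small and runnable. for a cpu-heavy merge routine in python you would more likely reach for processes or a distributed engine, which is exactly where the hardware bill comes from.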
i can buy that, but what happens if i have 2 or more transactions from different systems for the same person at the same time?
these are the same problems that oltp systems deal with, and have been written up over the years in many different places. use your oltp operational knowledge to apply the same answer to the operational data vault. remember, an operational data vault takes on all the same characteristics of an operational system + a data warehouse.
back to the point of the post: in an odv you can and should put the two models (raw dv + business dv) together in one instance, and use links to associate the tables. you can and should use real-time mining algorithms to crunch, merge, and roll up data sets in a master context inside the odv (by the way, never forget that data quality really should be running on the source systems – not always a possibility when they are external to your organization).
the reality is that most business users are fine with the situation described here; most of them don’t know, and don’t care, that an odv is being used. what matters to them is that the operational application updates the bdv tables, not the raw dv tables, which means they get their data immediately.
hope this helps shed some light on operational data vaults, business data vaults, and raw data vaults.