Data Vault and Business Data Vault Mixed Together

when you have a raw data vault and a business data vault, they typically exist as separate components with separate data models.  if you are looking into operational data vaults, you may soon realize that because of the nature of real-time data moving in and out of the raw data vault, you now have to combine the raw dv and the bdv into a single instance.  this is something i am doing in the odv's that i am building.

i know, i know – in class i taught you to separate the two, which in concept is great, and in most installs that are not odv's it is a wonderful idea.  however, in the case of an odv, the need for timing (real-time operational data available to on-demand operational reports) means the two have to be pushed together.

does this mean we mix raw and business computed data together?

i should say generally: no – not in the same tables.  generally, what you should do is create another "layer" of data vault model on top of the raw data vault model.  for example: suppose i am dealing with medical data, and i have a raw table called people.  the people table absorbs (for a variety of reasons) multiple duplicate definitions from different systems.  i have a raw satellite from system a, a raw satellite from system b, and a raw satellite from system c.  what the consumer wants is a single master patient record (duplicate people records rolled together through some light-weight business rules).  to handle this situation, i added a patient hub, a patient demographics satellite, and a people-to-patient link structure.
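to make the layering concrete, here is a minimal python sketch of the rollup idea.  all the names (match key, survivorship rule, field names) are invented for illustration – they are not the actual table structures, just one way the light-weight business rules could work:

```python
# Roll duplicate "people" rows (from raw satellites for systems a, b, c)
# into one master patient record, and record the people-to-patient link
# rows that tie the raw layer to the business layer.

def roll_up_people(people_rows):
    """Group people rows by a lightweight match key (here: name + dob)
    and produce one consolidated patient record per group."""
    patients = {}           # match_key -> consolidated patient record
    people_to_patient = []  # link rows: (people_id, patient_id)
    for row in people_rows:
        key = (row["name"].strip().lower(), row["dob"])
        if key not in patients:
            # first sighting of this person: mint a new patient id
            patients[key] = {"patient_id": len(patients) + 1, **row}
        elif row["load_dts"] > patients[key]["load_dts"]:
            # light-weight survivorship: the newest load wins
            patients[key].update(row)
        people_to_patient.append((row["people_id"], patients[key]["patient_id"]))
    return list(patients.values()), people_to_patient
```

the real business rules would of course be richer (fuzzy matching, per-attribute survivorship), but the shape is the same: the raw people rows stay untouched, and the patient layer plus the link are derived on top.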

in class you talked about the fact that "latency means no time to cleanse/merge data" – so how does this work?

yep, in class i did talk about this, and the rules still apply.  however, in this case the latency of "patient data" is interesting because the operational application is actually editing the rolled-up patient data!  external source feeds arrive once a day and are placed into the people table.  so, the operational transaction our odv receives is inserted directly into the patient data and is available on the operational report immediately.  voila – we have a working odv without cleansing in the middle.
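a rough sketch of the two paths described above (class and method names are invented, purely to illustrate the flow): oltp transactions hit the bdv patient tables directly, while external feeds land in the raw people table and only get consolidated on the nightly sync:

```python
# Two ingestion paths in the combined raw + business data vault:
# real-time OLTP transactions write the patient (BDV) structures
# directly; external batch feeds land raw and are merged later.

class OperationalDataVault:
    def __init__(self):
        self.people = []    # raw dv: source rows stored as-received
        self.patients = {}  # business dv: master patient records

    def apply_oltp_transaction(self, patient_id, attrs):
        # real-time path: no cleansing in the middle, report-ready now
        self.patients.setdefault(patient_id, {}).update(attrs)

    def load_batch_feed(self, rows):
        # raw path: absorb the external feed as-is
        self.people.extend(rows)

    def nightly_sync(self, match_fn):
        # once a night: merge raw people rows into master patients
        # using whatever light-weight matching rule is supplied
        for row in self.people:
            patient_id = match_fn(row)
            self.patients.setdefault(patient_id, {}).update(row)
        self.people.clear()
```

the point the sketch makes is structural: only the batch path has a cleansing/merging step in it, so the real-time path never waits on it.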

but what if the external source system (system a or b) sends transaction data?  what happens to timing then?

in this case, these systems send massive batch feeds, and the patient record is synchronized once a night with all the updates combined.  however, to answer the question – you need to go back to the business user.  why?  because immediate response requires faster hardware, and faster hardware requires more money (usually a lot more money), so the business needs to justify spending that money (on a recurring basis) in order to achieve super-low latency with high computing power.

why do they need high computing power?

well, if you are dealing with 1 to 2 second latency and you have a heavy routine full of logic to "merge/mix/match/consolidate" people records into patient records, then you need the full power of parallelism to achieve this within 1 to 2 seconds – especially if the person identified by the current transaction has many historical records, or many different person records that need to be munged together.  you might even need the resources of cloud computing to scale the parallelism with super-fast access.
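as a toy illustration of why this parallelizes (hypothetical names, and a thread pool standing in for whatever parallel engine you actually run), each person's group of historical records can be consolidated independently of the others:

```python
# Consolidate many groups of person records concurrently.
# Each group is independent, so the merge work fans out in parallel.

from concurrent.futures import ThreadPoolExecutor

def consolidate(group):
    # placeholder for the heavy merge/mix/match logic:
    # here, later loads simply overwrite earlier attribute values
    merged = {}
    for record in sorted(group, key=lambda r: r["load_dts"]):
        merged.update(record)
    return merged

def consolidate_all(groups, workers=4):
    # one task per person-group; results come back in input order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(consolidate, groups))
```

in a real install the per-group routine is where the expensive logic (and hence the hardware cost) lives; the pool just shows that nothing forces the groups to be processed serially.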

i can buy that, but what happens if i have 2 or more transactions from different systems for the same person at the same time?

these are the same problems that oltp systems deal with, and they have been written up over the years in many different places.  use your oltp operational knowledge to apply the same answers to the operational data vault.  remember, an operational data vault takes on all the characteristics of an operational system plus a data warehouse.

back to the point of the post: you can & should mix the two models (raw dv + business dv) together, and use links to associate the tables.  you can and should use real-time mining algorithms to crunch, merge, and roll up data sets in a master context inside the odv (by the way, never forget that data quality really should be running on the source systems! – not always a possibility when they are external to your organization).

the reality of it is: most business users are fine with the situation described here; most don't care, and don't even know that an odv is being used.  the key point is that the operational application updates the bdv tables, not the raw dv tables, which means users get their data immediately.

hope this helps shed some light on operational data vaults, business data vaults, and raw data vaults.

dan l


7 Responses to “Data Vault and Business Data Vault Mixed Together”

  1. hortoristic 2010/04/19 at 11:09 am #

    Dan – I have delivered many DW/BI implementations using the Microsoft BI tool stack. Our state agency has a large DW that used the DV approach, and I'm trying to read up on and understand the approach. I've only done star-schema dimensional modeling.

  2. dlinstedt 2010/04/19 at 12:37 pm #

    Hi Horton,

    Thank you for your comments. Welcome to the club, as it were. Please let me know what questions you might have, and also check my rate card. If your agency would like remote help/support – we can arrange that. You can also follow technical discussions at: – there are about 600+ users in the community. While new discussions are not posted frequently, I usually visit the forums at least once a week to check on what's happening and answer specific items.

    I also have a data vault modeling book in the works – I hope to complete it this year. It should give a lot of insight into the data modeling side of the house.

    Again, let me know if I can be of assistance.
    Dan L

  3. Roelant Vos 2010/04/19 at 9:10 pm #

    Hi Dan,

    I’ve been working with this idea (a separate ‘raw’ and ‘cleaned’ area) for some time now so it’s good for me to see more background for this. In our architectures we use a similar split approach between data as-is and the global / enterprise business rules before actually applying structure and specific business rules in datamarts.

    What I'm still in doubt about is whether the PATIENT hub should get its own surrogate keys, since the original surrogate keys from the PEOPLE hub are still available. It's a subset of the PEOPLE hub, right (with some survivorship mechanisms as logic)? In this case the link table would be obsolete because the keys are the same. It would then be sufficient to only have a PATIENT DEMOGRAPHICS satellite which links to some of the PEOPLE hub keys.

    What’s your take on that?

    Roelant Vos

  4. dlinstedt 2010/04/19 at 9:22 pm #

    Hi Roelant,
    I believe that in this case patient must get its own unique ID. Why? Because the OLTP app actually uses the Patient ID as a business key. Also, we frequently change (as a result of external data) which people IDs are rolled up under which patient IDs. It doesn't always stay static throughout time; our master source is outside of our control. And it's always a good idea never to "re-use" the sequence numbers from another Hub table. I realize your wish is to remove the link, but in the long run I believe it will just cause more heartburn for you.
    Let me know what you find after a year or two – or update us in a couple months… this is just my opinion mind you, and it fits the working situation I have, but it’s always good to share new ideas.
    Dan L

  5. Roelant Vos 2010/04/19 at 9:58 pm #

    Hi Dan,

    Thanks for your reply. Does the OLTP business key for patient mean that the OLTP table loads to both the PEOPLE and the PATIENT hubs simultaneously from the same source table? If patient is a source key I definitely understand the need for a separate hub. For my understanding: in your example, is the PATIENT entirely derived off the PEOPLE hub?

    You’re right, the deduplication business rules may change and this will cause the set of keys in my proposed satellite to change which will have impact on the datamarts. So that was not a good idea 🙂

    But maybe there is another way. In the case where a patient is entirely derived off the superset ‘people’ I also see an option where you create a separate PATIENT HUB which uses (a deduplicated subset of) the same keys as the PEOPLE HUB.

    But instead of what I proposed in my previous post, the link table PEOPLE TO PATIENT could cater for the rollup of which people are deduplicated to patients, using only the people surrogate keys. Isn't that the best of both worlds? You can always relate back to the original 'raw' data this way, i.e. backtrace from patient to people using the link, while at the same time relating the patient key to all people satellites. And if the logic changes, this will be reflected in the link table (with its satellite). Since the hub keys are there for eternity, this would not affect the 'future-proofness'.

    The only thing I can think of is that somewhere in the future a real patient source might be added, but this could as well become a different hub.

    Thank you in advance!


  6. dlinstedt 2010/04/19 at 11:10 pm #

    In this case: the OLTP application is the source for Patient records, and Patient records are the master patient data. OLTP only uses Patient data, not person data. It's the batch loads and the cleansing process that merge/mix the people data into a consolidated patient record. So – from an external data source standpoint, PATIENT is derived off PEOPLE; from an OLTP standpoint, PATIENT is an insert/update table.

    These are just some of the confusing issues faced when building an ODV.

    Yes, a separate PATIENT hub is in fact what I have, and the link does cater to the roll-up of the aggregated data.

    We already have a real patient source coming in. Currently it’s mixed in the batch process for cleaning up the data set and adding patient data. We do not see the PEOPLE table disappearing any time soon.

    I know, it’s confusing – but this is the way ODV changes the game.

    Hope this helps,
    Dan L

  7. Roelant Vos 2010/04/19 at 11:35 pm #

    Ah I see, thanks for your explanation!
