ROI - Return on investment

#datavault Models, Business Purpose, Data as an Asset

this entry takes a step far back in time, to when i created the data vault model.  i will share with you the thoughts around its creation, its intention, and its original purpose.  this entry is here because i’ve begun hearing that “the data vault model is not business focused” – and i want to put that notion to bed immediately, because it simply isn’t true.  if this is what you believe about the data vault model, then a) you’re not building it right, b) you don’t fully understand its purpose, or c) you might never have taken the time to take an authorized training course from me or one of my authorized trainers.

data as an asset on the books…

one of the original intents of the data vault model was, and still is, to provide value to the enterprise – in terms of data as an asset on the books: something that can be assigned a monetary value, and then depreciated over time.  now, actually assigning a dollar value is a very, very difficult task.  i will leave those thoughts for another blog post.  returning to the original premise…

hypothesis: in order for raw data to have value, it must: a) be tied directly to business processes, and b) be complete enough that when a human looks at the data, they can make decisions about how to turn that data into information (make it available to business decision making).  after all, this is where perceived value comes from – when the impact of the business decision is felt and measured against the profitability and overhead curves of the business.

supporting the hypothesis…

this means that, in order to tie data directly to business processes, the data must be linked (in some fashion) to business keys.  why?

imagine a ferrari that, at the dealer, has a value (sticker price) of $245,000.  you purchase the car, and the dealer says: oh, wait, we lost the key.  what is the value of the car to you, as the official and registered owner, while the key is lost?

yep, zero.

once the key is found, the value of the car (even while it’s still on the dealer’s lot) is once again the purchase price of the car – and you can drive it home.

business keys are important, very important…

business keys are the unique identifiers which business people use to find, assign, create, modify, and enrich descriptive data.  business keys are the unique identifiers which help business people and machines / applications tie the data together, and track it horizontally through the business (from sales to finance to contracts to manufacturing, for instance).  without business keys, the value of the data itself is… drum roll please… zero.

by the way, all of this and more is taught in any of my authorized cdvp2 (certified data vault 2.0 practitioner) courses – you will not get this from anywhere else!  check out more at:

business keys are everywhere…

ever had a bank loan?  what was the loan number?  how about a wireless phone? what was the phone number?  how about a water or electricity bill?  what was your customer number?  what about a contract with someone, what was the contract number?

you don’t have to look very far in the real world to find your business keys.  the very companies you work for use them to track associations of data directly to you.  now, what’s the value of you as a customer for one of these companies?  the answer?  it varies over time – depending on how much money you invest in them, or how much money you pay them monthly or over the life of the contract.  but that’s just the measurable value. then, there’s the intrinsic value – because if you’ve spent money with them before, then chances are pretty good (unless something bad happens) that you will spend money with them again.

however, if you sign up with them again, you will most likely be assigned a new business key.  this key will be used to track your information back through the business processes.

ok, fine.  what’s that got to do with data vault?

the data vault model, in its original form, is built to track business keys and their surrounding context through the lifecycle of business processes.  the business processes are executed by “source system applications” – they constitute not the perception of how the business should work, but the reality of how the business is actually working.  that said, we arrive at our first construct: the hub.

the original hub in the data vault model (when built properly, and not containing degenerate or weak keys, as some out there on the internet would have you believe) actually was just the business key.  you heard me right: no load date, no record source, no primary key (surrogate or hash).  it was purely the natural, original business key value.  the hub was just that…  a hub (a unique list of business keys).
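to make the idea concrete, here’s a minimal sketch of a hub as nothing but a unique list of business keys – the table and key names are illustrative, not prescriptive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the original hub: one column, the natural business key, nothing else
conn.execute("CREATE TABLE hub_customer (customer_number TEXT PRIMARY KEY)")

# loading a hub is simply de-duplicating the business keys that arrive
# on the source feeds; INSERT OR IGNORE keeps the list unique
arriving_keys = ["CUST-1001", "CUST-1002", "CUST-1001"]
conn.executemany(
    "INSERT OR IGNORE INTO hub_customer VALUES (?)",
    [(k,) for k in arriving_keys],
)

rows = [r[0] for r in conn.execute(
    "SELECT customer_number FROM hub_customer ORDER BY 1")]
print(rows)  # ['CUST-1001', 'CUST-1002']
```

note that the duplicate arrival of `CUST-1001` changes nothing – the hub stays a unique list of keys no matter how often a key shows up on the feeds.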

now, you think: well, there’s all this debate around surrogate vs hash keys, and debate around how to use the load date and the record source.  let me stop you right there – none of these fields holds business value, except when it comes time to troubleshoot the arriving nature of the data.

that said, the original satellite had a “replication” of the hub’s natural business key, combined with a load date – this had to be done in order to provide history, or data over time.  again, the original satellite definition had no record source, no load end date (which is dead now anyway), no current record indicator, or anything else… just the key structure plus the descriptive data that arrived on the source feeds.
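a minimal sketch of that original satellite shape – just the business key plus load date as the key structure, descriptive data alongside, loaded insert-only (names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# key structure (business key + load date) plus descriptive data only --
# no record source, no load end date, no current-record indicator
conn.execute("""
    CREATE TABLE sat_customer (
        customer_number TEXT,
        load_date       TEXT,
        name            TEXT,
        city            TEXT,
        PRIMARY KEY (customer_number, load_date)
    )
""")

# insert-only loads accumulate history (data over time)
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [
        ("CUST-1001", "2016-01-01", "Jane Doe", "Denver"),
        ("CUST-1001", "2016-03-01", "Jane Doe", "Boston"),  # she moved
    ],
)

history = list(conn.execute(
    "SELECT load_date, city FROM sat_customer "
    "WHERE customer_number = ? ORDER BY load_date",
    ("CUST-1001",),
))
print(history)  # [('2016-01-01', 'Denver'), ('2016-03-01', 'Boston')]
```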

which brings me to the link.  the original link had (again) a replication of just the business keys from two or more parent hubs.  no load date, no record source, no hash, no surrogate.
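sketched the same way, the original link is nothing but the replicated business keys of its parent hubs (again, all names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the original link: just the business keys replicated from two (or more)
# parent hubs -- no load date, no record source, no hash, no surrogate
conn.execute("""
    CREATE TABLE lnk_customer_contract (
        customer_number TEXT,
        contract_number TEXT,
        PRIMARY KEY (customer_number, contract_number)
    )
""")

# each row records one relationship observed on the source feeds
pairs = [("CUST-1001", "CTR-77"), ("CUST-1001", "CTR-78")]
conn.executemany(
    "INSERT OR IGNORE INTO lnk_customer_contract VALUES (?, ?)", pairs)

contracts = [r[0] for r in conn.execute(
    "SELECT contract_number FROM lnk_customer_contract "
    "WHERE customer_number = ? ORDER BY 1",
    ("CUST-1001",),
)]
print(contracts)  # ['CTR-77', 'CTR-78']
```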

well, why is that important?

because folks are now arguing with me, and with others over “hash key versus surrogate key” – when there is no true business value there, and it is not related in any way to the original purpose of data vault modeling.

why are hashes or surrogates used as primary keys then?

one-word answer: performance.  in reality, surrogates (if you still use them, or insist on using them) provide join performance – especially if they are clustered on non-mpp solutions.  on mpp, they really don’t matter, as the internal optimizer converts the values to a different representation to find the data on a node / amp / module, etc.  hashes, on the other hand (as explained previously), are there to solve heterogeneous cross-platform load performance bottlenecks – eliminating all the caching lookups that take place.  i refuse to get into the technical details any further on this blog entry.  search my blog for many, many posts about hashes, sequences, joins, bottlenecks, etc…
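to illustrate why hashes remove the lookup bottleneck, here’s a sketch of a deterministic hash key – md5 is shown as one common choice, and the trim/uppercase normalization is one common convention, not a mandate:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Deterministic hash key computed from one or more business keys.

    Because any platform can compute the same value independently from
    the business key alone, parallel loaders on heterogeneous systems
    need no sequence-generator caching or key-lookup round trips.
    """
    # normalize each part (trim + uppercase), then join multi-part keys
    normalized = ";".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# the same business key always yields the same key value, on any platform,
# even when the source formats it slightly differently:
print(hash_key("CUST-1001") == hash_key("cust-1001 "))  # True
```

contrast this with a sequence surrogate: every loader must first look up (or generate and cache) the sequence value for each arriving key, which is exactly the cross-platform bottleneck the hash avoids.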

the point is: the original model never had these “technical performance components” anyway – so why are we still arguing over what to use?  we are losing precious, valuable time when we should instead be focused on solving business problems.

so now tell me, what’s this got to do with data as an asset?

right, back to the business.  the original data vault model is positioned to help categorize business keys – and their surrounding context – and place them into a basic, flattened hierarchy.  the original data vault model provides pure business value by demonstrating the gaps between the business perception (the way they think their business is running) and the reality (the way the data is actually captured and processed through business processes and systems).

wait a minute… you’re telling me there’s gold in the raw data?

yep yep yep… especially when you build the correct data vault model.  if you build a source system data vault model, the value of the solution drops to one tenth of one percent overall.  integrate your business keys in your hubs properly, to achieve maximum asset valuation.

what’s the best way to build a data vault model?

don’t start with the data vault model; start with an ontology of business terms.  a properly built ontology of business terms (when taken to the right level of grain) can and should identify (you guessed it!!) business keys and their relationships.  a properly built ontology can even be assigned business-estimated intrinsic value (ie: how much money would we lose if this key were blank?  how much money will it cost us to have duplicates?  how much money will it cost us if the data is incomplete or wrong?).  these valuations can be assigned (and yes, they need to be maintained through good data stewardship and data governance).  but… you can take a properly built ontology and generate the right data vault model.  yes, automation and generation.
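to show how mechanical that generation can be, here’s a toy sketch that derives hub and link definitions from a hand-written ontology of concepts and relations – every name below is invented for illustration:

```python
# a tiny ontology: business concepts mapped to their business keys,
# plus the relationships between concepts
concepts = {"Customer": "customer_number", "Contract": "contract_number"}
relations = [("Customer", "Contract")]

# each concept at the right grain becomes a hub on its business key
hubs = {f"hub_{c.lower()}": [key] for c, key in concepts.items()}

# each relationship between concepts becomes a link on the parent keys
links = {
    f"lnk_{a.lower()}_{b.lower()}": [concepts[a], concepts[b]]
    for a, b in relations
}

print(hubs)
# {'hub_customer': ['customer_number'], 'hub_contract': ['contract_number']}
print(links)
# {'lnk_customer_contract': ['customer_number', 'contract_number']}
```

a real generator would of course carry much more metadata (grain, stewardship, valuations), but the shape of the mapping – concepts to hubs, relations to links – is the point.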

this leads, ultimately, to tying direct dollar figures to the data sets through a proper ontology and data governance strategy.  this is something that one of my earliest customers, qsuper in australia, has been doing for years – all on data vault 2.0, and yes, with hash keys.  this is something that can make sense to the business.

at the end of the day, you still should be constructing some form of a business-based output layer.  should it be a business vault?  maybe – it depends on the case.  in some cases today (more and more frequently) we are constructing virtual business views (virtual marts) directly on top of the raw data vault.  and if you’re unsure about this part (performance-wise), then read up on point-in-time and bridge tables – the two most misunderstood, misused, and misapplied modeling techniques in the data vault landscape today.
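for readers unfamiliar with point-in-time tables, here’s a tiny sketch of the core idea – for each key and snapshot date, record which satellite row applies, so that “latest as of” becomes a plain equijoin instead of a correlated subquery (data and names are invented):

```python
from datetime import date

# satellite rows: (business_key, load_date, descriptive payload)
sat = [
    ("CUST-1001", date(2016, 1, 1), "Denver"),
    ("CUST-1001", date(2016, 3, 1), "Boston"),
]

def pit_entry(business_key, snapshot_date, sat_rows):
    """Build one point-in-time row: for this key and snapshot date,
    record the load date of the most recent satellite row at or before
    the snapshot. Queries can then equijoin on (key, load_date)."""
    candidates = [ld for bk, ld, _ in sat_rows
                  if bk == business_key and ld <= snapshot_date]
    if not candidates:
        return None  # key had no history yet at this snapshot
    return (business_key, snapshot_date, max(candidates))

entry = pit_entry("CUST-1001", date(2016, 2, 1), sat)
print(entry)
# ('CUST-1001', datetime.date(2016, 2, 1), datetime.date(2016, 1, 1))
```

a real point-in-time table would be materialized across all keys, all snapshot dates, and all satellites of a hub, but the per-row logic is exactly this.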

but i digress…  the summary of it all, please…

here are my points:

  • data vault modeling never was about the “source system”, never was about the sequences vs hash key battles
  • data vault modeling was, is and always will be about the business.  and if the data vault you have in place today is not currently about the business, then unfortunately you’ve hired the wrong people, and those people need to go back to school and re-learn what data vault really means.  or you’ve built the wrong solution, and you need to fix it – immediately.
  • there is value in raw data – when integrated by business keys.  think gap analysis!
  • there is a need to understand truly what a hub is and is not, truly what a link is and is not, same with the satellite.  a link can never be a hub – sorry, that’s the way it is.
  • there is a need to tie data as an asset back to the business; this is done through the business keys.
  • ontologies are a very, very important asset to the corporation when built at the enterprise level.  you must focus on ontologies while you are building the data vault solution, or the full value of the raw data vault cannot be realized.

there will be more on these subjects as we go forward.

thank-you for your consideration, as always, i’m open to your thoughts.

(c) copyright dan linstedt 2016 all rights reserved.

ps: if you want to offer a negative comment, or a differing viewpoint, that’s fine – but please don’t hide behind anonymous emails, don’t be afraid to tell the world who you really are, and take a stand by golly.


4 Responses to “#datavault Models, Business Purpose, Data as an Asset”

  1. Baraa Salkini 2017/08/18 at 6:23 pm #

    Thanks Dan, i have been working on Data Vault projects for three years, in very complex systems (not a typical HR application), and i have noticed some things. I would like to share them with you:

    * At the start, the D.V. Model looks good, but the Model starts to explode after many Changes in Source Systems, because of too many new Satellites. We still have good performance, because we use only INSERT. But the Model is getting larger and larger, and it is getting hard to keep modeling in D.V. It costs us time during the Modelling; sometimes we open discussions about, for example, splitting Satellites, building new BRIDGES/PITS tables…etc

    * While we are using D.V. to build our DataMarts and other SoftRules, i hear the following a lot from different people in different projects: “I wish that our Stage-Layer were the Core-Layer; it’s really easier to query and to understand”

    From my point of view, it is really not a bad idea to build a Core-Layer that just looks like the Sources, without modelling. i see the following advantages:

    1. We don’t need to Model the Core, and all those time-consuming discussions about different Core-Modelling issues are not needed. -> We are Faster!
    2. No Normalization in the Core! Number of Tables from Sources = Number of Tables in Core -> Queries are faster (fewer joins).
    3. When we need to understand the business and the processes behind a Source System, we usually end up needing to understand the Model of the Source System, and it can happen that we start to build queries on the Source Systems to understand the Data and the Business. Since the Core looks exactly like the Sources, this would make life easier when making Business Rules and building the DataMarts.
    4. We can still use the same techniques from D.V.: HashValues to capture the new Changes, INSERT only.
    5. Loading the data into the Core would be faster, because the number of Tables is smaller.
    6. One Generic Job that loads the data from Source to Stage, and then from Stage to Core.
    7. We don’t need to make changes to the MetaData manually -> Automation. We read the MetaData from the SourceSystems and we apply it to the Core.

    I see here that we can get faster, with more automation, and the best part is that we let our experts in Data Warehousing focus more on building and understanding the Business; they would be relieved from the Core of the Data Warehouse.

    Thanks for reading, and i would like to hear your opinion.

  2. Dan Linstedt 2017/08/22 at 4:14 pm #

    If the model is exploding, then perhaps it is not modeled correctly. It may require an assessment and refinement to correct what is happening.

    If you wish to use a fully historical persistent stage core layer, then be my guest. Just realize that the fully persistent staging core layer is lacking the following pieces: Delta tracking, Integration by business key, relationship and hierarchy data, and master data capabilities.

    I’m not here to tell you what works best for your organization, all I can do is point out the different failings and potential bottlenecks / roadblocks of proposed methods. I would suggest your organization might benefit from an on-site visit, and a full review of the solution being proposed, as well as your currently exploded Data Vault.

    It sounds to me as if what was constructed was “source system data vault models” – one per source system, rather than the actual purpose-driven Data Warehouse at the enterprise level which is the real directive of proper Data Vault Modeling.

    Hope this helps,
    Dan Linstedt

  3. John Giles 2017/10/02 at 10:16 pm #

    G’day Dan,
    I loved the posting, especially the bullet-point summary at the end. I have come across tool vendors (and Data Vault practitioners) who are very competent at loading source data into a raw Data Vault (RDV), but who miss some of your key points such as Data Vault modelling being primarily about the BUSINESS (emphasis yours). I think of the Agile Manifesto that says (for example) that working software is valued ahead of documentation. That doesn’t mean there should be no documentation. Likewise, it seems to me that the perspective of the business should be valued ahead of the way source systems hold data and Data Vault loads it. Going further, the way processes are performed around those source systems are important, but I see a danger that sometimes the processes are constrained by unwelcome limitations of source systems. So yes, I appreciate the tools that let us load raw data, but I think the business itself values being able to see its own data, in a business-centric manner.
    As you note, Dan, we must focus on the “ontologies” that represent the business. I love the Oxford Dictionary definition of ontology. After it presents the traditional “nature of being” philosophical view, it defines an ontology as being “a set of concepts and categories in a subject area or domain that shows their properties and the relations between them.” To me, the “concepts” bit of the ontology helps us find appropriate business-centred Data Vault hubs, and the “relations between them” bit guides us towards business-centric Data Vault links.
    One parting bit of light humour. I saw a T-shirt once that stated “there are 10 types of people in the world – those who understand binary, and those who don’t.” My observation is that there seem to be Data Vault practitioners who do bottom-up source-system Data Vault designs, and those who seek to do top-down Data Vault design, starting with business ontologies / concepts / relations. Personally, I feel more comfortable focussing on the business, with Data Vault as the platform to deliver against their expectations.
    Regards, John Giles.

  4. Dan Linstedt 2017/10/03 at 6:26 am #

    Hi John,

    Thank you for your insightful comments. Yes, there are several dangers: 1) business processes are broken, thus the DV model is built with a broken business process in mind, rather than the premise or expectation of how it should work. 2) Business is misunderstood, resulting in a poorly built DV model. 3) Source Systems have all kinds of limitations, resulting in “data overloading on the source”; models need to follow the business properly, not the “overloaded source structures”, and so on.

    The business definitely values being able to see business-centric data, no doubt about it. In that regard, the focus for Point-in-time, Bridge Tables, Business Vault, and Information Marts should always be the business targets… this is where we make “value” from data, or turn data into information.
    You are absolutely correct John, Ontologies are the KEY to successful Data Vault Modeling at the business level, it does guide us toward Business centric Hubs, Links, and Satellites.

    I agree with you in your observation of where to start, but remember: the Raw Data Vault at the end of the day “marries” the two together, bringing the business ontology to the proper level of grain SO that the raw data can be loaded and integrated by business key. This is what I call passive integration.

    Thank-you kindly,

    Dan Linstedt
