Nature of DV Structure

recently i received a contact inquiry from an interested individual.  the contact contained many different questions, one of which is specifically about the architecture or design side of the house – as in: are there two design patterns that are followed, or only one?  and if there are two – which one is correct?  in this entry i will address the nature of business keys and the singular design pattern of the data vault, and try to correct whatever thought processes you may have around constructing the model from the top down.

this is not something i’ve written much about, although you can catch glimpses of it in my presentations, more glimpses are available in my classes, and of course – i will elaborate at great length within the coaching area of my site (going forward).  in other words, in the coaching area, i will share with you *how* to make this work to the best benefit of the business.

the statements.

the question i received is related to the following two statements (as quoted in his words):

1. first of all, it seems non-debatable to me that the data vault pattern equals, in terms of a software artifact, a “canonical data modeling pattern”.  hubs, satellites and links are the components of the pattern, and there are more or less formal rules and degrees of freedom to instantiate the pattern.  it is like a higher, but semantically structured, normal form.

2. the pattern clearly separates ‘identity concerns’ – the hub (with its business key) – from temporal-variant and temporal-invariant aspects (different satellites), and offers an extendable means to evolve the schema by giving a uniform representation.

so the first question from the individual is: what, then, should actually be the conceptual instantiation of the hubs?

my thought process:

i’ve always been quirky and off the wall.  my wife says i never see things the same way that everyone else does – and even my dad said to me: i knew you were different when, at the age of 2 i’d ask you to point to the elephant picture, and you would point to the hippo.  i knew you understood me, because later i would see you pointing to the elephant and repeating the words.

ok, with that out of the way…  instead of providing a narrative (like usual) i’ll try to explain this in a step-wise fashion.

  1. businesses create source systems (or purchase them) in order to hold data, and operate according to their perception of how their business works.
  2. source systems get old and people move on, but the rules in the source systems usually change at a much slower rate than the business does…  so to compensate, the business overloads data values (creating what other business units might perceive as dirty data) in order to force the system to hold data it isn’t engineered for.
  3. businesses require keys to the information in all the systems, in order to identify specific sets of records.  this is where the notion of customer, product, part, employee, etc… comes into play.  the business believes they have uniqueness in their data.
  4. new systems are acquired or built to make up for gaps in the existing or old systems.  new systems often create new keys, and have new business processes that differ from the old systems.  frequently these new systems have overlap in functionality.
  5. some (not all) business units begin using the new systems, while other business units continue using the old systems.  once this happens, the business units using the new systems move or re-create their data sets in the new systems, causing: duplication, new keys for the same data, different representations,  and on and on and on.
  6. the bi systems try to reconcile these systems and end up in a pickle (with different results every time).
  7. remember: both the old and the new systems have different execution rules that currently don’t align with existing perceptions of how the business is supposed to operate.
  8. now, enter the data warehouse…  part of our job is to align the data sets across business units (lines of business), internal and external systems, and multiple business processes.  ouch that hurts!  this is probably the single toughest challenge a corporation can throw at us (and they think this stuff should be easy!)
  9. what’s the one single consistent business element that all business users access when they want to retrieve their data from their systems and fit it in to their business processes?  right!  the business key.
  10. oops, i almost forgot – on top of the business processes and the data in the systems exists a hierarchy of terminology called an ontology.  these ontologies are organized into taxonomies.  basically what i’m saying is: businesses categorize and prioritize their business terms, and they tie these to their business processes.  unfortunately many of these ontologies are not written down anywhere.  if they were, it would make the process of building a master data system 100x easier!
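
the key-duplication problem in steps 4 and 5 above can be sketched in a few lines of python.  this is purely illustrative – the system names, field names, and key values are all made up – but it shows why the business key (and not the system-generated surrogate id) is what lets you see that two systems are describing the same customer:

```python
# hypothetical extracts from an old and a new source system.
# each system generates its own surrogate id, but both carry
# the same business key (here: a customer number).
old_system = [
    {"id": 101, "cust_no": "C-1001", "name": "ACME CORP"},
    {"id": 102, "cust_no": "C-1002", "name": "GLOBEX"},
]
new_system = [
    {"id": 9001, "cust_no": "C-1001", "name": "Acme Corporation"},  # same customer, new key
    {"id": 9002, "cust_no": "C-2001", "name": "Initech"},
]

def build_hub(*systems):
    """collect the distinct business keys across all systems --
    this is the essence of a hub: one row per business key."""
    keys = set()
    for rows in systems:
        keys.update(row["cust_no"] for row in rows)
    return sorted(keys)

hub_customer = build_hub(old_system, new_system)
print(hub_customer)  # ['C-1001', 'C-1002', 'C-2001']
```

notice that “C-1001” appears in both systems under different surrogate ids – the hub collapses it to a single identity, which is exactly the integration point everything else hangs on.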

ok, now you understand the background – and i hope this has given you a bit of insight as to how to approach modeling business keys in the data vault, or even the data vault itself.  it’s a necessary process – of course, however it gets deeper from here.  those of you not wanting a challenge, or wishing to stick your heads in the sand and still claim that “building a data warehouse is easy”, should stop reading now and go back to your easy-chair.

the next real question in integration architecture is…

really a series of questions that all have the same answer (unfortunately or fortunately depending on how you look at it).  this is really why if you ask a true bi / edw consultant a question, the answer most often is: “it depends.”  well, for the following questions, the answer is singular in nature (due to the mathematics of the problem, and the representation of set logic, and the implications of mpp scalability).

the questions are:

  1. how do i effectively tie the current perception of the executing business process to the actual data process in the source system?  how do i establish the gap between the two?
  2. how do i watch, monitor, and view the data that is used in the business process as it flows from one process to the next?
  3. how do i track the data that is used by the business external to the system that it is stored in?  there are plenty of manual processes that take place beyond system boundaries that alter data sets that we don’t see inside the capture systems.
  4. how do i tie the same data set to the business processes and system processes that are occurring?
  5. how do i map similar data sets across multiple systems yet maintain auditability of each set of information?
  6. what is the one mostly consistent piece of business terminology that is used to represent data sets in the business users’ world?
  7. how do i represent this information in an ontology that is understandable by the business at the end of the day?

and of course, then there’s the element of: how do i track all these sets of data over time?  and of course, all the technical questions like scalability, auditability, maintainability, simplicity, performance, etc…  we deal with a lot of issues – and quite frankly, our knowledge is often taken for granted…  anyhow, back to the point.

the answer to all of these questions is: the business key.  the business key is a representation of a single data element that is captured, stored, used, applied, printed, memorized, and searched on across all these situations.  it is a hard data element that should not change (but often does, because the business doesn’t understand that they hemorrhage money when they do change it).  it is an element that should be consistent throughout systems, across business processes, and everywhere else.

it is a conceptual representation that also happens to have a physical instantiation.  it ties the conceptual world to the physical world – this is why the data vault model is based on business keys, i.e.: hubs.  it allows the model to follow the structures, processes, systems, and “perceived business rules” that the business users are under the impression they are using.  it allows us to find and measure the gap between that perception and the reality of the data captured and held in the systems.

it allows us to trace the data flow, and measure the time it takes for critical path information to make it through the business.  the sooner the business realizes that this is a serious level of power,  the sooner they will begin to understand that this modeling technique can save the business millions of dollars and thousands of hours in integration, and consolidation, and other elements.

ok – off the soap box.  let me answer the original question(s) above.

quote from the inquiry:

obviously, a dv can be conceptually instantiated in at least two major ways:

– source-affine, meaning that the dv gets instantiated along the native concepts (=hubs) of the sources; if it’s siebel crm, then the siebel model will show the way, etc… i don’t want to elaborate on this as it is not *my case*; it has, however, limitations and problems.

– edm-affine (edm=enterprise data model), meaning that the dv gets instantiated across a neutral, business-oriented representation (enterprise data model) of the *relevant* part of the company.

the problem domain of the edm is obviously the “business activities” (processes, events) that define the business, and of course their “context” (involved parties, products, etc. = “master and reference data” in the solution space).  now if one would like to build a dwh based on this edm pattern, which is *my case*, then a few essential aspects have to be mentioned: 1. first and most important, the edm must be projected onto the data offering space; it doesn’t make any sense to specify a cube with facts and dimensions that cannot be fulfilled from the sources.

ok, here’s the long and short of it:  the person asking the question is really struggling with this: what’s the best modeling approach (using data vault) to help identify and close the gap between the systems and the business?

do you see where i got that from?  it’s a gap analysis question.  the point is: if your systems are truly out of alignment from your business processes, then you will find the statements in the quote above to lead you to the same question.  “do i use the source systems to model, or the business processes?”

enough already, i hope i’ve given you some insight into the background of business keys and data vaults.  so here’s my answer, and at the end of the day, you need to decide what works for you and follow it.  why?  because the dv model is flexible and fluid – if you build it right, you will never be completely finished with the model; you should be changing it as you learn more, and as you close additional gaps between business, systems, and data sets.

conclusions and thoughts:

my answer is (obviously): it depends.  my preference is: always start with the business.

the data vault is supposed to be: of the business, by the business, and for the business.  there isn’t supposed to be anything in the data vault that is not business driven.   if it doesn’t have business value, don’t model it in the data vault – it should literally be chopped/removed from the source systems.

now, that said: here’s the flip side of this point:

if the source systems were too far out of sync with the business processes, then the business would be losing so much money that they might go out of business, or replace the source systems and applications entirely.  therefore, when worst comes to worst and you don’t have time to learn the business processes, or can’t find business users willing to give you the time of day (that’s a real shame) – then sometimes you have to resort to what you have…  which of course is the data in the source systems, and hardcore “data profiling” to help give you some basic foundational footsteps on which to begin building and producing results, right?

i know which side of the bread is buttered…   i know where my paycheck comes from, and i understand the value of releasing something quickly to the executive staff.

here’s the magic folks in case you may have missed it:

  1. the data vault model is based on raw data – auditability
  2. the hub construct is based on the foundations of business keys – traceability/gap analysis
  3. the link construct is based on the notions of change – flexibility / mpp scalability
  4. the satellite construct is based on data changing over time – audit-trails, massive disparate systems/gap analysis
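
those four points can also be sketched structurally.  here’s a minimal, hypothetical python sketch (in a real data vault these are tables, not classes, and the field names here are my own shorthand – not prescribed names):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Hub:
    # identity only: one row per business key, no descriptive data
    business_key: str
    load_date: datetime
    record_source: str        # auditability: where this key was first seen

@dataclass
class Link:
    # a relationship between two or more hub keys; relationships can
    # appear and disappear without restructuring the hubs themselves
    hub_keys: tuple
    load_date: datetime
    record_source: str

@dataclass
class Satellite:
    # descriptive attributes hanging off a hub (or link), versioned by
    # load_date -- change over time becomes an audit trail of rows
    parent_key: str
    load_date: datetime
    record_source: str
    attributes: dict

# one hub key, two satellite rows = the history of that customer over time
cust = Hub("C-1001", datetime(2011, 5, 1), "old_crm")
sat_v1 = Satellite("C-1001", datetime(2011, 5, 1), "old_crm", {"name": "ACME CORP"})
sat_v2 = Satellite("C-1001", datetime(2011, 6, 1), "new_crm", {"name": "Acme Corporation"})
```

the separation is the point: keys shouldn’t change (hubs), relationships change occasionally (links), and descriptive data changes constantly (satellites) – each construct can grow independently, which is where the flexibility and scalability come from.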

so, can you start with one or more source systems and their data?  yes!  can you start with the business processes?  yes!  can two different teams start with different systems or even different approaches, and still link the data together at a later date?  yes!

so you see, it depends.  if you missed the answer, go back and re-read my answer above: it depends – and my preference is always to start with the business.

oh yes, i almost forgot:  the data vault should not be seen or used directly by business users – so any questions of “interpretation” or “understanding” should be moved downstream to “data marts” – including the business vault.

do you have other thoughts?  opinions?  don’t be a stranger, i want to hear from you!  register for free and leave a couple of comments…  tell me what you’ve found, what you’ve experienced, and what your pros and cons are of each approach.

dan l
ps: by the way, you can get lessons on the how/why of these things inside my online training:
