Conceptual, Canonical, and Raw Data Integration

I received an email with many questions at the conceptual level of integration.  It is (to say the least) a very interesting email, raising many points that I discuss in this entry.  Below, I’ve posted the original statements made by the person sending me the email, along with my responses.  I find it a very interesting discussion, and hope that you will add your comments at the end of this post.

The Email:

Message: Dear Mr. Linstedt,
I work as a data architect and some time ago started an analysis of the Data Vault modeling pattern and the new EDW architecture; I also purchased the first edition of your book, but now I regret not having waited for the second one 🙂

So I am sorry for disturbing you this way; I just wondered whether you might be interested in a short dialogue concerning a challenging, probably not very specific, data integration aspect that seems to conflict *somehow* with the new EDW/DV architecture rules. Here it is; please feel free to decide whether you want to involve yourself in this discussion:

First of all, the thoughts I would like to share with you come from a concrete project in which we are trying to re-engineer our complex data warehouse landscape. So it is real-life stuff, more or less. And sorry for my verbose saga; I hope you find it interesting as well. I will now try to organize my discourse around what I have understood so far of the EDW/DV architecture and what I have distilled from it, 'projected' onto my specific concern:

1. First of all, it seems non-debatable to me that, in terms of a software artifact, the Data Vault pattern equals a “canonical data modeling pattern”. Hubs, Satellites, and Links are the components of the pattern, and there are more or less formal rules and degrees of freedom for instantiating it. It is like a higher, BUT semantically STRUCTURED, normal form.

2. The pattern clearly separates ‘identity concerns’, the Hub (with its business key), from temporal-variant and temporal-invariant aspects (different Satellites) and offers an extendable means to evolve the schema by giving a uniform representation. So far, so good. But a subtle problem already arises, which I didn’t find very well articulated anywhere, so here is my current position: What SHOULD actually BE the CONCEPTUAL instantiation of the HUBs?

Obviously, a DV can be CONCEPTUALLY instantiated in at least two major ways:

– Source-affine, meaning that the DV gets instantiated along the native concepts (= HUBs) of the sources; if it’s Siebel CRM, then the Siebel model will show the way, etc. I don’t want to elaborate on this, as it is not *my case*; it does, however, have limitations and problems.

– EDM-affine (EDM = Enterprise Data Model), meaning that the DV gets instantiated across a neutral, business-oriented representation (the Enterprise Data Model) of the *relevant* part of the company.

The problem domain of the EDM is obviously the “business activities” (processes, events) that define the business and, of course, their “context” (involved parties, products, etc. = “master and reference data” in the solution space).

Now if one would like to build a DWH based on this EDM pattern, which is *my case*, then a few essential aspects have to be mentioned:

1. First and most important, the EDM must be projected onto the data offering space; it doesn’t make any sense to specify a cube with facts and dimensions that cannot be fulfilled from the sources.

This data integration pattern is known as Local-as-View (to contrast it with Global-as-View). Local-as-View expresses a source data model as a view over the EDM, and not the other way around, as GaV does.

There are some mixed forms (GLAV), but that is not so relevant for my question. So, the bottom line: the EDM is a conceptualization of the needed data offering (the sources), but in a business-oriented way (the concepts the business wants to analyse and ‘see’ on the screen).

Btw, the EDM is also the language of requirements, and it spares the business people from having to ‘conceptually understand’ all the source data models. That is the same as saying there MUST always be a mapping (a “T”) from the conceptual schema of each source to the EDM.

2. As the EDM equals “business” for the business people, it will be visible in their analytics artifacts (cubes, reports, etc.). But where should the EDM manifest in the DWH solution space? Obviously, it has to be the basis for the ‘presentation area’ (here in the Kimball sense), so manifesting as dimensions and facts (star schemas, for example, or MOLAP cubes).

But in the presence of an EDW layer (3NF or DV), should it also manifest in this layer?
I believe YES!
Especially in the DV architecture, the key concepts of the EDM will ideally instantiate the HUBs, imho.

Now you might ask yourself: where is the conflict? There are two issues, in my opinion:
[First, an excursus: it took me some time to distill what the new DV/EDW “Big T” is all about, and I ended up identifying the T as the ‘structure’ transformation to a constellation of star schemas; nothing conceptual happens here, it’s purely syntax-oriented (from DV to STAR) – I hope my assumption is not false.]

First, depending on the conceptual gap between the source schemas and the EDM, a ‘big T’ might be needed. This is under the essential assumption that the business keys of the DV get instantiated from the EDM – which is, imo, very interesting, as it is (or could be) the Data Vault’s MAJOR data integration task: accomplishing the transition from the conceptual schema of the source to the EDM, BUT WITHOUT MODIFYING THE DATA itself! To what extent this might be an issue I didn’t analyze, but if we did have an issue, it would be a no-no with the DV architecture, meaning that the EDM could not be instantiated into it; however, I assume it goes well most of the time.

So, the bottom line: here we have another “Big T” that has to happen when instantiating the Data Vault in a business-oriented way. If you find my thoughts legitimate, then please excuse me if this was obvious; to the best of my knowledge, I haven’t found it clearly articulated anywhere so far. But things get even more complicated.

There are business events the business people want to analyse that are not captured as such at the source! Think, for example, of the booking of a new optional price plan within the context of a wireless service agreement. Typically the CRM system records this as a ‘state change’ of the agreement object cluster (typically with effective dates).

So there must be a place/layer that ‘derives’ the “option booking event” from the old and new states of the agreement objects. The DV would have recorded the old state and, with the last ETL cycle, the new one as well, but now derivation logic must create a ‘derived’ HUB/Link for the “booking option” event.

This “T” here might also not be really small, but if you get it done, you are on the royal road, in my opinion. Caveat: when instantiating the DV with an EDM pattern reflecting the business activities of interest to the business analysts, and in the presence of a conceptual but closable gap between the sources and the DWH (EDM), there might be another “big T” within the DV architecture.

This might even require an instantiation in two phases for deriving the business events through ‘state difference’ derivation logic (I tacitly assume here that the ETL doesn’t access the warehouse when doing T and L, but of course this might be a simplifying assumption; I would try to do it in more stages).

Dan, thanks for the patience of reading my saga, and please excuse me for contacting you directly this way; my hope is that you will find these issues as challenging as I find them and maybe come back with your opinion on these thoughts. In the end, it is all about not using the DV in an orthodox way, as my described case probably shows.

My Reply:

Hi,

I am sorry it has taken me so long to respond.  I’m just now getting back to emails from a couple of months ago, and noticed that we had not had the chance to discuss the email that you put together for me.  Let me see if I can clarify a few points for the sake of discussion.

First of all: let me state that _everything_ I’ve built, architected, designed, and written has been fully tested and implemented in US government agencies beginning in 1993.  Nothing I propose is purely “hypothetical”, nothing I designed is “theoretical only” – everything I’ve built is also very concrete, with specific needs that forced the rules and standards to appear out of necessity.

1. You stated: “DV Pattern = canonical data modeling pattern”.  You are correct.  It is a higher, but semantically structured, normal form.  It draws its structure from a distilled business-term ontology (if you can call it that), where the business keys are the most important “subject”, crossing functional lines of business.

2. Yes, the pattern separates “identity concerns” from “relationship concerns” from “context (descriptive concerns over time)”.  The actual conceptual instantiation of the HUBS should be a single business key.  In the theoretical world, the only thing we “truly” care about is a unique list of business keys.  However, when we reach the physical world, we need auditability and temporal knowledge (hence the record source and the original load date into the warehouse).  We also (unfortunately) need a surrogate key in order for the database engines to maintain join performance.

The second “conceptual pattern” is the relationship, which in the conceptual model would consist ONLY of the business keys.  However, once again, in the physical sense we replace those business keys with the “surrogate representation”.  Furthermore, we ALSO need the “record source” for auditability, and the “load-date” stamp to know the FIRST TIME we see the information arrive within our warehouse.  Ultimately a “good database engine for data warehousing” should automatically cover these things internally, but today’s engines don’t.

The third “conceptual pattern” is the descriptive context, and these are the Satellites.  These are the data over time.  Ultimately they would contain the “business key” or the “set of business keys/relationship” that they describe, and the temporal aspect under which they fall would essentially be maintained internally by (again) the database engine; however, the database engines are lacking today, and don’t provide us with the record source, nor the temporality that is necessary to maintain auditability and source-of-record.
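
To make these three patterns concrete, here is a minimal sketch (Python, purely for illustration) of what a Hub, Link, and Satellite row might carry.  The field names (hub_key, business_key, record_source, load_date, and so on) are assumptions chosen for the example, not prescribed column names.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HubRow:
    """Identity concern: one row per unique business key."""
    hub_key: int            # surrogate key (kept only for join performance)
    business_key: str       # the business key, e.g. a customer number
    record_source: str      # auditability: where the key was first seen
    load_date: datetime     # first time the key arrived in the warehouse

@dataclass(frozen=True)
class LinkRow:
    """Relationship concern: an association of two or more business keys,
    carried physically as their surrogate representations."""
    link_key: int
    hub_keys: tuple         # surrogate keys of the participating hubs
    record_source: str
    load_date: datetime

@dataclass(frozen=True)
class SatelliteRow:
    """Descriptive context over time, attached to one hub or link key."""
    parent_key: int         # the hub_key or link_key being described
    load_date: datetime     # start of this version's validity
    record_source: str
    attributes: dict        # raw descriptive columns from the source, unmodified
```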

Source-affine is NOT recommended.  I see this as a fall-back, and unfortunately it is common for people to build this way either because they are lazy (don’t want to do the hard work of understanding the business), OR because the source system’s model is of such poor quality that alignment is difficult, OR because the source data is so poorly documented (or the quality of the data is poor, with missing business keys, etc.).  So there are any number of reasons people choose this approach.  But I don’t like it, nor do I recommend it.

EDM-affine.  Most definitely.  This is the approach I generally choose to take; however, what ends up happening is a 50/50 model, where you do your best to align data sets according to the EDM efforts, but the other 50% of the model becomes source-aligned.

Now, let’s make the following statement:  The alignment of the model along the “affine” choice has more to do with the architecture and the modeling effort than it does with the data set, because along the way, in order to maintain auditability, we MUST be able to reconstruct the “source system data set” as it stood at a given point in time.  This is where the divergence begins to take place (or the questions, as you so eloquently point out).

Let me make one more statement: It is because of this that the “Data Vault Model & Methodology” are bound to be raw data, yet passively integrated by business key, in order to remain auditable.  It is because of this that 80% of the “work” involved in building the model is understanding the business keys, and how they flow through the business.  It is because of this that the Data Vault result is a RAW Data Vault, and is NOT geared to be a “master data solution”, nor is it intended to be a “golden copy” – in other words, the Data Vault is intended to be: “AN INTEGRATED VERSION OF THE FACTS FOR A SPECIFIC POINT IN TIME”.  The Data Vault is NOT a single-version-of-the-truth.  Interpretation of the facts is left for “downstream” processing (going FROM the Data Vault TO the Data Marts, or Master Data System).

1. “First and most important, the EDM must be projected onto the data offering space; it doesn’t make any sense to specify a cube with facts and dimensions that cannot be fulfilled from the sources…”  EXACTLY – which is why, in the Data Vault Methodology, we separate the FACTS from the INTERPRETATION of those facts.  The interpretation (the polarized lens, if you will) is left to the result of loading data marts or cubes downstream of the Raw Data Warehouse.

Regarding “Local-as” vs. “Global-as” views, there is a mix here – an advance, if you will.  The Data Vault modeling techniques offer “Global-as” passive integration, by business key, WITHOUT changing the raw data sets.  You can then get either “Global-as” data marts or “Local-as” data marts downstream of the Data Vault itself.  The business key is the TIE to everything.
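
As a rough sketch of what “passive integration by business key, without changing the raw data” could look like in code, here is a hypothetical little-T loader.  The function name load_customer_feed, the customer_number field, and the source names are assumptions for the example only.

```python
from datetime import datetime

hub_customer = {}                     # business key -> surrogate hub key
satellites = []                       # raw descriptive rows, one per source row per load
_next_key = iter(range(1, 1_000_000))

def load_customer_feed(rows, record_source):
    """Little T: integrate by business key, keep the raw attributes untouched."""
    load_date = datetime.now()
    for row in rows:
        bk = row["customer_number"]             # business key shared across sources
        if bk not in hub_customer:              # passive integration: same key, one hub entry
            hub_customer[bk] = next(_next_key)
        satellites.append({
            "hub_key": hub_customer[bk],
            "record_source": record_source,     # auditability
            "load_date": load_date,
            "attributes": dict(row),            # raw data, no interpretation applied
        })

# Two systems describing the same customer land under one hub key,
# while each system's raw context is preserved verbatim in its own rows.
load_customer_feed([{"customer_number": "C-1001", "gender": "M"}], "SiebelCRM")
load_customer_feed([{"customer_number": "C-1001", "gender_id": 2}], "Billing")
```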

I disagree with your statements around GaV and LaV being independent…  The Data Vault Model & Methodology combines the two aspects, tying them together by existing business key definitions.  It’s a hybrid at this point, but ONLY at this point (tied by business key).  The SOURCE of the business key happens to be raw data, which ties it back to system-driven availability.

What you seem to be missing is the separation of storing and integrating data by business key FROM the usage, application, or interpretation of the data by business users.  To me, these are two distinct activities.  The Raw Data Vault handles the former, while the Data Marts downstream align to “business usage and functionality” (the latter).  THIS is the sheer power of the Data Vault Model.

In my mind, there isn’t a single answer; there are two answers to this problem.  The conceptual definition of a business key often differs from its applied usage.  Furthermore, the applied usage of business keys often “changes” depending on the hierarchy of business responsibility in which it lives.  (i.e., a customer number can mean different things to different people, but ultimately the KEY used to identify a single customer should be consistent across the entire business.)

“There must be a mapping between the conceptual schema and the source schema.”  Agreed…  The Data Vault Model is an intermediate step (it’s not a source schema per se, but it’s not a conceptual-only schema either) – it’s halfway between.  I argue that there are two parts to the “T”, as you put it: one that worries about arrival timing, latency, and sourcing the data, and another that worries about application of the data set, functionality, and business alignment.  What I’ve done is split the two up, so that the warehouse model handles passive integration and passive alignment (it doesn’t change the raw data but still achieves partial integration), and the downstream process building the data marts provides the remaining business alignment and adjustments.

2. “But in the presence of an EDW layer (3NF or DV), should it also manifest in this layer?”  NO.  Absolutely not.  I have been in way too many situations and seen too many projects go down the tubes (because of constant re-engineering) caused by combining these layers.  Yes, the key concepts of the EDM instantiate the Hubs, BUT the APPLICATION and INTERPRETATION of the data are left for downstream, and should have NO impact on the model whatsoever.  This way the model can remain fluid, adaptable, and flexible to change going forward, WITHOUT the rising cost of maintaining complexity in “basic integration”.

Big “T”, as you call it, has two fundamental steps: Integration and Application – I’ve split these apart so that maintenance and auditability can co-exist, along with ease of use and flexibility, AND so that application/interpretation can change (at the speed of business change) on the way out, WITHOUT disturbing, losing, or changing any of the auditable history that SHOULD be stored in the Raw Data Vault (EDW).

Now, is this a major data integration task?  Yes and no.  The Data Vault Model is meant to be fluid and flexible – and to allow the designer to reset a finite number of tables without cascading change impacts (based on what they learn along the way).  It is a top-down architecture with a bottom-up design principle.  Build what you know TODAY, and change the model as you learn things tomorrow.  The MAJOR data integration work is a people-based process: UNDERSTANDING THE COMMON BUSINESS KEYS.  And of course, we do NOT allow modified data into the Raw Data Vault at any time.  This is not an issue; in fact, just the opposite: it is a requirement for being accountable and auditable as a system of record.  It is the only place in the organization to house this “integrated view of raw data aligned with business keys”.

I never claimed that the entire EDM can be instantiated into the Data Vault – nor should this claim be made.  Again, only PART of the EDM is represented: the hierarchy itself (relationships), and the business keys themselves (entries in the hierarchy).  The rest of the EDM (the application and definition of what to do with the data, and how it’s used) is left for downstream processing when building data marts.  THIS is the true power behind this methodology, sort of like “separation of church and state”, or “checks and balances”.

Now, what this does to the “BIG T” is break it into two pieces: the “easy or little T” (raw data loads and passive integration) and the “BIG T” (business rules and data alignment).  We do not have two “big T” processes to accomplish, only one (as usual), and it becomes easier and more agile to produce data marts downstream because the sourcing issues and arrival-latency problems have already been solved.
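
A hedged sketch of that split, continuing the hypothetical structures above: the Big T applies business rules (here a made-up gender-code alignment) only while projecting the raw Satellite rows into a mart dimension, so nothing stored in the Raw Data Vault is ever modified.

```python
# Hypothetical business rule applied on the way OUT of the Raw Data Vault.
GENDER_RULE = {("SiebelCRM", "M"): "Male", ("SiebelCRM", "F"): "Female",
               ("Billing", 1): "Female", ("Billing", 2): "Male"}

def build_customer_dimension(hub_customer, satellites):
    """Big T: interpretation and alignment happen downstream, in the mart load."""
    key_to_bk = {key: bk for bk, key in hub_customer.items()}   # surrogate -> business key
    dim = {}
    for sat in sorted(satellites, key=lambda s: s["load_date"]):
        raw = sat["attributes"]
        raw_gender = raw.get("gender", raw.get("gender_id"))
        dim[sat["hub_key"]] = {                # last loaded row wins, to keep the sketch short
            "customer_number": key_to_bk[sat["hub_key"]],
            "gender": GENDER_RULE.get((sat["record_source"], raw_gender), "Unknown"),
        }
    return dim
```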

Now, you make reference to “data vault in a business-oriented way” – this can happen, BUT you must have a Raw Data Vault in place before it does.  Most of the time, “the business interpretation” of the data is left to downstream Data Marts – modeled any way you see fit (from cube to flat-wide to third normal form to star schema…); it is all viable.

Things actually get less complicated here…  The business NEEDS and wants to see the “gaps” between its business perception and what its sources are collecting.  Without RAW data with historical patterns integrated by business key, this is nearly impossible to uncover.  The business also needs and wants to see the “gaps” between its applied business rules (downstream marts) and its auditability (or lack thereof), and thus requires the RAW data to be bounced against the business rules and data marts in order to understand these.

There is a place/layer to derive the “option booking event” – that is in the Big T logic going from the Raw Data Vault downstream to the Data Marts.  I do not typically recommend the Data Vault model for use with “derived business data” – that job falls on data marts, and star-schema models are much more suited to business user access than Data Vault.  The Data Vault model is suited for passive integration, scalability, flexibility, and agility (of the team members).
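
As a sketch of what that derivation might look like inside the Big T, the hypothetical function below compares consecutive raw Satellite states of one agreement and emits an “option booked” event whenever a price plan appears that was not present in the prior state.  The names (agreement_key, price_plans, OPTION_BOOKED) are illustrative assumptions.

```python
def derive_option_booking_events(agreement_states):
    """Compare consecutive raw satellite states of an agreement and derive the
    'option booked' business events the source never recorded as such."""
    events = []
    ordered = sorted(agreement_states, key=lambda s: s["load_date"])
    for previous, current in zip(ordered, ordered[1:]):
        for plan in set(current["price_plans"]) - set(previous["price_plans"]):
            events.append({
                "agreement_key": current["agreement_key"],
                "event": "OPTION_BOOKED",
                "price_plan": plan,
                "effective_date": current["load_date"],
            })
    return events

# The CRM only stored the before/after state of the agreement; the derived
# event feeds a fact table downstream, not the Raw Data Vault itself.
states = [
    {"agreement_key": 42, "load_date": "2011-05-01", "price_plans": ["BASE"]},
    {"agreement_key": 42, "load_date": "2011-06-01", "price_plans": ["BASE", "INTL_ROAMING"]},
]
print(derive_option_booking_events(states))
```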

“Caveat: when instantiating the DV with an EDM pattern reflecting business activities of interest for the business analysts, and in the presence of a conceptual but closeable gap between the sources and the DWH (EDM), then there might be another “big T” within the DV architecture” – no, not true.  What was missed is the subtlety here – we split the work: divide and conquer.  Raw passive integration (little T going INTO the Data Vault), and Big T (business rules & application of the data) coming out of the Data Vault on the way to the data marts.

I hope this clears things up, but if it raises more questions than it answers, perhaps we should have a web-conference.

Thank-you kindly,
Dan Linstedt
PS: As always, I’d love to hear your feedback – please comment on the end of this blog entry.


2 Responses to “Conceptual, Canonical, and Raw Data Integration”

  1. Joey 2011/06/07 at 5:19 pm #

    Hello Dan,

    I liked this particular post because I think it mentions some points that a lot of people are wrestling with to get right. For me, this part of the DV methodology was already clear. One topic, the EDM semantic model in relation to the Data Vault, is still in the “gray area”.

    Getting a good intermediate DV with an EDM is doable as long as you have one authentic source for each EDM “entity” (for example, a customer). Things get much harder if you have two, or even 5 or 6, systems that do not share a customer base but each have one of their own, and everything comes together in the EDW. And yes, I know that in this case you may have a master data problem 🙂 but let’s assume we are not able to fix that.

    In the post you say that the model has a small “T” up front. This sounds logical, but in the case of multiple (different) sources for the same entity (or entities) that are not shared, how does the Data Vault handle that when the semantic layer does not fit the source data? Conceptually, for the business, it’s the same, but the data in the systems is very different.

    When we must deal with multiple source systems, it is clear that the Data Vault supports this by integrating on the business keys. Things must get very messy when integration on the business keys can’t be done; I mean, it is only one attribute 🙂 (but I guess you have seen enough of those?!)
    In my opinion, the gray area still exists around integration on the Links and Sats. What about this? When do we integrate them, and when not? I must say it is not so clear which rules must be applied to stay compliant with DV.

    For example, you get customers from 5 different systems and the business wants to report on all of them. They are all identified by a numeric number and are semantically on the same aggregation level (by accident). Do we combine them in the hub? Yes, right? Integrating on the business keys. But what about the Link and the Sats? Some attributes are common, like first and last name, and could be in one Sat structure, but some attributes are available in source X and not in source system Y, or attributes have a data-type mismatch. Or one system has a bad data model and, for example, stores gender as a code like “M” and “F”, while other systems have a reference data table for it and store it with an internal “ID”.  Or one of the systems has only one address line where street, postal code, and the rest of the address are combined, while the other systems have them in separate fields. This is a simple situation, but things get harder in the case of source systems that store history of changes to the reference data.  The small “T” is getting bigger and bigger.

    The simple answer to this question could be: if the “shoe does not fit”, generate a separate structure of Links and Sats.  This will generate a lot of overhead, does not feel right, and will make the “T” downstream even harder. But okay, it is possible. These kinds of situations get even more painful when source systems change attributes that first appeared to be generic, but because of the change are no longer compatible with the EDM-semantic Data Vault model.

    I would be very pleased if you could shine a light on this subject in relation to this post about the semantic model.

    Thanks in advance,
    Joey.

  2. dlinstedt 2011/06/07 at 9:13 pm #

    Hi Joey,

    I appreciate your feedback, and thank you for taking the time to read through this lengthy post. Please be aware that I cover some of these topics in the Certified Data Vault Modeler class. When I talk about integrating systems, I come from a background where I usually have to integrate 10, 15, or even 30 different systems across multiple lines of business – along with internal and external data. Most of the questions you ask are focused on the implementation side of the house, and these types of questions will be answered on my new e-learning platform (yet to be announced), where you will have access to implementation classes that discuss these topics in detail.

    There is a lot of grey area out there, as I have not had time to write all the articles, nor the implementation book, around these topics – but I also hope to offer in-person advanced classes to discuss implementation specifics. These will be provided here in St. Albans, Vermont (north of Boston, Mass., USA). Links and Sats are integrated based on additional rules and standards, and yes, you are correct – I’ve not yet had the time to write enough about this subject.

    One quick and short answer is to say: each source has its own Satellite of attributes (this is a best practice for a number of reasons). However, again, this is a complex topic (as to why this is the case, and WHEN to choose this option over others) – these things really need to be covered in person to understand the depth required for making such decisions.
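
    To illustrate the satellite-per-source idea with a minimal sketch (the layouts below are assumptions, not a prescription): each system’s attributes keep their own shape in their own Satellite, keyed by the shared hub key, so coding and type mismatches like the ones described above never have to be reconciled at load time.

```python
# One hub, several source-specific satellites: each source keeps its own
# attribute shape, so mismatched types and codes never collide at load time.
hub_customer = {"C-1001": 1}   # business key -> surrogate hub key

sat_customer_siebel = [
    {"hub_key": 1, "load_date": "2011-06-07", "gender": "M",           # code values
     "address": "12 Main St, 05478 St. Albans"},                       # single address line
]
sat_customer_billing = [
    {"hub_key": 1, "load_date": "2011-06-07", "gender_id": 2,          # reference-table id
     "street": "12 Main St", "postal": "05478", "city": "St. Albans"},
]

# Downstream (Big T), the satellites are joined on hub_key and the business
# rules decide how to conform gender codes, address formats, and so on.
```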

    One thing that is absolutely certain: what I teach about how to implement keeps the small “T” … small. It does not grow in size or complexity based on the number of sources or the integration effort needed to load the data in. This is a function of the standards and rules that define the loading capacities…. No, the small T stays small; the Big T will grow, shift, shrink, and change (flex) as the business does.

    Notice I did not necessarily recommend this approach for “everything”, because that is typically not the way to articulate design patterns. There is no single simple answer, and this is precisely why I am typically hired as a guide to the process, as well as why smart individuals like you find themselves enrolled in my classes. Unfortunately, to shed more light on this topic would require a deep dive into your specific situations. Implementation is all about “making it work, while keeping the complexity down and sticking to the best practices (where possible)”. It’s not always a cut-and-dried answer that will fit every situation.

    I will take these questions into consideration for future posts.

    I hope this helps,
    Dan Linstedt
    PS: You can learn more about the DV Modeling architecture by going to http://LearnDataVault.com
    or by contacting me about training classes and on-line e-learning.
