i received an email with many questions at the conceptual level of integration. it is (to say the least) a very interesting email, with many points that i discuss in this entry. below, i’ve posted the original statements made by the person who emailed me, along with my formal responses. i find it a very interesting discussion, and hope that you will add your comments at the end of this post.
message: dear mr. linstedt,
i work as a data architect, and some time ago i started an analysis of the data vault modeling pattern and the new edw architecture; i also purchased the first edition of your book, but now i regret not having waited for the second one 🙂
so i am sorry for disturbing you this way; i just wondered if you might be interested in a short dialogue concerning a challenging, probably not very specific data integration aspect that seems to conflict *somehow* with the new edw/dv architecture rules. here it is; please feel free to decide whether you want to involve yourself in this discussion:
first of all, the thoughts i would like to share with you come from a concrete project where we are trying to re-engineer our complex data warehouse landscape. so it is real-life stuff, more or less. and sorry for my verbose saga; i hope you find it interesting too. i will now try to guide my discourse through what i have understood so far of the edw/dv architecture and what i distilled from it, ‘projected’ onto my specific concern:
1. first of all, it seems non-debatable to me that the data vault pattern is, in terms of software artifacts, a “canonical data modeling pattern”. hubs, satellites and links are the components of the pattern, and there are more or less formal rules and degrees of freedom for instantiating the pattern. it is like a higher, but semantically structured, normal form.
2. the pattern clearly separates ‘identity concerns’ (the hub with its business key) from temporal-variant and temporal-invariant aspects (different satellites), and offers an extensible means to evolve the schema by giving a uniform representation. so far, so good. but a subtle problem already arises, which i haven’t found well articulated anywhere, so here is my current position: what should the conceptual instantiation of the hubs actually be?
obviously, a dv can be conceptually instantiated in at least two major ways:
– source-affine, meaning that the dv gets instantiated along the native concepts (= hubs) of the sources; if it’s siebel crm, then the siebel model will show the way, etc. i don’t want to elaborate on this, as it is not *my case*; it has, however, limitations and problems.
– edm-affine (edm = enterprise data model), meaning that the dv gets instantiated across a neutral, business-oriented representation (enterprise data model) of the *relevant* part of the company.
the problem domain of the edm is obviously the “business activities” (processes, events) that define the business, and of course their “context” (involved parties, products, etc. = “master and reference data” in the solution space).
now if one would like to build a dwh based on this edm pattern, which is *my case*, then a few essential aspects have to be mentioned:
1. first and most important, the edm must be projected onto the data offering space; it doesn’t make any sense to specify a cube with facts and dimensions that cannot be fulfilled from the sources.
this data integration pattern is known as local-as-view (to contrast it with global-as-view). local-as-view expresses a source data model as a view over the edm, and not the other way around, as gav does.
there are some mixes (glav), but they are not so relevant to my question. so, the bottom line: the edm is a conceptualization of the needed data offering (sources), but in a business-oriented way (the concepts the business wants to analyse and ‘see’ on the screen).
btw, the edm is also the language of requirements, and it spares the business people from ‘conceptually understanding’ all the source data models. this is the same as saying that there must always be a mapping (a “t”) from the conceptual schema of the source to the edm.
2. as the edm equals “business” for the business people, it will be visible in their analytics artifacts (cubes, reports, etc.). but where should the edm manifest in the dwh solution space? obviously, it has to be the basis for the ‘presentation area’ (here in the kimball sense), so manifesting in dimensions and facts (star schemas, for example, or molap cubes).
but in the presence of an edw layer (3nf or dv), should it also manifest in this layer?
i believe yes!
especially in the dv architecture, the key concepts of the edm will ideally instantiate the hubs, imho.
now you might ask yourself: where is the conflict? there are two issues, in my opinion:
[first an excursus: it took me some time to distill what the new dv/edw “big t” is all about, and i ended up identifying the t as the ‘structure’ transformation to a constellation of star schemas; nothing conceptual happens here, it’s purely syntax-oriented (from dv to star). i hope my assumption is not false]
– first, depending on the conceptual gap between the source schemas and the edm, a ‘big t’ might be needed, under the essential assumption that the business keys of the dv get instantiated from the edm. this is imo very interesting, as it is (or could be) the data vault’s major data integration task: making the transition from the conceptual schema of the source to the edm, but without modifying the data itself! to what extent this might be an issue, i didn’t analyze; but if we did have an issue, then it would be a no-no for the dv architecture, meaning that the edm could not be instantiated into it. however, i assume it goes well most of the time. so, the bottom line: here we have another “big t” that has to happen when instantiating the data vault in a business-oriented way. if you find my thoughts legitimate, then please excuse me if this was obvious; to the best of my knowledge, i haven’t found it clearly articulated anywhere so far. but things get even more complicated.
– second, there are business events the business people want to analyse that are not captured as such at the source! think, for example, of the booking of a new optional price plan within the context of a wireless service agreement. typically, the crm system records this as a ‘state change’ of the agreement object cluster (typically with effective dates).
so there must be a place/layer that ‘derives’ the “option booking event” from the old and new states of the agreement objects. the dv would have recorded the old state, and with the last etl cycle the new one too, but now a derivation logic must create a ‘derived’ hub/link for the “option booking” event.
this “t” here might also not be really small, but if you get it done, you are on the royal road, in my opinion. caveat: when instantiating the dv with an edm pattern reflecting the business activities of interest to the business analysts, and in the presence of a conceptual but closable gap between the sources and the dwh (edm), then there might be another “big t” within the dv architecture.
this might even require an instantiation in two phases for deriving the business events through ‘state difference’ derivation logic (i tacitly assume here that the etl doesn’t access the warehouse when doing t and l, but of course this might be a simplifying assumption; i would try to do it in more stages). dan, thanks for your patience in reading my saga, and please forgive me for contacting you directly this way; my hope is that you will find these issues as challenging as i find them, and maybe come back with your opinion on these thoughts. in the end, it is all about not using the dv orthodoxly, as my described case probably shows.
my response: i am sorry it has taken me so long to respond. i’m just now getting back to emails from a couple of months ago, and noticed that we had not had the chance to discuss the email you put together for me. let me see if i can clarify a few points for the sake of discussion.
first of all: let me state that _everything_ i’ve built, architected, designed, and written has been fully tested and implemented in us government agencies beginning in 1993. nothing i propose is purely “hypothetical”, nothing i designed is “theoretical only” – everything i’ve built is also very concrete, with specific needs that forced the rules and standards to appear out of necessity.
1. you stated: “dv pattern = canonical data modeling pattern”. you are correct. it is a higher, but semantically structured, normal form. it draws its structure from a distilled business-term ontology (if you can call it that), where the business keys are the most important “subject”, crossing functional lines of business.
2. yes, the pattern separates “identity concerns” from “relationship concerns” from “context (descriptive concerns over time)”. the actual conceptual instantiation of the hubs should be a single business key. in the theoretical world, the only thing we “truly” care about is a unique list of business keys. however, when we reach the physical world, we need auditability and temporal knowledge (hence the record source and original load-date to the warehouse). we also (unfortunately) need a surrogate key in order for the database engines to maintain join performance.
the second “conceptual pattern” is the relationship, which in the conceptual model would consist only of the business keys. however, once again in the physical sense, we replace those business keys with the “surrogate representation”. furthermore, we also need the “record source” for auditability, and the “load-date” stamp to know the first time we see the information arrive within our warehouse. ultimately a “good database engine for data warehousing” should automatically cover these things internally, but today’s engines don’t.
the third “conceptual pattern” is the descriptive context, and these are the satellites. these are the data over time. ultimately they would contain the “business key” or the “set of business keys/relationship” that they describe, and the temporal aspect under which they fall would essentially be maintained internally by (again) the database engine; however, the database engines are lacking today, and don’t provide us with the record source, nor the temporality that is necessary to maintain auditability and source-of-record.
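to make those three patterns concrete, here is a minimal sketch in python; the field names are purely illustrative (not a prescribed physical standard), just the shape of the three structures:

```python
# hub / link / satellite as data structures; field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Hub:                       # identity concern: a single business key
    hub_sqn: int                 # surrogate key, kept for join performance
    business_key: str            # the one thing we "truly" care about
    load_dts: datetime           # first time the key arrived in the warehouse
    record_source: str           # auditability: where it came from

@dataclass
class Link:                      # relationship concern: keys only
    link_sqn: int
    hub_sqns: tuple              # surrogate stand-ins for the business keys
    load_dts: datetime
    record_source: str

@dataclass
class Satellite:                 # descriptive context over time
    parent_sqn: int              # the hub or link being described
    load_dts: datetime           # the temporal aspect
    record_source: str
    attributes: dict             # the raw descriptive payload

when = datetime(2011, 1, 1)
h = Hub(1, "CUST-1001", when, "crm")
s = Satellite(h.hub_sqn, when, "crm", {"name": "acme", "tier": "gold"})
print(h.business_key, "->", s.attributes)
```

notice that the record source and load date appear on every structure – exactly the auditability fields the database engine does not supply for us.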
source-affine is not recommended. i see this as a fall-back, and unfortunately it is common for people to build this way: either because they are lazy (don’t want to do the hard work of understanding the business), or because the source system’s model is of such poor quality that alignment is difficult, or because the source data is so poorly documented (or the quality of the data is poor, with missing business keys, etc.). so there are any number of reasons people choose this approach. but i don’t like it, nor do i recommend it.
edm-affine: most definitely. this is the approach i generally choose to take; however, what ends up happening is a 50/50 model, where you do your best to align data sets according to edm efforts, but the other 50% of the model becomes source-aligned.
now, let’s make the following statement: the alignment of the model along the “affine” choice has more to do with the architecture and the modeling effort than it does with the data set. because, along the way, in order to maintain auditability, we must be able to reconstruct the “source system data set” as it stood for a given point in time. this is where the divergence begins to take place (or the questions, as you so eloquently point out).
let me make one more statement: it is because of this that the “data vault model & methodology” are bound to raw data, yet passively integrated by business key in order to remain auditable. it is because of this that 80% of the “work” involved in building the model is understanding the business keys, and how they flow through the business. it is because of this that the result is a raw data vault, which is not geared to be a “master data solution”, nor is it intended to be a “golden copy” – in other words, the data vault is intended to be “an integrated version of the facts for a specific point in time”. the data vault is not a single-version-of-the-truth. interpretation of the facts is left to “downstream” processing (going from the data vault to the data marts, or to the master data system).
1. “first and most important, the edm must be projected onto the data offering space; it doesn’t make any sense to specify a cube with facts and dimensions that cannot be fulfilled from the sources…” exactly – which is why, in the data vault methodology, we separate the facts from the interpretation of those facts. the interpretation (the polarized lens, if you will) is left to the result of loading data marts or cubes downstream of the raw data warehouse.
regarding “local-as vs global-as” views, there is a mix here, an advance if you will. the data vault modeling techniques offer “global-as” passive integration, by business key, without changing the raw data sets. you can then get either “global-as” data marts, or “local-as” data marts downstream of the data vault itself. the business key is the tie to everything.
i disagree with your statements around gav and lav being independent… the data vault model & methodology combines the two aspects, tying them together by existing business key definitions. it’s a hybrid at this point, but only at this point (tied by business key). the source of the business key happens to be raw data, which ties it back to system driven availability.
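to illustrate the two mapping directions being contrasted here, a toy sketch; every schema and field name in it is hypothetical:

```python
# toy contrast of gav vs. lav mapping directions; all schemas are hypothetical.

# the stable edm concept: customers under an enterprise-wide business key
edm_customers = [
    {"customer_bk": "CUST-1", "name": "acme",   "crm_id": "C-1", "acct_no": None},
    {"customer_bk": "CUST-2", "name": "zenith", "crm_id": None,  "acct_no": "900-7"},
]

# gav: the global concept is *defined* as a query over the sources,
# so adding or changing a source means rewriting the global definition.
def gav_customers(crm_rows, billing_rows):
    return ([{"customer_bk": r["crm_id"],  "name": r["name"]}   for r in crm_rows] +
            [{"customer_bk": r["acct_no"], "name": r["holder"]} for r in billing_rows])

# lav: each source is *described* as a view over the stable edm; here the
# crm "sees" exactly those edm customers that carry a crm identifier.
def crm_as_view(edm):
    return [{"crm_id": c["crm_id"], "name": c["name"]}
            for c in edm if c["crm_id"] is not None]

print(crm_as_view(edm_customers))  # only CUST-1 is visible through the crm view
```

in the data vault, the business key plays the role of `customer_bk` here: the stable point that both directions hang off of.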
what you seem to be missing is “the separation of storing and integrating data by business key from the usage, application, or interpretation of the data by business users”. to me these are two distinct activities. the raw data vault handles the former, while the data marts downstream align to “business usage and functionality” (the latter). this is the sheer power of the data vault model.
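here is a tiny sketch of what passive integration means in practice – two systems feed the same business key, the hub records the key once, and each system’s raw rows land untouched in their own satellite entries (data and system names are hypothetical):

```python
# passive integration by business key; data and system names are hypothetical.
hub = {}           # business_key -> surrogate key
satellites = []    # raw, source-stamped descriptive rows
next_sqn = 1

def load(business_key, record_source, payload):
    """resolve the hub entry, then store the raw payload untouched."""
    global next_sqn
    if business_key not in hub:
        hub[business_key] = next_sqn
        next_sqn += 1
    satellites.append({"sqn": hub[business_key],
                       "record_source": record_source,
                       "payload": payload})

# two systems, two spellings, one business key
load("CUST-1001", "crm",     {"name": "Acme Inc."})
load("CUST-1001", "billing", {"name": "ACME INCORPORATED"})

print(len(hub), len(satellites))  # 1 hub entry, 2 raw satellite rows
```

neither spelling is “corrected” on the way in – picking a winner is interpretation, and that belongs downstream.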
in my mind, there isn’t a single answer; there are two answers to this problem. the conceptual definition of a business key often differs from its applied usage. furthermore, the applied usage of business keys often “changes” depending on the hierarchy of business responsibility in which it lives. (ie: customer number can mean different things to different people, but ultimately the key used to identify a single customer should be consistent across the entire business).
“there must be a mapping between the conceptual schema and the source schema” agreed… the data vault model is an intermediate step (it’s not a source schema per se, but it’s not a conceptual-only schema either) – it’s halfway between. i argue that there are two parts to the “t”, as you put it: one that worries about arrival timing, latency, and sourcing the data, and the other that worries about application of the data set, functionality, and business alignment. what i’ve done is split the two up, so that the warehouse model handles passive integration and passive alignment (it doesn’t change the raw data but still achieves partial integration), and the downstream process building the data marts provides the remaining business alignment and adjustments.
2. “but in the presence of an edw layer (3nf or dv) should it also manifest in this layer?” no. absolutely not. i have been in way too many situations, and seen too many projects go down the tubes (because of constant re-engineering), caused by combining these layers. yes, the key concepts of the edm instantiate the hubs, but the application and interpretation of the data is left for downstream, and should have no impact on the model whatsoever. this way the model can remain fluid, adaptable, and flexible to change going forward, without the rising cost of maintaining complexity in “basic integration”.
the big “t”, as you call it, has two fundamental steps: integration and application. i’ve split these apart so that maintenance and auditability can co-exist, along with ease of use and flexibility, and so that application/interpretation can change (at the speed of business change) on the way out, without disturbing, losing, or changing any of the auditable history that should be stored in the raw data vault (edw).
now, is this a major data integration task? yes and no. the data vault model is meant to be fluid and flexible – and to allow the designer to reset a finite number of tables without cascading change impacts (based on what they learn along the way). it is a top-down architecture with a bottom-up design principle: build what you know today, and change the model as you learn things tomorrow. the major data integration work is a people-based process: understanding the common business keys. and of course, we do not allow modified data into the raw data vault at any time. this is not an issue; in fact, just the opposite: it is a requirement for being accountable and auditable as a system of record. it is the only place in the organization that houses this “integrated view of raw data, aligned by business keys”.
i never claimed that the entire edm can be instantiated in the data vault – nor should this claim be made. again, only part of the edm is represented: the hierarchy itself (relationships), and the business keys themselves (entries in the hierarchy). the rest of the edm (the application and definition of what to do with the data, and how it’s used) is left for downstream processing when building data marts. this is the true power behind this methodology – sort of like “separation of church and state”, or “checks and balances”.
now, what this does to “big t” is break it into two pieces: “easy or little t” (raw data loads and passive integration) and “big t” (business rules and data alignment). we do not have two “big t” processes to accomplish, only one (as usual), and it becomes easier and more agile to produce data marts downstream, because the sourcing issues and arrival-latency problems have already been solved.
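a small sketch of that split, with a made-up cleanup rule as the “business rule” – note that the raw vault keeps the data exactly as it arrived, and the rule is applied only on the way out to the mart:

```python
# "little t" vs. "big t"; the cleanup rule here is a hypothetical example.
raw_vault = []

def little_t(row):
    """load into the raw vault with no interpretation at all."""
    raw_vault.append(dict(row))

def big_t(rows):
    """apply business rules downstream, while building the mart."""
    return [{**r, "name": r["name"].strip().title()} for r in rows]

little_t({"customer_bk": "CUST-1001", "name": "  acme inc.  "})
mart = big_t(raw_vault)

print(repr(raw_vault[0]["name"]))  # raw history, untouched
print(repr(mart[0]["name"]))       # business-ruled copy for the mart
```

if the business rule changes tomorrow, only `big_t` is rewritten and the marts reloaded; the auditable history in the raw vault is never touched.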
now, you make reference to “data vault in a business-oriented way” – this can happen, but you must have a raw data vault in place before it does. most of the time, “the business interpretation” of the data is left to downstream data marts – modeled any way you see fit (from cubes to flat-wide, to 3rd normal form, to star schema…). it is all viable.
things actually get less complicated here… the business needs and wants to see the “gaps” between their business perception and what their sources are collecting. without raw data with historical patterns, integrated by business key, this is nearly impossible to uncover. the business also needs and wants to see the “gaps” between their applied business rules (downstream marts) and their auditability (or lack thereof), and thus requires the raw data to be bounced against the business rules and data marts in order to understand these.
there is a place/layer to derive the “option booking event” – that is in the big t logic going from the raw data vault downstream to the data marts. i do not typically recommend the data vault model for use with “derived business data” – that job falls on data marts, and star-schema models are much more suited to business user access than data vault. the data vault model is suited for passive integration, scalability, flexibility, and agility (of the team members).
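as a sketch of that derivation logic – with hypothetical names, and a toy representation of the two satellite states – the ‘state difference’ might look like this:

```python
# deriving an "option booking" event from two satellite states (toy format).
old_state = {"agreement_bk": "AGR-42", "options": {"voice"}}
new_state = {"agreement_bk": "AGR-42", "options": {"voice", "intl-roaming"}}

def derive_booking_events(old, new):
    """emit one event per option present in the new state but not the old."""
    return [{"event_type": "option_booked",
             "agreement_bk": new["agreement_bk"],
             "option": opt}
            for opt in sorted(new["options"] - old["options"])]

events = derive_booking_events(old_state, new_state)
print(events)  # one derived 'option_booked' event for "intl-roaming"
```

the raw vault supplies both snapshots untouched; logic like this runs in the big t, and the derived event lands in the mart, not in the raw vault.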
“caveat: when instantiating the dv with an edm pattern reflecting business activities of interest for the business analysts and in the presence of conceptual but closeable gap between the sources and the dwh (edm), then, there might be another “big t” within the dv architecture” – no, not true. what was missed is the subtlety here – we split the work, divide and conquer. raw passive integration (little t going in to the data vault), and big t (business rules & application of the data) coming out of the data vault on the way to the data marts.
i hope this clears things up, but if it raises more questions than it answers, perhaps we should have a web-conference.
ps: as always, i’d love to hear your feedback – please comment on the end of this blog entry.