they certainly seem to be passé: past history, past tense, something that was hotly debated years ago. but i (personally) don’t see the need to debate these two approaches anymore. i think both approaches are well-defined, and in this post i will walk through a textual discussion (for those who can’t get to the video side of the house from their office).
the old “inmon vs kimball” debate has gone on long enough; eventually the war needs to end, and i think it has. there is no need to fight over the definition of a data warehouse any longer. sure there are differences, sure there are best practices, but now there is a hybrid solution. so why do we continue arguing? why do we continue wasting time, money, and resources on “who is right?”
before i go further: i am a dw2.0 certified architect at the mastery level, and i have been authorized by bill inmon in writing to teach, speak, and write about it. i am also a cbip certified master and an iccp data management/data administration master. i understand both sides of the fence and have built both frameworks for customers.
the old school of data warehousing thought
when people talk about these “wars” they are often referring to old-school thinking and yesterday’s definitions of a data warehouse. first, whenever we think of “data warehousing” or “business intelligence” we have to split the components out properly:
- framework (what to build, enterprise guidelines, component layout)
- architecture (systems, data, process)
- methodology (how-to-build, best practices, standards, project, people, risk, where to put what in the stream or bus)
- implementation (tools, training, time to build)
- definition (physical, logical models, data structures, indexes, hardware, infrastructure, availability, business metadata, governance)
and there are probably a few others that i missed (like data management, etc…)
original inmon data warehouse framework definition
build a central data warehouse, apply your business rules (change/alter/cleanse the data) upstream of the central data warehouse, don’t store everything from the sources, then move your data downstream to data marts. the data warehouse structure was set up as a variant of 3nf.
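to make the flow concrete, here is a minimal sketch of that original inmon pattern in python. all names (`apply_business_rules`, `load_inmon_style`, the row fields) are hypothetical illustrations, not part of any framework:

```python
# sketch of the original inmon flow: business rules run UPSTREAM of the
# central warehouse, so only rule-processed data is stored -- the raw
# source values are not retained. names here are hypothetical.

def apply_business_rules(row):
    # cleanse/alter on the way in, e.g. standardize a country code
    cleaned = dict(row)
    cleaned["country"] = cleaned["country"].strip().upper()
    return cleaned

def load_inmon_style(source_rows):
    # central 3nf-style warehouse holds only the cleansed data
    warehouse = [apply_business_rules(r) for r in source_rows]
    # data marts are derived downstream from the warehouse
    marts = {"sales": [r for r in warehouse if r["domain"] == "sales"]}
    return warehouse, marts

wh, marts = load_inmon_style([
    {"domain": "sales", "country": " us "},
    {"domain": "hr",    "country": "de"},
])
```

note that once `apply_business_rules` has run, the original value (`" us "`) is gone everywhere, which is exactly the auditability problem discussed later in this post.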
original kimball data warehouse bus architecture definition
build a set of consolidated data marts, call that your data warehouse, process your data through business rules on the way in to the “data warehouse”, don’t store everything from the sources, then allow business users direct access to the data warehouse (which doubled as data marts because of its star-schema structure).
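for readers less familiar with the star-schema shape that makes this direct access work, here is a tiny sketch. the tables and fields (`dim_product`, `fact_sales`, etc.) are hypothetical examples of the pattern, not anything kimball prescribes verbatim:

```python
# minimal star-schema sketch: one fact table of measures keyed to
# conformed dimensions; in the original kimball definition, this
# consolidated set of stars IS the "data warehouse" users query directly.

dim_product = {1: {"name": "widget"}, 2: {"name": "gadget"}}
dim_date    = {20240101: {"year": 2024, "month": 1}}

fact_sales = [  # fact rows carry only surrogate keys plus measures
    {"product_key": 1, "date_key": 20240101, "amount": 9.99},
    {"product_key": 2, "date_key": 20240101, "amount": 4.50},
]

# a typical "business user" query: revenue per product name (star join)
revenue = {}
for f in fact_sales:
    name = dim_product[f["product_key"]]["name"]
    revenue[name] = revenue.get(name, 0.0) + f["amount"]
```

the business rules have already been applied by the time rows land in `fact_sales`, which is the key contrast with the dw2.0 flow described below.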
what’s new? what’s changed?
there is a time and a place for each architecture; however, with dw2.0, bill inmon has changed his definition of what a data warehouse is… in fact, customers have changed their definition of what a data warehouse is and should be. why? regulations, laws, the value of data as an asset, and the value of bad data as an asset for performing gap analysis against business processes.
bill inmon defined dw2.0:
dw2.0 is a framework (just like the cif is a framework) – neither of these tells you how to build; they only tell you what to build and what you need to feed it. in dw2.0, there is a data bus – why? because of the real-time components/real-time flow of information into the dw2.0 framework. dw2.0 also says: in order to be a dw2.0 data warehouse you need accountability, auditability, temperature-based data sets, metadata (in your warehousing solution), and the ability to load/use/apply unstructured and semi-structured data sets.
furthermore, the definition of the data warehouse now says: load all the data (good/bad/ugly) into the data warehouse, and move your business rules downstream, on the way to the data marts. why? because of auditability requirements and real-time inflow requirements.
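contrast this with the earlier sketch of the original flow: here the warehouse keeps everything untouched and the rules only fire on the way out. again, all names are hypothetical illustrations:

```python
# sketch of the revised dw2.0-style flow: the warehouse stores ALL
# source data (good/bad/ugly) for auditability, and business rules
# run DOWNSTREAM, on the way to the marts. names are hypothetical.

def load_dw20_style(source_rows):
    # warehouse keeps the raw rows exactly as received -- fully auditable
    warehouse = list(source_rows)

    def rules(row):
        # business rules applied only when deriving the marts,
        # e.g. floor an obviously bad negative amount to zero
        fixed = dict(row)
        fixed["amount"] = max(fixed["amount"], 0)
        return fixed

    marts = [rules(r) for r in warehouse]
    return warehouse, marts

wh, mart = load_dw20_style([{"amount": -5}, {"amount": 10}])
```

the "bad" value survives in `wh` for auditors, while the mart sees the cleansed version; that is the whole point of moving the rules downstream.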
this is the new school of thought from bill, which is why the data vault model is a great fit for this architecture, and also why bill himself endorses the data vault model and methodology as the optimal choice for implementing the dw2.0 framework. furthermore, scott ambler (a prominent voice in the agile community) also backs the data vault methodology as agile.
ralph kimball’s definitions have changed:
ok, the bus architecture is still the bus architecture, and again, the bus architecture really serves as a framework – it tells us what the components are that we need to build. dr. kimball goes on to define how to build your “data warehouse” as a consolidated/federated set of star schemas. he still maintains that, going into the star schemas, you need to run your business rules first.
so when auditability became an issue, and raw data was needed as snapshots over time, what did he do? he introduced what he called “persistent staging areas” or “historical staging areas”. well, i’m here to tell you this: the minute you put history in a staging area, it is no longer a staging area – it becomes a raw, source-driven data warehouse. why? because history demands that you attend to it, manage it, coalesce it… what he has in effect said is: build a raw data warehouse upstream of your star-schema or business-driven data warehouse.
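here is a small sketch of why "history demands that you attend to it": the moment a staging table persists snapshots, every reader has to coalesce across load dates to find the current row, which is warehouse behavior, not staging behavior. the structures and names below are hypothetical:

```python
from datetime import date

# once a "staging" table keeps history, it must be managed like a
# warehouse: each load appends a snapshot, and readers must coalesce
# across load dates to recover the current picture. hypothetical sketch.

psa = []  # persistent staging: (load_date, business_key, payload) rows

def load_snapshot(load_date, rows):
    # append-only: nothing is overwritten, so full history is retained
    for key, payload in rows.items():
        psa.append((load_date, key, payload))

def current_view():
    # coalesce: latest payload per business key, ordered by load date
    latest = {}
    for load_date, key, payload in sorted(psa, key=lambda t: t[0]):
        latest[key] = payload
    return latest

load_snapshot(date(2024, 1, 1), {"CUST-1": {"city": "berlin"}})
load_snapshot(date(2024, 2, 1), {"CUST-1": {"city": "munich"}})
```

both snapshots survive in `psa` (time-variant, non-volatile), and `current_view` is exactly the kind of management work that a true throwaway staging area never needs.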
in other words, the definitions of where to build a data warehouse have aligned between bill and ralph. now, there is still disagreement as to how to model it and what to call it, but in reality, the psa is a data warehouse! it holds time-variant, non-volatile, raw source data. and if it’s modeled in 3rd-normal-form-ish structures, then it is truly an old-school inmon warehouse.
if it’s modeled as disparate staging tables (separate, non-integrated snapshots of source data), then it is still time-variant, non-volatile, raw source data – just not integrated. it is still an old-school inmon style warehouse.
so why fight over the definitions? there is no need to do so.
ok – if this is true, how and why do i need a data vault?
let me see if i can make some sense of this: the data vault model, methodology, architecture, and implementation are separate pieces of the puzzle. the data vault model is a standardized hub-and-spoke-style model that offers integration by business key. it is based on best practices found in both dimensional modeling and 3rd normal form (various normalization levels from 1st to 5th).
it is a hybrid data model, built to overcome the shortcomings of each data modeling technique from a data warehousing data storage perspective.
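to ground the hub-and-spoke idea, here is a minimal sketch of two of the core data vault structures: a hub (one insert-only row per business key, the integration point) and a satellite (append-only descriptive history). links, which relate hubs to each other, follow the same insert-only pattern. the names and shapes below are hypothetical simplifications:

```python
from datetime import datetime

# minimal data vault sketch: hubs hold business keys, satellites hold
# descriptive history; links (not shown) relate hubs. hypothetical names.

hub_customer = {}   # business_key -> first-seen load timestamp (insert-only)
sat_customer = []   # (business_key, load_ts, attributes) -- full history

def load_customer(business_key, attributes, load_ts):
    # hub: exactly one row per business key; re-loads never overwrite it
    hub_customer.setdefault(business_key, load_ts)
    # satellite: append a new descriptive row every time attributes arrive
    sat_customer.append((business_key, load_ts, attributes))

load_customer("CUST-1", {"name": "acme"}, datetime(2024, 1, 1))
load_customer("CUST-1", {"name": "acme gmbh"}, datetime(2024, 2, 1))
```

the hub still has a single `CUST-1` row after both loads, while the satellite has two: raw, auditable history integrated by business key, which is the fit with dw2.0 described above.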
the methodology component is the implementation guide, the how-to, the standards, the best practices, the risk mitigation strategies, the project plans, etc… it ties the framework you are using to the design of the data warehouse and the components that you build.
the data vault model itself ties your business processes to your data set, and allows you to “search” for the gaps between what is being collected on the source systems and how you believe (your perception of) your business processes to be operating.
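at its simplest, that “search for gaps” is a set comparison between the business keys your documented process says should exist and the keys the source system actually captured. the key values below are invented for illustration:

```python
# sketch of the "gap search" idea: compare business keys the documented
# process expects against keys actually collected on the source system.
# all key values here are hypothetical.

expected_orders  = {"ORD-1", "ORD-2", "ORD-3"}  # per the documented process
collected_orders = {"ORD-1", "ORD-3", "ORD-9"}  # keys seen in source data

missing_from_source = expected_orders - collected_orders  # process gap
unknown_to_process  = collected_orders - expected_orders  # undocumented flow
```

each non-empty set is a finding: either the source system is not capturing part of the process, or the business is running a process nobody documented.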
the data vault architecture plays a role in incorporating scalability, big data, nosql platforms, and real-time feeds.
the data vault implementation best practices sit within the methodology, and help establish repeatability, consistency, scalability, and automation / generation of your data warehouse.
in the new inmon dw2.0 framework…
the data vault model is the data architecture to use when building your enterprise data warehouse. the data vault methodology gives you the implementation best practices – how to do it properly. notice that the methodology dictates that you move the business rules downstream, to between the data warehouse and the data marts.
in the new kimball bus architecture / framework…
if you really insist on eliminating a staging area and instead having persistent staging (really a data warehouse), then modeling it as a data vault makes it a perfect fit: raw source-system data, historically captured, non-volatile, accountable, yet integrated by business key. at that point, your “old-style kimball conformed data warehouse” is really a “data mart” – downstream of the business processes… in other words, the data vault model in the psa becomes your true, accountable data warehouse, and your downstream conformed dimensions become your “data mart layer”.
if you really want to call it a data warehouse, then call it a business information warehouse (biw) – because the data in this layer has been munged, changed, cleansed, altered – it is really for business information.
you no longer have to store 100% of the history in the biw or data mart layers… but i digress.
the point is: the new kimball bus architecture says: must be able to load real-time data, must be capable of storing a copy of the source system raw data in historical staging areas (really a raw data warehouse), must still provide business with answer sets they need to get their jobs done… ok – biw or data marts, fair enough.
the last question then becomes: for the kimball bus architecture or the new inmon architecture, do you still need/want a raw, non-persistent staging area? you might, especially if you have external data feeds, unstructured data feeds, semi-structured data feeds, xml or cobol data feeds, normalization, alignment, or other heavy functions to apply.
so you see, in the end it doesn’t matter what labels the components carry; there are 3 basic components necessary in both architectures:
- a raw non-persistent staging area
- a raw, integrated by business key, data warehouse (which i advocate a data vault model for)
- a munged/merged/cleansed view of the data for business – turning the data into information. you can call it “data marts” or you can call it a business information warehouse (no, i don’t mean ibm’s definition of the term, nor sap’s definition of the term). you’ve heard me refer to it as a business data vault (when it’s modeled in a data vault style).
either way these components are necessary for a successful enterprise view of your data.
hope this helps clear the air a little bit. as always, if you have questions, thoughts, or comments about what you believe the kimball or inmon styles to be, please feel free to add them to this post.