i received a lengthy comment recently from a gentleman kind enough to provide some great insight and feedback on how to get star schema data warehouses to perform. he politely asked me to respond to his comments, which i will take the time to do here in this post. this post is more on the subject of using a data vault model (as an edw) vs a star schema as an edw. i wanted to say thank-you to frank for providing an honest answer. as usual, before i get going, i invite everyone to post comments and questions as we all are learning about the next step in the process of building data warehouses. also, as usual, i have said it before and i'll say it again: if what you are doing is working for you, then that's wonderful, don't change or fix what isn't broken. but if you are having problems or issues with your data warehouse, then it may be time to seek a new alternative approach such as the data vault model and methodology.
first a disclaimer:
frank was very nice and professional in his comments. i mean no disrespect to him, his knowledge, or the system that inergy uses; in fact, i truly appreciate the insights. however, i sometimes get overly excited in making a point, and my writing can be "rough" around the edges. this roughness is not directed at frank or anyone else, it just happens to be my writing style (i write like i teach… with passion – i hope). as always, take the parts you like, and leave the rest. i believe frank to be a very smart individual with valid opinions and great questions. thank-you for taking the time to write a lengthy comment – i enjoy reading comments like this. i believe in what i am doing, and it comes through in what i write and teach.
i just read that you'd like to hear from me and our company inergy, because we have experienced a lot of success with star schemas in terabyte environments, with multiple sources, including (near-)real-time loading, and you want to know what we've done that works.
our architecture is used in our "bi in the cloud" solution in the netherlands. this is a true biaas environment with bi-paas (technology: netezza, powercenter & microstrategy), including development, maintenance, system management and support of the dwh. in other words, the complete 'package' for a data warehouse environment. this environment has been in place for almost 4 years and we serve a lot of customers.
let me first start to explain where we agree: a good backend system and a good frontend system are required in an enterprise data warehouse. in the dv-architecture the combination of the dsa, raw dv and business dv is the backend system (in my definition) and the combination of data marts is the frontend system. we also agree on the requirements of the backend system and the frontend system:
– backend system: system-of-record, complete history (which leads to an auditable and compliant dwh), ability for real-time loading, preventing cascading-update impact, easy to load, easy to build, easy to extend, easy to restart and easy to scale.
– frontend system: an easy-to-use environment for the end user, e.g. dimensional model.
we also agree that both the architecture of inmon and kimball do not meet those requirements.
yes, we agree on a good back-end system needing to be enterprise wide, scalable, flexible, and housing raw auditable data. however, i have never said the dv architecture is a combination of dsa (data staging area), raw dv, and business dv. this is an incorrect assumption. the term: raw-dv is what i call an enterprise data warehouse, and is comprised of business keys, and soft-integrated raw data. this is much different than a “raw-dv” as the term is used by so many people out there. see my post (with ronald damhof) on ronald’s site about raw-dv definitions.
the business dv is not a back-end system, nor is it meant to be a back-end system. the business dv is another data mart, just modeled similarly to the dv structures using hubs and links. the business dv is a front-end delivery mechanism, so please don't mistake it for a back-end system. furthermore, i never advocated the use of a business data vault as a required part of the architecture; in other words, it's optional – you implement a business dv only if you have pain-points / needs / reasons / requirements to do so.
you see, i’m a minimalist, i believe you must have business justification for everything that you build, regardless of what is being built. that makes the business dv optional.
front-end system: i do not define a front-end system as limited to a dimensional model. i make the statements as follows: the front-end system will be data marts, let your data marts be modeled specifically for the purposes of rapid business retrieval and understanding. the data marts may be a star-schema dimensional model, a cube, a flat-wide file, or a business dv model – it all depends on what the business users need. data marts may also be virtual in nature, and should be “virtualized” or stored as views until such time as performance dictates otherwise.
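to make the "virtualize first" idea concrete, here is a minimal sketch using python's sqlite3. the table and column names here are made up for illustration (not actual dv standards): the mart is nothing more than a view over the warehouse tables until performance dictates otherwise.

```python
import sqlite3

# a minimal sketch of a "virtual" data mart: the mart is just a view over
# the warehouse tables, physicalized only if performance later demands it.
# all table and column names here are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- toy edw tables (stand-ins for a real warehouse layer)
    CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_dts TEXT);
    CREATE TABLE sat_customer (customer_key TEXT, name TEXT, city TEXT, load_dts TEXT);

    -- the "data mart" is only a view until the business justifies a physical table
    CREATE VIEW dim_customer AS
    SELECT h.customer_key, s.name, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_key = h.customer_key;
""")
con.execute("INSERT INTO hub_customer VALUES ('C100', '2011-01-01')")
con.execute("INSERT INTO sat_customer VALUES ('C100', 'acme corp', 'amsterdam', '2011-01-01')")

rows = con.execute("SELECT customer_key, name FROM dim_customer").fetchall()
print(rows)  # [('C100', 'acme corp')]
```

if the view's join performance ever becomes a problem, the same select statement becomes the load for a physical table – the business-facing structure doesn't change.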
a question about your statement: “we also agree that both the architecture of inmon and kimball do not meet those requirements.”
i’m not sure i do agree, because the previous bullet point you proposed (front-end system) was strictly a kimball architecture (dimensional)… by the way, dr. kimball did not invent the dimensional model, he did however make it popular, define it well, and formalize it as a method for data modeling. let me change your statement to the following:
we also agree that both the architecture of inmon and kimball do not meet the requirements of back-end data warehousing.
my concern with the dv-architecture is the usage of a lot of storage layers (dsa, raw dv, business dv, data marts). let's face it: every layer needs to be developed, maintained and managed, the data has to be processed into every layer, (dv-)knowledge is required, and of course the data needs to be stored. this leads to additional costs and a longer time-to-market. i think your argument against this firm statement is that the dv-architecture breaks complexity down into bite-sized manageable chunks (divide and conquer). if i compare this to our solution (see below), i really don't recognize this advantage, and if there is a (small) advantage, the disadvantages of the additional storage layers are (in my opinion) much bigger. bottom line, my concern with the dv indeed is yals: yet another layer of storage. in other words, the dv will lead to a good enterprise data warehouse, but it takes (in my opinion) more time than needed.
first, i'd like to say this: i have taught this in class in the past… the dv has no more "storage layers" than a traditional enterprise data warehouse. there are: 1) the stage, 2) the edw, 3) the data mart (which in my opinion should be virtualized, see my comments above). now, in certification class (in person) i teach the following: the future, for real-time data warehousing, actually makes the staging area obsolete! i maintain that in a truly real-time data warehouse (operational data warehouse) there will be no need for a staging area. why? because the data is arriving in information queues (message queues) and is processed directly into the data vault as-is. at that point, there is no time to stop and process multitudes of business rules to conform and align the data set to business needs.
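here's a rough sketch of the no-staging, real-time idea in python, with an in-process queue standing in for a real message broker (the table structures and message shapes are my own invention): messages land straight in data vault structures as-is, with no business rules applied on the way in.

```python
import json
import queue
import sqlite3

# hedged sketch of "no staging area" real-time loading: messages come off a
# queue and are written straight into data vault structures as-is, with no
# business-rule processing in between. queue and message shape are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hub_order (order_bk TEXT PRIMARY KEY, load_dts TEXT);
    CREATE TABLE sat_order (order_bk TEXT, payload TEXT, load_dts TEXT);
""")

mq = queue.Queue()
mq.put(json.dumps({"order_no": "O-1", "amount": "100"}))
mq.put(json.dumps({"order_no": "O-2", "amount": "bad-data"}))  # ugly data loads too

while not mq.empty():
    msg = json.loads(mq.get())
    bk = msg["order_no"]
    # land the business key and the raw payload directly: no staging, no conforming
    con.execute("INSERT OR IGNORE INTO hub_order VALUES (?, datetime('now'))", (bk,))
    con.execute("INSERT INTO sat_order VALUES (?, ?, datetime('now'))",
                (bk, json.dumps(msg)))

loaded = con.execute("SELECT COUNT(*) FROM hub_order").fetchone()[0]
print(loaded)  # 2
```

note the "bad-data" amount loads without complaint; conforming it is the data mart layer's job, downstream.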
every layer needs to be developed, maintained, managed…. agreed! again, i am a minimalist – i’ve always taught that: only build what is absolutely necessary and can be measured as tangible business value. if it has no business value (including maintenance costs, complexity ratings, ease of use, flexibility, etc…) then don’t build it. it also needs to be justifiable and measurable in order to assign a quantitative value to it. dv knowledge is required – yes, true, very true… but it’s an evolutionary process to think in the dv fashion. take everything you know, combine it (dv is a hybrid data model built off the best of breed from 3nf and star schema modeling techniques) – there really isn’t anything “new” to learn, just different ways of thinking of data architecture.
if i compare this to our solution (see below), i really don’t recognize this advantage and if there is a (small) advantage, the disadvantages of the additional storage layers are (in my opinion) much bigger. bottom line, my concern with the dv indeed is yals: yet another layer of storage. in other words, the dv will lead to a good enterprise data warehouse, but it takes (in my opinion) more time than needed.
i can see your point, and we’ll discuss the comparison below. regarding the advantages… if you’ve never been through certification training, then i agree, it would be hard for you to see the advantages and disadvantages of using an enterprise data warehouse architecture. keep in mind, it is not a data vault architecture – the data vault is a modeling technique for your enterprise data warehouse layer and a methodology for how to implement an edw properly. do not confuse the data vault with an enterprise data warehouse architecture. the edw architecture is a systems architecture, the data vault is a modeling paradigm and implementation methodology that fits in your enterprise data warehouse architecture (systems architecture). when you examine the definition of my proposed edw systems architecture, i talk about the following:
staging area -> edw (raw data vault) -> data marts (including star schema, cubes, flat-wide files, business data vault, etc…)
the only difference in “storage areas” that i can see between what you say inergy has, and what i discuss is the staging area – and what we call a data warehouse. these definitions are different. (we’ll get to this).
ok, i think you're now curious about the inergy architecture. well, this architecture is quite simple (and not unique): our backend system is a historical(!) data staging area. this historical dsa has the same structure as the source, including start- and end-timestamps. we generate the process from source to dsa 100%, including delta detection, transport and archiving. the frontend system is a (traditional) dimensional dwh storing the data at the lowest grain. physically it's one database with dozens of conformed star schemas. the business rules are implemented from dsa to dwh.
ok, i get it… one of the disagreements you and i have is over what we "define" to be a data warehouse. i maintain the following stipulation (and i define it in my book): a data warehouse is defined by non-volatile, time-variant, integrated information. i also include raw data in that definition, and if you are building a data vault, integrated by business key is also included in the definition.
i would argue that your historical staging area is really a raw data warehouse – however, it is not integrated by business key. the minute you put history in a staging area, it becomes a data warehouse – and requires all the maintenance, overhead, management and everything else that a “data warehouse” requires. such as performance, up-time, partitioning, parallelism, flexibility, scalability, auditability, etc… i’ll agree that a historical dsa is a data warehouse. i will also say: a historical (non-integrated) dsa is not an enterprise data warehouse. why? because it is not integrated by a horizontal view of the business keys across the different lines of business. the data (as you pointed out), is the same structure as the source.
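to illustrate what "integrated by business key" means in practice, here is a hypothetical sketch using python's sqlite3 (the names are invented): the hub keeps exactly one row per business key no matter how many source systems present it, which is the horizontal integration a source-structured historical dsa does not give you.

```python
import sqlite3

# hypothetical sketch: a hub integrates the same business key arriving from
# two different source systems, which a source-structured historical dsa
# would keep in two separate, un-integrated tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hub_customer (
        customer_bk TEXT PRIMARY KEY,   -- business key, shared across systems
        record_source TEXT,             -- first system to present the key
        load_dts TEXT
    );
""")

def load_hub(con, business_key, source, load_dts):
    # insert the business key only if it is new; the hub ends up with one
    # row per key no matter how many source systems feed it
    con.execute(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?)",
        (business_key, source, load_dts),
    )

load_hub(con, "C100", "crm", "2011-01-01")      # crm presents C100
load_hub(con, "C100", "billing", "2011-01-02")  # billing presents the same key
count = con.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
print(count)  # 1 -- one integrated row for the key, regardless of source
```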
some questions for you, that i may be ignorant on the answers due to lack of visibility:
- if the historical dsa is the same structure as the source, how do you handle cobol feeds?
- if the historical dsa is the same structure as the source, how do you handle xml feeds?
- if the historical dsa is the same structure as the source, how do you integrate “unstructured/semi-structured” data feeds?
- if the historical dsa is the same structure as the source, how do you handle “real-time message inflows”?
- what happens when a source system is “retired” and a new one takes its place? how do you load new data from the new system, what do you do with the old data from the old system in the historical dsa?
- how do you recover from issues/problems if the historical dsa gets "out of sync" with the source system?
- how do you handle 100% restartability / recoverability of your loads to the historical dsa if they fail? do you use database logs from the historical dsa?
- what happens if the constraints in the source are defined, but broken? how does "broken data" get stored in the historical dsa? i guess this assumes that you have all the fk's, pk's and constraints applied in the historical dsa…
- what if you have null pk’s in the source, and you need to report all the broken data to the business, can this “busted data” make it in to the “exact replica historical dsa?”
- how do you handle changes to the source systems structures in the historical dsa? doesn’t this cause a cascading impact against the children? especially if time is added as an attribute to the pk?
don’t get me wrong, i’m not saying that a historical dsa is a bad thing… that’s just a concept… i’m debating the value of modeling the historical dsa according to the source system, i’m debating the value of storing history in a staging area in the first place, why? because the next thing you say inergy does is: “copy all that history – run it through business rules, and store it again in what i call the data mart layers (your federated star schema).” that’s two copies of the data (in full)…. ouch.
in the data vault methodology, when we load data marts (because the staging area has no history, and the edw -dv has 100% of the history in one place), we can put rolling history in the data marts. we can limit the amount of history in our data marts to just what the business needs and can justify. this is not something that i can see is available given the architecture of inergy. by using the dv methodology, our overall storage costs are actually lower than the architecture you propose (we use less overall storage) – because we have no need to store all the history twice.
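here is a tiny sketch of the rolling-history idea in plain python (the window size and data are made up): the edw keeps all history, and the mart load simply filters to the window the business can justify.

```python
# hedged sketch of "rolling history" in a mart: the edw keeps everything,
# the mart load keeps only the window the business can justify.
from datetime import date, timedelta

# stand-in for full edw history: (load_date, measure) tuples
edw_history = [
    (date(2008, 1, 1), 10),
    (date(2010, 6, 1), 20),
    (date(2011, 3, 1), 30),
]

def rolling_window(rows, as_of, months=24):
    # keep roughly the last `months` of history for the mart;
    # everything older stays in the edw only
    cutoff = as_of - timedelta(days=months * 30)
    return [r for r in rows if r[0] >= cutoff]

mart_rows = rolling_window(edw_history, as_of=date(2011, 6, 1))
print(len(mart_rows))  # 2 -- the 2008 row stays only in the edw
```

the point: the mart's storage footprint is a business decision, not a technical consequence of the architecture, because the full history lives once, in the edw.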
to your credit (or whomever built the system you have):
the federated star schemas were done properly – at the lowest level of grain. i'm assuming this is the case based on what i'm reading. i've run into so many "dimensional data warehouses" that are not stored at the lowest level of grain; they are the cause of missed auditability and missed opportunity, and they can't deliver new requirements/requests to business users on demand. i will also say that, from the sounds of it, your business never really "changes" too much. in other words, they don't acquire new companies, or split off parts of the company – it also sounds like they aren't replacing systems too often, and that it's a somewhat narrow line of business (easily defined by finite mathematics). in this case, it is easy to "stabilize" dimensions, and not try to overload the dimensions with other components and multiple definitions.
the dimensional data warehouse models i run into are, quite simply put, a mess. their dimensions have been overloaded, abused, and continually "added to" until they simply can't hold any more data (due to a number of reasons). also, they have at least 3, if not more, definitions per field in the dimension… i.e.: when the moon is blue, the customer number means x; when the customer name starts with a "*" then the customer number is not a customer, but a prospect… etc… they can't keep it straight, let alone combine new data sets from new source systems. the maintenance costs become astronomical at this point.
but i will say, that it sounds like inergy’s system is tight, and well-designed. my congratulations to you and the team that built it. it’s not this way in the environments i visit. it usually takes a really good data modeler to build the right dimensionally based data warehouse – the first time out of the gate.
don’t we have data marts? well, only when required. it is required only in exceptional situations, e.g.:
– kpi-applications: a specific star schema with kpi's is required, derived from the dwh star schemas. but this kpi-schema is stored again in physically the same database, with dimensions conformed to the dwh, so it isn't a data mart from a user perspective, only from a technical perspective (loading and storing the same data again).
– for data mining purposes: e.g. for marketing analytics, one record per customer with a lot of characteristics is required. again, this data (mart) can be stored in the same physical database (netezza supports in-database analytics).
so, data marts are the exception instead of the default, resulting in a minimum of storage layers. our experience is that 95% of the data is not copied to a data mart.
the exact same statements go for the data vault that i propose… don’t build the data marts unless they are justified. furthermore, build the data marts as virtual layers (in ram, or sql views) until such time as the business rules are too complex, or the performance is not good enough, then decide where to physicalize the tables. we also only build data marts when required, so on these points we agree, and are the same.
the fun thing is, this architecture meets the requirements mentioned above:
– backend system: system-of-record, complete history (which leads to an auditable and compliant dwh), ability for real-time loading, preventing cascading-update impact, easy to load, easy to build, easy to extend, easy to restart and easy to scale.
– frontend system: an easy-to-use environment for the end user, in our architecture a dimensional model.
the big advantage of our architecture: only two required storage layers with one very simple layer: the historical dsa. this layer is 100% generated and additional knowledge is not needed, it’s just a historical copy of the source system.
ok – i don’t see this as an advantage. as i explained earlier, the enterprise data warehouse systems architecture contains a data vault model and methodology as one component of the architecture. the staging area does not store history (it is not a historical dsa in the world of data vault implementation), the dv model (your edw) does store history, all of it. the data marts (data delivery layer) should be virtualized, and when it is physicalized it stores only rolling history, or limited history – not the entire history that is contained in the data vault. so, compared with your two layers (which i read to store all history twice), the approach that is recommended by the data vault model and methodology actually requires less disk space overall. so i’m not sure where your claim to “less storage” for your architecture is coming from, forgive my ignorance.
using this architecture i really don't see the complexity of the load routine from dsa to dimensional data warehouse. a unit is loading a fact or dimension table. in 95% of cases this can be loaded easily. only in exceptional situations (< 5%) do we take an additional step (storing the data twice in the dsa), e.g. for matching and deduplicating customer data. so, we use the same architectural principle: only an additional storage layer when required.
of course, we have the benefit of the power of a data warehouse appliance: no aggregates, no partitioning, no indexes, high performance loading and querying. in other words, a data warehouse appliance not only boosts your load- and query-performance but also allows a very straightforward architecture. that's why i recommend that every dwh architect of a medium- or large-volume data warehouse use an appliance.
ok – we have a similar approach: apply business rules before delivering data to the business user, and "separate" the job of sourcing the data from the source systems (timing, availability, getting the data in) from the job of applying business rules. in that regard, we are matching in beliefs. in the world of data vault, we accommodate real-time delivery direct to the data warehouse (a term which we define differently) from cobol, xml, and unstructured data sets, along with batch cycles. we load a non-historical dsa that is fully recoverable and fully fault tolerant, has no constraints and no foreign keys, and therefore allows "the good, the bad and the ugly" data to make it all the way into the data warehouse/data vault. i fear that the 3nf historical dsa you are discussing, and the questions i asked above, shed light on the fragile nature of "loading 100% of the data within scope, 100% of the time" – making the approach that inergy takes less likely to pass a full and complete audit (especially given that the exact source system structures are copied to the historical dsa), but that's just my opinion.
appliances, appliances… it doesn't matter what type of data model or systems architecture you have (data vault or not). i have customers using netezza and data vault just as successfully as dimensional models and data vault. now, some appliances (like columnar data stores) don't care what model you throw at them. the physical storage components vertically partition the entire model; there is no physical concept of "table" in a columnar db, just a logical definition of a table structure (made up of columns). netezza likes flat-wide structures, so… we model the logical data vault, and the physical dv model denormalizes some components to make it work better with the box. appliances don't matter in this discussion one way or the other. i can say the same thing about teradata boxes and teradata appliances, or greenplum: no partitioning, no aggregates, no indexes, high performance loading and querying, etc… it's not about the appliance; our discussion is about the merits of the systems architecture and the data model design.
we don't have to discuss the strengths of a dimensional model for end users, because a dimensional model is also part of the dv-architecture. you mention some disadvantages of the dimensional model, but this is from the perspective where the dimensional model is the central (auditable, historical, system-of-facts) data warehouse. in our situation this dwh is the historical dsa. in other words, the dsa has some additional purposes compared to the traditional dsa.
finally, i want to respond in advance to some possible reactions and concerns:
– "with this architecture you don't have the ability to drop the data mart (= star schema in the data warehouse), because you will lose your business rules." that's correct. but our experience is that we never have to drop data marts; why should we? dropping a data mart is in my opinion really an exception: the facts are designed around the process, and processes don't disappear often. besides this, i disagree when people talk easily about dropping data marts, because this is the product which the business uses. this is why they pay the bill. moreover, a lot of reports are based on the data marts; it isn't easy to tell the business: we're going to drop your reports!
[my response: business changes. when business changes, they demand different results; when they demand different results, they want different structures. when they ask new questions, they want new reports and new data sources. sure, there are some core functionality reports, and those marts stay intact (most of the time), but there are times when business needs a different grain of data, and you can't simply run off and add a new dimensional key to a fact table; this would destroy the credibility of all the existing facts in that table (unless they were re-built using the new grain). we resolve to build new marts for different business needs. sure, there are times when we add columns to existing dimensions and facts, but only when it fits. if inergy's business is not changing, then either it is a federally operated business (government owned), or it is a dying business because of a lack of competitive nature and a lack of change in asking new questions. i once had a case where auditors needed to review data once a year for 3 weeks; when they left, they no longer needed the data – so we dropped the data marts, and re-built them the next year (with new requirements). i've had other, one-off requests: please run this report/answer this question for this month's reports only….]
– “but with the dv you can easily create the dimensional data mart, that’s not the case with your architecture”. that’s correct, but in the dv-architecture the hard work is done from dsa, to raw dv to business dv. so, the complete data logistic must be compared, not only a part of the data logistic.
[my response: no. you have an incorrect assumption. the hard work, as you put it, is done going from the data vault edw to the data mart layers. your statement that it is done going from dsa to raw dv is incorrect; going from raw dv to business dv is only correct if you build a business dv, and don't forget, the business dv is just another data mart. i do not understand your statement: so the complete data logistic must be compared… if you mean complete history, then you are incorrect. going from the stage to the raw dv deals with delta processing (we do not advocate nor use a historical dsa). if you mean from the raw dv to the data mart layer, then that is also incorrect. i have implementation designs which allow incremental build-outs of the data mart layers… remember: scalability is a fundamental tenet across everything in the enterprise systems architecture that includes a data vault.]
– "but with the dv you can use virtual data marts, so that's one storage layer less." i think you agree that the performance will not be optimal, because of the additional required joins (especially between big tables: link-satellite). in almost all situations you will materialize those views. besides this, you still need to develop and maintain this layer.
[my response: performance depends on the environment, the hardware, the partitioning, and the size of data you are trying to retrieve. i cannot and will not agree with the statement: performance will not be optimal… i have customers with data vaults on teradata, where the join performance is exceptional. i have other customers on sqlserver 2008 r2 enterprise where the join performance is exceptional, and i have other customers on oracle, db2 udb, netezza, and even sqlserver 2005 where the join performance is exceptional. i’ve been doing performance and tuning as part of my career for systems like nike, aaa, pepsi bottling company, nationwide insurance, and expedia, and so on for over 20 years. i completely 100% disagree with your statement that: “you will almost always have to materialize these views” – this simply couldn’t be further from the truth. in fact, it isn’t true. i have customers using virtual data marts today. besides which, you still have another layer to develop and maintain: the inergy system must develop and maintain conformed dimensions and facts, and what happens when these don’t perform? what if the business wants a cube? again, the architectures are the same, both demand that data marts be made available, and both can virtualize them – performance is another story that all depends on needs and desires, and hardware layers.]
– "you don't have the flexibility when the source changes." well, we only have to change the dsa-tables. our experience is that in 95% of the cases this means adding an attribute. that's no problem. and if the source drastically changes, ok, in that case we also have to change the etl to the dwh and the dwh schema. but only two layers have to be changed. bottom line: a flexible architecture is more important than a flexible data model.
[my response: as i said earlier, congratulations on a stable, static dimensional model. the only time i've ever seen a stable dimensional model is a) when the business is government based and doesn't change much, or b) it's a dying business, not being competitive in the market place, or c) it doesn't have any real competitors to what it does, so it becomes lazy, and "does business the way it always has, because it feels like it works." are you overloading the definitions of fields in the dimensional model? what happens when new "cases" come in from the business? what happens when a new source system arrives with data that is defined differently?
ok – i want to clarify: the data vault model is an architecture. it is a data architecture. it is flexible. how are you defining flexible architecture? when i look at a systems architecture i speak about the major components that need to be in place: staging, edw, data marts. these components don’t need to be flexible – i’m not going to “wake up one morning” and invent a new layer for the systems architecture, so in that manner of speaking the systems architecture needs to be stable and strong. stand the test of time. when you say: flexible architecture more important than a flexible data model, you have contradicted the previous sentences… where you state you only have to make changes to the “structures in the dsa…” well, that means you’re making changes to the data model does it not? so, the data model does have to be flexible does it not?
in the enterprise systems architecture world using a data vault approach, we too can easily absorb changes. adding a new system is simple: add the new staging tables (also 1 for 1 with the source system, but the tables do not contain history, nor do they contain pk's or fk's). we then add the columns, or structures (hubs, links, sats), to the data vault model (edw) as necessary. sometimes it is not necessary. then, only on demand from the business, do we add them to the data mart or data delivery layer. this could be virtual or physicalized.]
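as a hedged illustration of that absorption (python's sqlite3 again; all names invented): adding a new source touches nothing that exists. it just adds a staging table and hangs a new satellite off the existing hub.

```python
import sqlite3

# hypothetical sketch of absorbing a new source system: the existing hub is
# untouched; the new system just gets its own staging table and satellite.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- existing structures, built for the first source system (crm)
    CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY, load_dts TEXT);
    CREATE TABLE sat_customer_crm (customer_bk TEXT, name TEXT, load_dts TEXT);
""")
# a new source (billing) arrives: one new staging table (no history, no
# pk/fk's) and one new satellite hung off the existing hub; nothing that
# already exists is re-engineered
con.executescript("""
    CREATE TABLE stg_billing_customer (cust_no TEXT, bill_city TEXT);
    CREATE TABLE sat_customer_billing (customer_bk TEXT, bill_city TEXT, load_dts TEXT);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

contrast this with a source-structured historical dsa, where a new or replaced source system means new historized structures and new downstream etl against them.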
summarized: what's our 'magic'? well, it isn't magic, it's just a practical approach: a historical dsa, a conformed dimensional model in one physical database, additional storage layers only when required, a dwh appliance, generating as much as possible and a lot of standardization. really, every day we experience that this architecture works! i really don't recognize your warnings: 'mass complexity', 'hits you like a ton of bricks' and 'major re-engineering'. honestly, if that were the case, inergy wouldn't exist anymore!
[my response: again, i want to say: congratulations on a well-founded, well-designed system. inergy has a system that works and it's based on dimensional modeling at the lowest level – that's wonderful; and again i want to emphasize to my readers and everyone out there: if you don't see or feel pain, then don't make changes – the dv model and methodology may not be a fit for you if what you have is working. but when you start to experience pain and problems, that's when you seek a new and different approach. it's at that time that the values and benefits of the dv begin to make sense. it's also when the dimensional modeling system has been poorly constructed to begin with, or the business is going through rapid changes, that the dv approach makes sense.]
i want to emphasize that i appreciate and acknowledge your in-depth dwh-knowledge (and the knowledge of your 'dv-colleagues' in the netherlands), but apparently we don't agree on 'the best' dwh-architecture. i hope you can tell me what the advantages are of the dv compared to our architecture. i'm looking forward to your reaction, because i want to learn from your experience as you want to learn from others! another reason is that i'm currently writing an article for a dutch online architecture magazine which describes this alternative architecture.
[my response: thank-you for your kind words and in-depth discussion. i thoroughly enjoyed the tough and well thought out questions/challenges you put forward. i also enjoyed hearing about a successful dimensional data warehouse (depends on how you define success, i know….) it’s just very rare that i see companies in action with a dimensionally based data warehouse that is working for them, and not causing problems – perhaps it’s because of the line of work i’m in, perhaps it’s because i’ve been a “trouble shooter” all my professional life, whatever the reason – again, congratulations. they are well deserved.
i want to emphasize that in no way shape or form did i ever intend to bash you, or your systems – i am excited, and over-zealous (just a bit) in my responses, and if anything i said offended you, my apologies. i hope to have more discussions like this one in the future, cheers]
if anyone has any questions, please don’t hesitate to post comments, or send me an email. do others have success stories for dimensional warehouses they’d like to share? i’d like to hear about them.