Data Vault Vs Dimensional – Part 3

Welcome back to part 3 in the series.  In this entry I will begin comparing and contrasting Star Schema modeling with Data Vault modeling – from a DATA WAREHOUSE perspective, NOT from the perspective of an end-user-accessible data mart model.  I’ve said it before, and I’ll say it again: Star Schemas, Cubes, and flat-wide tables are the best for delivering data directly to end-users.  The Data Vault is the best for your Data Warehouse environment.

Let me back up…

First and foremost, I separate the Data Warehouse from the Data Marts.  My definition of Data Warehouse follows Bill Inmon’s definition: reliable, time-variant, consistent, integrated, etc.  Where I define DATA WAREHOUSE differently than the industry is this: I believe a Data Warehouse should have the following additional characteristics:

  1. easy to architect
  2. easy to build
  3. easy to extend (flexible)
  4. easy to scale (in number of tables, data size, loading patterns, and querying patterns)
  5. easy to load – SIMPLE, pattern based, repeatable loading processes
  6. NON-QUERYABLE BY END-USERS

So where are the differences between Dr. Kimball and you?

Well hey, something’s got to give somewhere, and today it is nearly impossible to make a model that is both easy for end-users to query AND has all the other characteristics that a warehouse needs!  There are two fundamental differences between my definition of “Data Warehouse” and Dr. Kimball’s:

  1. Dr. Kimball believes that the Data Warehouse is a LOGICAL collection which includes the staging area and a multitude of star schemas or data marts (federated or not). 

    I believe the Data Warehouse is a PHYSICAL element that stands on its own, has its own architecture, and has its own data model between the staging area and the data marts.

  2. Dr. Kimball believes that the Data Warehouse should contain consistent, quality-cleansed, and business-aligned data.  I believe that is NOT the job of the data warehouse, but rather the job of the Data Marts.

    I believe that the Data Warehouse is a raw system of record that is integrated by business key, but that is all.

So why the split?  Doesn’t it introduce Yet Another Layer of Storage (YALS)?

Some might see it this way, and physically speaking, YES – of course it does.  Most people automatically jump to the wrong conclusions at this point.  Unfortunately, because of these false beliefs and incorrect assumptions, the Data Warehouse never “grows up”, never makes it to the “standardization” stage, never reaches scalable patterns, and so on.  Why?  Because people insist on banging their heads against a brick wall!

  • They know that federated star schemas won’t grow or scale beyond a certain point.
  • They understand that adding true real-time feeds alongside the batch feeds causes problems with star schemas (empty dimension records, updated dimension data, missed key structures, mismatched key structures, etc.)
  • They feel the pain of trying to change a non-standard process 3 months into production
  • They’ve been in a position where they have been outsourced
  • They see the business users create their own solutions

And yet, even with all this (and more), they still insist on creating federated star schemas as their data warehouse.  They swear that it’s easier, cheaper, and faster!  Then, 6 or 9 months down the road into production, they smash headlong into the brick walls above.

  • They can no longer maintain the growth of the data warehouse
  • They can no longer make changes to already too complex loading routines
  • They can no longer add new systems, or new data to the already too wide dimensions
  • They can no longer be agile enough or responsive enough to the business needs for new information
  • They can no longer grow the system because of query performance problems

With all these problems, why do they persist in claiming that Star Schemas make a good Data Warehouse?

I really can’t answer that question, because I’m not sure why.  But I will say this: if you STILL haven’t tried the Data Vault model, then you truly are missing out.  I would encourage you to build 2 Hubs, 1 Link, and 3 Satellites (at a minimum) and give it a shot – a minimal starter schema is sketched below.  What have you got to lose?  Nothing.  Only the world to gain, and if you do it right, it should bring you ease of use, ease of scalability, flexibility, and rapid build-out of data marts along the way.  If you end up not liking it after 6 months, then simply drop the tables – NO HARM DONE!
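
Here is what that starter model might look like – a minimal sketch under my own assumptions: the entity names (customer, order) and the hash-style keys are illustrative choices on my part, not requirements, while the load date (load_dts) and record source (rec_src) columns are standard Data Vault metadata.

```python
import sqlite3

# An illustrative starter Data Vault: 2 Hubs, 1 Link, 3 Satellites.
# Entity and column names are hypothetical; load_dts (load date) and
# rec_src (record source) are the standard Data Vault metadata columns.
DDL = """
-- Hubs: one row per unique business key, and nothing else.
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,      -- hash of the business key
    customer_bk TEXT NOT NULL UNIQUE,  -- business key from the source
    load_dts    TEXT NOT NULL,
    rec_src     TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk TEXT PRIMARY KEY,
    order_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL,
    rec_src  TEXT NOT NULL
);

-- Link: the relationship between the two business keys, and nothing else.
CREATE TABLE lnk_customer_order (
    customer_order_hk TEXT PRIMARY KEY,  -- hash of both business keys
    customer_hk TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk    TEXT NOT NULL REFERENCES hub_order (order_hk),
    load_dts    TEXT NOT NULL,
    rec_src     TEXT NOT NULL
);

-- Satellites: descriptive attributes, historized by load date.
CREATE TABLE sat_customer (
    customer_hk TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_dts    TEXT NOT NULL,
    name        TEXT,
    city        TEXT,
    rec_src     TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
CREATE TABLE sat_order (
    order_hk  TEXT NOT NULL REFERENCES hub_order (order_hk),
    load_dts  TEXT NOT NULL,
    order_dt  TEXT,
    total_amt REAL,
    rec_src   TEXT NOT NULL,
    PRIMARY KEY (order_hk, load_dts)
);
CREATE TABLE sat_customer_order (
    customer_order_hk TEXT NOT NULL REFERENCES lnk_customer_order (customer_order_hk),
    load_dts TEXT NOT NULL,
    status   TEXT,
    rec_src  TEXT NOT NULL,
    PRIMARY KEY (customer_order_hk, load_dts)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Notice the division of labor: hubs hold only business keys, links hold only relationships, and all the descriptive, history-bearing attributes live in satellites.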

Of course, if you choose NOT to go the Data Vault route, I’d like to hear why you chose differently – what caused you pain or problems?

All that said, I still don’t understand: WHY should I have YALS (Yet Another Layer of Storage)?

Fair enough.  Let me ask you this: 

In life, when you’re faced with a huge problem to solve, what’s the FIRST THING you do in order to begin constructing a solution?  If you’re like most people, you would have said: I break it down into bite-sized, manageable chunks, so I can solve small pieces of the puzzle – in order to solve the entire problem.

Why then do you insist on solving the huge problem of data warehousing in one step?  Generally, everyone I’ve met who builds a data warehouse according to the OLD SCHOOL of thought likes to solve ALL of the following problems before the data even hits the staging area – all in a SINGLE ETL loading routine (contrast this with the single-purpose load pattern sketched after the list):

  • Source system timing and availability (log on to multiple systems at once to fetch all the data in one read cycle)
  • Cross-system dependency and sourcing (filters, joins, aggregates)
  • Business rules – mid stream, to change, fix, align the data before landing it in stage
  • Quality Cleanse and Merge – mid-stream before landing it in the stage
  • Fast Load to staging areas
  • Parallelism & Partitioning for speed
  • Indexing – which oftentimes YOU CAN’T CHANGE on the source systems!
  • Restartability – the ability to “have the load pick up where it left off” simply by restarting it
  • Scalability – the ability to have the load run with hundreds of millions of rows, and do it fast
  • Scalability – the ability to “add another system” in to the existing process with little to no impact
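
For contrast, here is a minimal sketch of a single-purpose, pattern-based loading step, using the hypothetical hub_customer table from the earlier sketch (again, my own illustrative names and MD5 hash keys, not a prescribed implementation).  Because the hub load is insert-only and skips business keys it has already recorded, it is repeatable, and restartability comes for free by simply running it again.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,      -- hash of the business key
    customer_bk TEXT NOT NULL UNIQUE,  -- business key from the source
    load_dts    TEXT NOT NULL,
    rec_src     TEXT NOT NULL
)""")

def load_hub_customer(conn, staged_keys, rec_src):
    """Insert-only hub load: record each business key at most once.

    The pattern never updates and never deletes, so a failed run is
    "restarted" by simply running it again: already-loaded keys are
    skipped and only new keys are added.
    """
    load_dts = datetime.now(timezone.utc).isoformat()
    rows = [
        (hashlib.md5(bk.encode("utf-8")).hexdigest(), bk, load_dts, rec_src)
        for bk in staged_keys
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer "
        "(customer_hk, customer_bk, load_dts, rec_src) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

# Re-running with overlapping keys is harmless:
load_hub_customer(conn, ["CUST-001", "CUST-002"], "crm")
load_hub_customer(conn, ["CUST-002", "CUST-003"], "crm")  # CUST-002 is skipped
```

The same small, repeatable shape applies to link and satellite loads – each step solves exactly one problem, instead of all of the above at once.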

WHY?  Why? why?  Why do people insist on solving ALL of the problems from data, to architecture, to systems design in a SINGLE STEP?

The end result is MASSIVE COMPLEXITY.  Ok, maybe not with the first star you build, but when you add a second source system, a third, a fourth, a fifth – or you begin adding 10 to 15 more conformed dimensions and a few more source systems – this is when it HITS YOU LIKE A TON OF BRICKS!

This is when your original estimates of 3 months turn into 6- and 9-month turnaround times for the business.  This is when your system breaks every time you touch a loading process upstream, or when every change to the data model has HUGE impact requiring MAJOR re-engineering.  This is when the business finally gives up and says: ENOUGH, I’ve had enough, you cost too much, and you take too long.  I will now outsource the data warehouse, or I will shut it down and restart it from scratch, or I will build it myself…

You’ve all been there, I know!  If you’re not there, and you’ve experienced success with the Star Schema in terabyte environments, with a lot of source systems, or with real-time, then I’d love to hear from you.  Please enlighten me as to what you’ve done that works, so that I may learn how you succeeded.

Ok, back to the Data Vault… In the next entry, I’ll cover what it brings to the table to help you solve these problems, and break the cycle!

Cheers,
Dan Linstedt
PS: Inside the coaching area, I teach you HOW to solve these problems, step-by-step.


One Response to “Data Vault Vs Dimensional – Part 3”

  1. Frank Habers 2011/08/01 at 4:59 am

    Hello Dan,

    I just read that you would like to hear from me and our company Inergy, because we have experienced a lot of success with star schemas in terabyte environments, with multiple sources, including (near-)real-time loading, and you want to know what we’ve done that works.

    Our architecture is used in our “BI in the cloud” solution in the Netherlands. This is a true BIaaS environment with BI-PaaS (technology: Netezza, PowerCenter & MicroStrategy), including development, maintenance, system management, and support of the DWH. In other words, the complete ‘package’ for a data warehouse environment. This environment has been in place for almost 4 years and we serve a lot of customers.

    Let me first explain where we agree: a good backend system and a good frontend system are required in an enterprise data warehouse. In the DV architecture, the combination of the DSA, raw DV, and business DV is the backend system (in my definition), and the combination of data marts is the frontend system. We also agree on the requirements of the backend system and the frontend system:
    – backend system: system-of-record, complete history (which leads to an auditable and compliant DWH), the ability for real-time loading, prevention of cascading update impact, easy to load, easy to build, easy to extend, easy to restart, and easy to scale.
    – frontend system: an easy-to-use environment for the end user, e.g. a dimensional model.

    We also agree that both the Inmon and the Kimball architectures do NOT meet those requirements.

    My concern with the DV architecture is its use of a lot of storage layers (DSA, raw DV, business DV, data marts). Let’s face it: every layer needs to be developed, maintained, and managed; the data has to be processed into every layer; (DV) knowledge is required; and of course the data needs to be stored. This leads to additional costs and a longer time-to-market. I think your argument against this firm statement is that the DV architecture breaks complexity down into bite-sized manageable chunks (divide and conquer). If I compare this to our solution (see below), I really don’t recognize this advantage, and if there is a (small) advantage, the disadvantages of the additional storage layers are (in my opinion) much bigger. Bottom line, my concern with the DV is indeed YALS: Yet Another Layer of Storage. In other words, the DV will lead to a good enterprise data warehouse, but it takes (in my opinion) more time than needed.

    Ok, I think you’re now curious about the Inergy architecture. Well, this architecture is quite simple (and not unique): our backend system is a historical(!) data staging area. This historical DSA has the same structure as the source, plus start and end timestamps. We generate the process from source to DSA 100%, including delta detection, transport, and archiving. The frontend system is a (traditional) dimensional DWH storing the data at the lowest grain. Physically it’s one database with dozens of conformed star schemas. The business rules are implemented from DSA to DWH.
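
    To make that concrete, here is a simplified, illustrative sketch of what such a generated delta-detection load into a historical DSA does (hypothetical Python with a single attribute column; our actual generated code is of course different): close the open version of a record when it changes or disappears from the source snapshot, and open a new version for new or changed records.

```python
import sqlite3
from datetime import datetime, timezone

OPEN_END = "9999-12-31"  # sentinel end timestamp meaning "currently valid"

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dsa_customer (
    customer_bk TEXT NOT NULL,  -- business key, same structure as the source
    name        TEXT,           -- source attribute (just one, for brevity)
    start_dts   TEXT NOT NULL,  -- when this version became current
    end_dts     TEXT NOT NULL   -- OPEN_END while the version is current
)""")

def load_dsa_customer(conn, snapshot):
    """Delta-detect a full source snapshot {business_key: name} into the DSA."""
    now = datetime.now(timezone.utc).isoformat()
    current = dict(conn.execute(
        "SELECT customer_bk, name FROM dsa_customer WHERE end_dts = ?",
        (OPEN_END,)))
    # Close the open version of records that changed or disappeared.
    for bk, name in current.items():
        if snapshot.get(bk) != name:
            conn.execute(
                "UPDATE dsa_customer SET end_dts = ? "
                "WHERE customer_bk = ? AND end_dts = ?",
                (now, bk, OPEN_END))
    # Open a new version for new or changed records.
    for bk, name in snapshot.items():
        if current.get(bk) != name:
            conn.execute(
                "INSERT INTO dsa_customer VALUES (?, ?, ?, ?)",
                (bk, name, now, OPEN_END))
    conn.commit()

load_dsa_customer(conn, {"C1": "Jansen", "C2": "de Vries"})
load_dsa_customer(conn, {"C1": "Jansen-Smit", "C2": "de Vries"})  # C1 changed
```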

    Don’t we have data marts? Well, only when required, and that is only in exceptional situations, e.g.:
    – KPI applications: a specific star schema with KPIs is required, derived from the DWH star schemas. But this KPI schema is stored again in physically the same database, with dimensions conformed to the DWH, so it isn’t a data mart from a user perspective, only from a technical perspective (loading and storing the same data again).
    – data mining purposes: e.g. for marketing analytics, one record per customer with a lot of characteristics is required. Again, this data (mart) can be stored in the same physical database (Netezza supports in-database analytics).

    So, data marts are the exception instead of the default, resulting in a minimum of storage layers. Our experience is that 95% of the data is not copied to a data mart.

    The fun part is, this architecture meets the requirements mentioned above:
    – backend system: system-of-record, complete history (which leads to an auditable and compliant DWH), the ability for real-time loading, prevention of cascading update impact, easy to load, easy to build, easy to extend, easy to restart, and easy to scale.
    – frontend system: an easy-to-use environment for the end user, in our architecture a dimensional model.

    The big advantage of our architecture: only two required storage layers, one of which is very simple: the historical DSA. This layer is 100% generated, no additional knowledge is needed, and it’s just a historical copy of the source system.

    Using this architecture, I really don’t see the complexity of the load routine from DSA to dimensional data warehouse. A unit of work is loading a fact or dimension table. In 95% of cases these can be loaded easily. Only in exceptional situations (< 5%) do we take an additional step (storing the data twice in the DSA), e.g. for matching and deduplicating customer data. So, we use the same architectural principle: an additional storage layer only when required.

    Of course, we have the benefit of the power of a data warehouse appliance: no aggregates, no partitioning, no indexes, high-performance loading and querying. In other words: a data warehouse appliance not only boosts your load and query performance, it also allows a very straightforward architecture. That's why I recommend that every DWH architect of a medium or large volume data warehouse use an appliance.

    We don't have to discuss the strengths of a dimensional model for end users, because a dimensional model is also part of the DV architecture. You mention some disadvantages of the dimensional model, but those apply from the perspective where the dimensional model is the central (auditable, historical, system-of-facts) data warehouse. In our situation, that DWH is the historical DSA. In other words, the DSA has some additional purposes compared to the traditional DSA.

    Finally, let me respond in advance to some possible reactions and concerns:
    – "With this architecture you don't have the ability to drop the data mart (= star schema in the data warehouse), because you would lose your business rules." That's correct. But our experience is that we never have to drop data marts – why should we? Dropping a data mart is in my opinion really an exception; the facts are designed around the process, and processes don't disappear often. Besides this, I disagree when people talk so casually about dropping data marts, because this is the product the business uses. This is why they pay the bill. Moreover, a lot of reports are based on the data marts; it isn't easy to tell the business: we're going to drop your reports!
    – "But with the DV you can easily create the dimensional data mart; that's not the case with your architecture." That's correct, but in the DV architecture the hard work is done from DSA to raw DV to business DV. So the complete data logistics must be compared, not only a part of it.
    – "But with the DV you can use virtual data marts, so that's one storage layer less." I think you'll agree that the performance will not be optimal, because of the additional required joins (especially between big tables: link-satellite). In almost all situations you will materialize those views. Besides this, you still need to develop and maintain this layer.
    – "You don't have the flexibility when the source changes." Well, we only have to change the DSA tables. Our experience is that in 95% of cases this means adding an attribute. That's no problem. And if the source changes drastically, OK, in that case we also have to change the ETL to the DWH and the DWH schema. But only two layers have to be changed. Bottom line: a flexible architecture is more important than a flexible data model.

    Summarized: what's our 'magic'? Well, it isn't magic, it's just a practical approach: a historical DSA, a conformed dimensional model in one physical database, additional storage layers only when required, a DWH appliance, generating as much as possible, and a lot of standardization. Really, every day we experience that this architecture works! I really don't recognize your warnings: 'massive complexity', 'hits you like a ton of bricks', and 'major re-engineering'. Honestly, if that were the case, Inergy wouldn't exist anymore!

    I want to emphasize that I appreciate and acknowledge your in-depth DWH knowledge (and the knowledge of your 'DV colleagues' in the Netherlands), but apparently we don't agree on 'the best' DWH architecture. I hope you can tell me what the advantages of the DV are compared to our architecture. I'm looking forward to your reaction, because I want to learn from your experience just as you want to learn from others! Another reason is that I'm currently writing an article for a Dutch online architecture magazine which describes this alternative architecture.

    Best regards,

    Frank

