
#datavault: New Response to Frank Habers – Part 1

Published on 2011/09/28 by in Data Vault, News

Hello everyone, this discussion has gotten interesting; however, Frank’s latest article is for the most part completely incorrect.  I have tried to post a response to his blog and his article without much success.  Furthermore, his article in the magazine had to be run through Google Translate, so I am sure that parts of the meaning have been lost.  Let me just start by saying the following: he openly admitted to me in writing: “However, I still do not understand when and why to use the Data Vault (DV) modeling technique.”

Full disclosure:

I DON’T LIKE POSTING THIS KIND OF INFORMATION – BUT IT SEEMS NECESSARY, as people and prospects are becoming misguided and confused by incorrect assumptions and statements, that are being posted by people who have NEVER been certified, and have NEVER built an approved/audited Data Vault model in their life.   MY APOLOGIES to my regular readers, I will return to the regularly scheduled “information about Data Vault” very soon.  I just feel the real facts need to be made known.

Frank is a vendor of a BI Solution as a service.  His solution is good, and is interesting, BUT the solution is not geared to handle cross-business functional applications.  His solution (as he explained it to me on the phone) is targeted at one specific line of business.  Frank also went on to explain the following to me (please remember: I am not trying to knock Frank – he is a genuinely nice guy, and has some valid points – but he is a vendor, and his views are lopsided at best):

  • He is not a certified Data Vault Modeler
  • He has never built a Data Vault model (or at least not one that has been “audited” or passed approval by a certified Data Vault Modeler)
  • He currently has no pain and therefore doesn’t see the value in Data Vault modeling or methodology
  • He openly admits that he has not had to, and does not deal with more than 5 to 10 TB of information, so he does not understand the nature of big-data and what changes it brings.
  • He also stated to me, that in his opinion, because he has no pain, that it is my responsibility to prove to him the value of the Data Vault Model – well, if he has no pain, then I will never be able to convince him that the Data Vault is a good solution for what he is doing.

I’ve said it before, I’ll say it again: if what you have (Frank) is working, then don’t change it.  I have also outlined specific pain points that the Data Vault solves – many times, in my business book, on the web, on my blog entries and so forth.

Let me say this: IF THERE WERE NO VALUE TO THE DATA VAULT – THEN THERE WOULDN’T BE THE ADOPTION BY CUSTOMERS AROUND THE WORLD.  However, there are now over 250 world-wide organizations and government agencies that have adopted the Data Vault model and methodology to solve their pain points because what they were using wasn’t working for them!

Lastly under full-disclosure, I am not a vendor of Data Vault.  Data Vault Modeling is free to use, free to understand, free to read about.  He is a vendor of a solution.  His goal is to sell more stuff to you.  My goal is to educate you, and give you new tools in your tool-box should you choose to use them.

About Critical Review…

I have NO problems with critical review, BUT if you’re going to perform critical review, then you had better have facts to back it up.  Please don’t make false claims and untrue statements about something you have not spent the time getting certified in, or have not even spent the time building.  Yes, that’s right…

NEITHER Frank Habers nor Dejan Sarka has EVER been to certification class – nor have they EVER built a Data Vault Model which would have had to be approved by a certified Data Vault Modeler as an accurate implementation of the Data Vault.

First, some background about Dejan Sarka…

He wrote a negative (some call it critical, I call it hot-air) review on his web-site.  Here are the real facts behind THAT story….

  1. I was asked to consult with an organization that wanted to build a Data Vault
  2. The company Dejan worked for was a “primary sub-contractor / approved vendor” for this customer
  3. The customer asked me to “get to know their primary” – so I had a phone call with their CEO and Director/CTO
  4. The phone call lasted all of 15 minutes, they didn’t want to “lose business” to me, and they didn’t want to get their people trained in Data Vault
  5. I then sent the first 5 article links to them at their request
  6. They responded by forwarding these articles to Dejan.
  7. I never got a phone call from Dejan – he simply reviewed the articles in one day, didn’t read anything else and made false assumptions and posted a blog entry.
  8. This was all to keep the customer from forcing them to learn Data Vault, and from contracting with me.
  9. What they didn’t understand was that I wasn’t going to take their business away from them, but they just didn’t want to spend money on learning something new.
  10. After Dejan posted, I tried and tried, and tried to post a reply, a rebuttal, on their web-site/blog entry – but it kept giving me technical errors – it simply could not and would not submit comments

For Frank to reference Dejan, means he truly doesn’t understand the Data Vault – which means he has no right to openly and publicly criticize something he doesn’t understand.  If Frank wanted to reference Dejan, he should have gotten the facts from me first – before referencing this guy.

Anyhow Critical Review is necessary, and most welcome – and it has happened on the Kimball Forums as well recently (which is fine).  The gentlemen on the Kimball Forums at least got their facts straight (for the most part).  I would gladly welcome critical review from Certified Data Vault Modelers who have also built two or three approved / audited Data Vault Models for customers.

With regards to Frank…

I like Frank, interesting guy.  Got some good comments, raised some good questions (which I blogged on before).  However, I spent 3 hours on the phone with him explaining the Data Vault, and the facts of how the Data Vault came to be.  I haven’t done this for anyone else that I can remember.  I told him I would consider a second reply based on our conversation.  He agreed to wait for the reply, but nope – he jumped the gun… twice: first he posted a full blog entry in English with some inaccurate information, and now a vendor-based, I-want-to-sell-you-my-solution article in a magazine, again with false facts and misguided assumptions.

I will go through his “magazine article” in my next post, and refute the false claims, and challenge him to post FACTS to back them up.  For now, let’s take a look at the “written / as yet unpublished” response he sent me in my word doc that we exchanged…

Written Exchange Reviewed

He said: The DV-modeling technique does not add any additional value in the backend of the EDWH compared to the HDSA, considering the requirements of the backend.

I say: Hogwash.  The DV modeling technique adds a tremendous amount of value in the back-end of the EDW by forcing the team to understand the business.  By forcing the team to relate the business keys to the data to the business processes up-stream.  By going through this exercise and putting business keys in place, the team can help the business focus on GAP ANALYSIS and finding / exposing the gaps between the business and the actual data that the operational systems are picking up.
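As a rough illustration of the gap analysis described above, here is a minimal Python sketch. It assumes two hypothetical source systems ("crm" and "billing") that share a customer business key; the system names, key values, and the `normalize_key` and `gap_analysis` helpers are invented for this example and are not part of any Data Vault standard.

```python
def normalize_key(raw: str) -> str:
    """Trim and upper-case a business key so the same key matches across systems."""
    return raw.strip().upper()


def gap_analysis(keys_by_source: dict[str, list[str]]) -> dict[str, set[str]]:
    """For each source, return the business keys that other sources have but it lacks."""
    normalized = {
        source: {normalize_key(key) for key in keys}
        for source, keys in keys_by_source.items()
    }
    all_keys = set().union(*normalized.values())
    return {source: all_keys - keys for source, keys in normalized.items()}


# Hypothetical business keys as they arrive from each source system.
sources = {
    "crm": ["C-1001", "c-1002 ", "C-1003"],
    "billing": ["C-1002", "C-1003", "C-1004"],
}

gaps = gap_analysis(sources)
for source, missing in sorted(gaps.items()):
    print(source, "is missing", sorted(missing))
```

Once the keys are integrated in one place, the gap is a simple set difference; with disparate copies of source tables, the same question requires matching the keys first.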

An HDSA (persistent historical data storage area; in his definition, a direct 1-to-1 copy of source structures) causes the following: massive replication of data – there is NO delta checking on load.  So he has to rely on appliances like Netezza to do the compression for him.  Furthermore, the tables are disparate, non-integrated (in any fashion), and often contain duplicate data across multiple sources.  There is NO data modeling to speak of, NO understanding of the business processes to work through, and a much higher number of indexes needed to get the data out (unless, of course, he again relies on Netezza to handle the indexing for him through the FPGA logic).

He said: A backend-system based on the DV-modeling technique requires more effort in terms of design, development, maintenance and management than a HDSA, because the data is remodeled.

I say: Partly true!  BUT you end up with a much tighter and DE-COUPLED solution (de-coupled from the source system architecture), AND the IT folks actually begin to understand the business by exercising this process.  That means the IT workers who understand the business become more valuable to the business – especially when they go to create the “Data Marts” downstream.  The HDSA is just a collection of copied tables, tightly coupled to the source system.  I would say this: YOU STILL SHOULD BE MODELING YOUR DATA PROPERLY, in an attempt to understand the business, and it’s been proven that de-coupling the data warehouse from the source system makes the entire architecture more flexible in the long term.

Furthermore it sounds as though the HDSA that Frank proposes (warning: this is my assumption, and may be incorrect) copies all the source system data across every day or every load cycle.  How else can you keep the entire “snapshot” in synch with the source system using his architecture?

The Data Vault Model (being decoupled from the source) only requires a) an audit trail, b) only the changes (if available), c) only in the worst case, do we need to copy the entire source data set on every load  – 80% to 99% of the customers I work with are able to work with solution “a” and solution “b” – meaning that the Data Vault provides for growth in data, and very large data sets – scalability – without re-engineering.  With the “pick and plunk” approach to the HDSA, you can easily outgrow the tightly coupled solution in a matter of months (or even in the first week).
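To illustrate option "b" (loading only the changes), here is a minimal Python sketch of delta checking on load. It is an assumption-laden example, not the official loading pattern: the comparison uses a SHA-256 hash of the descriptive attributes, and the `row_hash` and `load_delta` helpers, the column names, and the sample customers are all hypothetical.

```python
import hashlib
from datetime import date


def row_hash(attrs: dict) -> str:
    """Hash the descriptive attributes in a stable order so changes can be detected."""
    payload = "|".join(f"{name}={attrs[name]}" for name in sorted(attrs))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_delta(satellite: list, staged: dict, load_date: date) -> int:
    """Append only new or changed rows to the satellite; return how many were inserted."""
    latest = {}
    for row in satellite:  # the satellite is append-only, so later rows win
        latest[row["business_key"]] = row["hash"]

    inserted = 0
    for business_key, attrs in staged.items():
        current_hash = row_hash(attrs)
        if latest.get(business_key) != current_hash:
            satellite.append(
                {"business_key": business_key, "load_date": load_date,
                 "hash": current_hash, **attrs}
            )
            inserted += 1
    return inserted


satellite = []
day1 = {"C-1": {"name": "Acme", "city": "Denver"},
        "C-2": {"name": "Globex", "city": "Austin"}}
day2 = {"C-1": {"name": "Acme", "city": "Boulder"},   # changed
        "C-2": {"name": "Globex", "city": "Austin"}}  # unchanged

first = load_delta(satellite, day1, date(2011, 9, 27))
second = load_delta(satellite, day2, date(2011, 9, 28))
print(first, second, len(satellite))
```

The second load stores one row instead of two because the unchanged customer is skipped, and the earlier version of the changed customer stays in place as the audit trail.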

Remember: if you are making “snapshot” copies of the source, you are forced to snapshot all the data across the entire source in order to get an accurate picture on every load cycle, otherwise your snapshots get out of synch and don’t tie together.  This is NOT a scalable architecture.
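The scalability point can be put in back-of-the-envelope terms. The sketch below is illustrative arithmetic only: the row counts are made-up inputs, the source is assumed to stay the same size, and a fixed number of rows is assumed to change per load cycle.

```python
def snapshot_rows(source_rows: int, load_cycles: int) -> int:
    """Rows stored when the full source is copied on every load cycle."""
    return source_rows * load_cycles


def delta_rows(source_rows: int, changed_per_cycle: int, load_cycles: int) -> int:
    """Rows stored when the first load is full and later loads keep only changes."""
    return source_rows + changed_per_cycle * (load_cycles - 1)


# Hypothetical inputs: 1,000,000 source rows, 10,000 changes per cycle, 30 cycles.
full_copy = snapshot_rows(1_000_000, 30)
changes_only = delta_rows(1_000_000, 10_000, 30)
print(full_copy, changes_only)
```

Under these assumed inputs the snapshot approach stores every row thirty times over, while the delta approach grows only with the volume of change.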

He said: The requirements of the frontend-system are: suitable for end-users (simplicity) and an optimal performance.

I say: Great! We agree – this is why I constantly state: the Raw Data Vault (integrated by business key, not simply a “copy” of source system data) is considered back-end, and that we BOTH put data marts (star schema, cubes, flat-wide tables, etc..) on the “front-end” for user access.

HOWEVER… with Frank’s HDSA approach he needs two full copies of the data, that’s right!  TWO FULL COPIES OF THE HISTORICAL DATA.  Why?  Because the raw data lives in the HDSA, BUT it is NOT INTEGRATED BY ANY MEANS!  Therefore – to provide the business with an integrated “data warehouse with history”, he has to copy all the history AND INTEGRATE IT, AND APPLY BUSINESS RULES, and place it in the “Conformed Dimensional Warehouse with All the History”.

Although, one could argue the other way, and say:  Both the HDSA and Raw Data Vault (integrated or not) allow the implementation team to selectively choose what data to “integrate and massage” to put forward in to star schemas and front-ends, and they’d be right.

BUT: HE CONTINUES WITH THE NEXT BULLET (which to me explains why he NEEDS or MUST HAVE two full copies of all historical data)

He said: Our fully integrated dimensionally modeled DWH (based on user requirements) meets those requirements. I do not recognize the problems with dimensional modeling you mentioned at all. We have been using this modeling technique for many years in many environments.

I say:  That’s wonderful.  With a Raw Data Vault in place, followed by a Dimensional model on the front-end, I can achieve exactly the same results (meet requirements).  I’ve also been using the Data Vault modeling techniques for many years, in many environments with great success.

He said: The DV-modeling is not an optimal modeling technique for the frontend, because the DV-data model is more complex for the business users than the dimensional model (requirement 1: simplicity) and the performance of a dimensional model is better than the performance of the DV-data model (requirement 2: performance).

I say:  True, we both agree – I never claimed that the Data Vault was aimed at the front-end…

On the rest of the statement: Get your facts straight.  This argument that you put forward is inaccurate.  The “DV model is more complex for business users than dimensional model”  Yes, true – THAT’S WHY THE DV MODEL IS TARGETED AT BACK-END SYSTEMS!  That’s also why I advocate dimensional models, cubes, and other modeling techniques for release to the business users.  You cannot and should not ever try to compare the Data Vault Model to “star schema model” as a release of data to the business users.  This is just not the way I’ve ever setup the architecture.

Regarding Performance: See my blog entry on normalization…  BY THE WAY, This claim he makes is complete and utter nonsense. Why? Because performance is a function of the environment.  Put a Data Vault Model on a column based data store, then put the star-schema model on a column based data store…  IT WON’T MAKE A BIT OF DIFFERENCE.  A column based data store is JUST THAT – a column based set of tables (one column per table).  In this environment, EACH MODEL becomes a LOGICAL model…  So, to make that claim on performance is complete and utter rubbish – unsubstantiated.  To make this claim true you would first have to agree that both modeling techniques are appropriate and well-suited to ad-hoc querying which is not the case.  The DV model is not, and never has been geared towards ad-hoc querying.

PLEASE CHECK YOUR FACTS FRANK BEFORE MAKING BLANKET ASSUMPTIONS AND INCORRECT STATEMENTS.  Come talk to me after you’ve been involved in performance and tuning for 15 years on systems that range in size from 50 Terabytes to 3 petabytes – I would welcome any discussion on this level.

He said: Conclusion: The DV-modeling technique does not add any value to the backend or frontend of our enterprise DWH architecture.

I say: RUBBISH! Complete and utter rubbish.  Once again, neither part of this statement is true at all.  There is a tremendous amount of value add to the back-end by implementing a Raw DV architecture in an enterprise data warehousing system.  By the way, I define “back-end” in this context as being a non-user accessible data warehouse storage area.  Just a FEW of the value add reasons are listed below:

  • Integrated business keys
  • Separation of data by type / classification and rate of change
  • Understanding of business keys, business process, by IT implementers
  • Ability to perform GAP analysis across the disparate data sets of multiple systems (because of the integration by business keys)
  • Real-time AND batch at high speed performant inserts (because SATS are split by type of data and rate of change)
  • Scalability, Auditability, Flexibility
  • Organic growth of the model over time WITHOUT LOSS OF HISTORY!
  • Loosely coupled data (source system is DECOUPLED from the data warehouse, making it easier to absorb changes and new systems)
  • Standard, repeatable, structures and patterns for load
  • Audit trail based data sets
  • No need to load entire copies of the source on every load cycle
  • separation of KEYS from ASSOCIATIONS / TRANSACTIONS from DESCRIPTORS
  • Sourcing problems solved, timing and source availability solved
  • Parallelism enhanced (through the use of the DV modeling architecture)
  • Performance on MPP platforms is a huge boost
  • Easy to manage data models
  • Understandable data elements
  • “Pull together of data attributes that describe the HUB key/BK or the Relationship / LINK”
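The separation of KEYS from ASSOCIATIONS from DESCRIPTORS listed above can be sketched as three small record types. This is a simplified Python illustration, not a complete Data Vault specification: the field names, the sample keys, and the choice to split the customer descriptors into a slow-changing and a fast-changing Satellite are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Hub:
    """KEYS: one row per unique business key."""
    business_key: str
    load_date: datetime
    record_source: str


@dataclass
class Link:
    """ASSOCIATIONS / TRANSACTIONS: a relationship between two or more Hub keys."""
    hub_keys: tuple[str, ...]
    load_date: datetime
    record_source: str


@dataclass
class Satellite:
    """DESCRIPTORS: attributes describing a Hub key or a Link, versioned by load date."""
    parent_key: str
    load_date: datetime
    record_source: str
    attributes: dict[str, str] = field(default_factory=dict)


loaded = datetime(2011, 9, 28, 2, 0)
customer = Hub("C-1001", loaded, "crm")
order = Hub("O-9001", loaded, "billing")
placed = Link((customer.business_key, order.business_key), loaded, "billing")

# Two Satellites on the same Hub, split by type of data and rate of change.
customer_profile = Satellite(customer.business_key, loaded, "crm",
                             {"name": "Acme", "segment": "Retail"})
customer_balance = Satellite(customer.business_key, loaded, "billing",
                             {"balance": "1250.00"})
print(placed.hub_keys)
```

Because the fast-changing balance lives in its own Satellite, frequent inserts there never touch the slow-changing profile rows, which is the idea behind the high-speed insert and parallelism bullets.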

Value to the front-end:  No, the Data Vault Model should NOT be used as a front-end, but does it add value to a front-end data mart?  YOU BET!

  • You are no longer forced to conform dimensions that aren’t meant to be conformed
  • You can build, deploy, tear-down, and re-build new data marts all day long without losing history.
  • Business keys have already been “attributed” – pulling together all those attributes that make sense to put together in a dimension or a fact table
  • Because grain shifts are handled by Links, less “grain errors” are made in designing the data marts (grain shifts are more visible)
  • Data across multiple source systems has already been integrated, where in HDSA – the data is still disparate, and separated across many “copies” of different source tables
  • Patterns of data are easier to spot in the Data Vault
  • Master Data Sets are easier to create on the front end, why? because the business keys have been analyzed already.
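The "build, tear down, and re-build" point can be illustrated with a short Python sketch. It assumes a hypothetical customer Satellite held as a list of rows, and derives two different dimension shapes from the same history; the row layout and the `build_current_dimension` and `build_history_dimension` helpers are invented for this example.

```python
from datetime import date

# Hypothetical Satellite history for a customer Hub (append-only).
satellite_rows = [
    {"business_key": "C-1", "load_date": date(2011, 1, 1), "city": "Denver"},
    {"business_key": "C-1", "load_date": date(2011, 6, 1), "city": "Boulder"},
    {"business_key": "C-2", "load_date": date(2011, 3, 1), "city": "Austin"},
]


def build_current_dimension(rows):
    """Dimension holding only the latest version of each business key."""
    current = {}
    for row in sorted(rows, key=lambda r: r["load_date"]):
        current[row["business_key"]] = row
    return [current[key] for key in sorted(current)]


def build_history_dimension(rows):
    """Dimension holding every version, with valid_from / valid_to dates."""
    ordered = sorted(rows, key=lambda r: (r["business_key"], r["load_date"]))
    dimension = []
    for index, row in enumerate(ordered):
        following = ordered[index + 1] if index + 1 < len(ordered) else None
        same_key = following is not None and following["business_key"] == row["business_key"]
        dimension.append({
            "business_key": row["business_key"],
            "city": row["city"],
            "valid_from": row["load_date"],
            "valid_to": following["load_date"] if same_key else None,
        })
    return dimension


current = build_current_dimension(satellite_rows)
history = build_history_dimension(satellite_rows)
print(len(current), len(history), len(satellite_rows))
```

Either dimension can be dropped and rebuilt in a different shape at any time, because both are derived views and the Satellite rows themselves are never modified.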

There are more, but I won’t continue this discussion.  It’s time to bring this one to a close.

He says: Let me guess, you do not agree. ;-) The big question is: where do I go wrong in this reasoning?

To explain your insights please answer also the following (most important) questions:

  • In which circumstances is the DV-architecture (including modeling) more efficient than our architecture? And why?
  • Apart from the comparison in terms of efficiency, in which situation (e.g. characteristics of an organization, the data, the environment, etc.) should you use the DV-architecture instead of our suggested architecture? And why? What is the added value?

I say: I’ve already answered these questions in this post, and all the other blog posts I’ve put together.   Please do more reading, and more background checks before making false assumptions and inaccurate statements.

Next up – part 2 of the article which will be a rebuttal to his magazine posting.

Again, IF you have comments about my reactions, or my post – I welcome you to post them here.  Please let me know what you think.

Thank-you kindly,
Dan Linstedt


4 Responses

  1. Erik

    Hi Dan, you got your facts right, that’s for sure.

    The problem with people writing about DV is that most of them are indeed not certified and therefore in most cases assume DV is just another technical data architecture variant, like a HSA, Kimball BUS or CIF.

    They do not know in detail about the DV data modelling standard and the DV methodology, both being part of the concept of DV. That’s why discussions like this are not going to come to a result: it’s comparing apples with pears…

  2. Raphael

    Dan,

    Your entry has another great side effect: a nice “laundry list” of Data Vault bullet points that can be used by all DV practitioners as a “sanity check”. Thanks for taking your time to argumentatively push this guy against the wall.
    As a DV architect and implementer with a number of DVs under my belt, I support every statement in Dan’s blog, or in DV methodology, for that matter, because it WORKS.

    Cheers!

  3. Erik

    For the record: I’m not DV-certified, I do not build DVs, so I will not comment on any specific details in this discussion. However, I have experience in managing teams that use DV as a data modelling technique and methodology, therefore I’ve seen what value DV can bring to both customer and supplier.

  4. [...] The second entry is a public rebuttal to a challenge about the validity of the data vault methodology: http://danlinstedt.com/datavaultcat/datavault-new-response-to-frank-habers-part-1/ [...]
