Hardcore Table Comparisons: Dimensional and Data Vault

There are a lot of comments and questions out there about the Data Vault model, particularly from those who claim to know and understand it.  Yet some of them have a) not engaged me, nor b) taken the certification class.  The reasons for the DV model are buried in the methodology – a lot of the benefits come from using both the data modeling techniques and the methodology.  That’s for another post.  In this post I will focus JUST on the physical table structures, their similarities and differences, so that the physical nature of the structures can be “compared” (so to speak).

Introduction

First and foremost, please don’t forget: the Data Vault Model is a hybrid of both 3rd normal form structures and dimensional modeling structures.  The idea is to “gain the best of both worlds” while leaving behind some of the problems that plague both models when applied as an enterprise data warehouse.  You will see similarities to both.

NOTE: some people believe that I don’t understand dimensional modeling, or that I’ve never built a star schema in my life….  Let me just say: I am a CBIP certified master, and an ICCP master for Data Architecture.  I have over 25 years of experience, and have built close to 100 full enterprise data warehouses with their own star schemas.  On the other hand, if I get anything wrong here (as I can, and do, make mistakes), please feel free to correct me by adding your corrections to the comments below.

Let me just say this: a Dimensional Structure that is loaded with raw data can be compared to a Data Vault enterprise model.  A Dimensional Model that is loaded with business-massaged / changed data cannot be compared to a Data Vault Model – but rather to a Business Data Vault model, or better yet, to another data mart model (which may or may not be dimensional).  What I am going to speak of in this post is a direct comparison of purpose for structures housing raw data.

Dimensional Structures

The dimensional modeling structures have two basic designs: 1) the Fact Table, 2) the Dimension Table.  From there, there are applications of these structures; for example: type 1, type 2, and type 3 dimensions, facts, factless facts, and facts with degenerate dimensions attached.  In order to analyze the dimension structure, each application must be taken into account.

The Type 1 Dimension Structure

The type 1 dimension structure basically says: only 1 copy of the current data (based on business key) can / should exist in this dimension for any and all given points in time.  So, that said, the following holds true:

  1. The business key and surrogate key have a 1-to-1 relationship
  2. The structure contains numerical sequential surrogate keys which are dropped and re-built on every load.
  3. The structure contains descriptive data

The data in the fact tables attached to the dimension (children of this dimension) must undergo the following rules in order to maintain a consistent picture:

  1. The FACT table *MUST* either be updated in place (to reset all the sequence numbers, which is near impossible at orders of magnitude of scale – i.e.: 100M rows in the dimension refreshed with every load), OR the fact table *MUST* be re-loaded to adjust for the new dimensional keys, which could mean a complete loss of history in the fact tables (if facts that existed in the previous load are now orphaned because the current load is missing the dimension key).

In other words, with this type of data model, raw data history doesn’t stand a very good chance of surviving.  This doesn’t bode well for an enterprise data warehouse that is supposed to be historical in nature.
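
To make the re-keying problem concrete, here is a minimal Python sketch (the table names, column names, and helper functions are all invented for illustration, not taken from any particular implementation) of a type 1 dimension whose surrogate keys are dropped and rebuilt on every load, and what that does to fact rows that still reference the old keys:

```python
# Hypothetical illustration of a type 1 dimension rebuild (all names invented).
# Each load drops the dimension and reassigns surrogate keys; any fact row whose
# business key is missing from the new feed becomes an orphan.

def rebuild_type1_dimension(source_rows):
    """Rebuild the dimension: one row per business key, fresh surrogate keys."""
    dimension = {}
    for seq, row in enumerate(source_rows, start=1):
        dimension[row["business_key"]] = {"surrogate_key": seq, **row}
    return dimension

def realign_facts(facts, old_dim, new_dim):
    """Re-key fact rows to the new surrogates; report orphans (history at risk)."""
    old_bk_by_sk = {v["surrogate_key"]: bk for bk, v in old_dim.items()}
    kept, orphaned = [], []
    for fact in facts:
        bk = old_bk_by_sk.get(fact["customer_sk"])
        if bk in new_dim:
            kept.append({**fact, "customer_sk": new_dim[bk]["surrogate_key"]})
        else:
            orphaned.append(fact)        # no dimension row exists in the new load
    return kept, orphaned

# Load 1: two customers; Load 2: customer "B" disappears from the feed.
load1 = [{"business_key": "A", "name": "Acme"}, {"business_key": "B", "name": "Bravo"}]
load2 = [{"business_key": "A", "name": "Acme Corp"}]

dim_v1 = rebuild_type1_dimension(load1)
facts = [{"customer_sk": 1, "amount": 10.0}, {"customer_sk": 2, "amount": 99.0}]

dim_v2 = rebuild_type1_dimension(load2)
kept, orphaned = realign_facts(facts, dim_v1, dim_v2)
print(kept)      # fact for "A", re-keyed to the new surrogate
print(orphaned)  # fact for "B" has no parent row any more
```

Running it shows the fact row for the missing business key has no parent in the rebuilt dimension – exactly the history loss described above.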

Type 2 Dimension Structure

The Type 2 dimension structure basically says: keep the old records in place; when a DELTA arrives, make a complete copy of all the columns and insert an entire new row (with the changes applied); also, insert the NEW rows not yet found in the dimension.  Delta records, again, are recognized by BUSINESS KEY.  So, that said, the following holds true:

  1. The business key and surrogate key have a 1 (BK) to many (surrogate) relationship, causing duplication of the business key data.
  2. The structure contains numerical sequential surrogate keys which are continued on every load.
  3. The structure contains a DATE or DATE TIME stamp of some sort, indicating WHEN this data “begins”
  4. The structure contains a DATE or DATE TIME stamp of some sort, indicating when this data “ends”  (both dates refer specifically to a SINGLE temporal data life-cycle)
  5. The structure contains descriptive data, many times with a majority of duplicate column values once deltas are loaded
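
As a rough sketch of those rules (again, all names here are hypothetical, not from any specific tool), a type 2 delta load might look like this: the current row for a business key is end-dated, and a complete new copy is inserted with a new surrogate and a new begin date:

```python
# Minimal, hypothetical type 2 dimension delta load (all names invented).
from datetime import date

HIGH_DATE = date(9999, 12, 31)   # open-ended "end" for the current row

def apply_type2_delta(dimension, delta_rows, load_date, next_sk):
    """Insert new business keys; end-date and re-insert changed ones."""
    current = {r["business_key"]: r for r in dimension if r["end_date"] == HIGH_DATE}
    for delta in delta_rows:
        live = current.get(delta["business_key"])
        if live is not None and live["name"] == delta["name"]:
            continue                              # no change to descriptive data
        if live is not None:
            live["end_date"] = load_date          # close the old version
        dimension.append({
            "surrogate_key": next_sk,             # sequence continues across loads
            "business_key": delta["business_key"],
            "name": delta["name"],                # full copy of descriptive columns
            "begin_date": load_date,
            "end_date": HIGH_DATE,
        })
        next_sk += 1
    return next_sk

dim = []
next_sk = apply_type2_delta(dim, [{"business_key": "A", "name": "Acme"}], date(2011, 9, 1), 1)
next_sk = apply_type2_delta(dim, [{"business_key": "A", "name": "Acme Corp"}], date(2011, 9, 2), next_sk)
for row in dim:
    print(row)   # two versions of "A": the old one end-dated, the new one open-ended
```

In a real load the comparison would cover all descriptive columns (or a hash of them), but the structural point is the same: one business key, many surrogates, duplicated descriptive values.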

The data in the fact tables attached to the dimension (children of this dimension)  must undergo the following rules in order to maintain a consistent picture:

  1. If the FACT table data arrives BEFORE the dimension data, a “dummy row” must be inserted into the dimension as a place-holder in order to generate the corresponding surrogate key.  Oftentimes, when the dimension data is loaded, the “dummy row” is updated in place, which causes a loss of fidelity for accountability and audit purposes, but allows the SQL to “balance the data properly”.  In other words, if the update occurs, we lose the ability to “see” how often, and when, the fact data arrives before the dimension data.
  2. At the discretion of the implementation team and the designers: Sometimes “old” fact rows, tied to previous type 2 dimension records must have their keys “updated” or changed, to reflect only the CURRENT dimension data.  I would argue that this severely limits the accountability of the raw data that is supposed to live in this system.
  3. The fact table may or may not contain “begin / end” date temporality to coincide with the data life-cycle of the facts, which in some cases causes a “type 2 fact table” to arise.  However, I have only seen this on rare occasions.

With this type of model, updates against raw data sets are still occurring.  OK – we are *NOT* updating user-based data, and that’s a very good thing; however, we are often updating the “when this record appeared” data, and changing the nature of the information.  Furthermore, when an update occurs against the FACT table to “re-align” facts with new dimension keys, we lose nearly “all” semblance of history.  In other words, once the update is finished, we have NO WAY to go back and run “old reports as they appeared yesterday / before the update.”  Remember, I’m talking RAW DATA here, because it’s the only way to compare against the Data Vault.

Type 3 Dimension Structure

The Type 3 dimension structure basically says: keep a limited (finite) set of history for one or more pre-defined columns.  It isn’t used very much these days, because it is difficult to maintain, and difficult to load.  It does however offer a single row for the dimension, with a very limited view (past X data sets) in a horizontal or pivoted form.  In other words, if I want to keep 5 history records of NAME, then I would build 5 columns: NAME_1, NAME_2, NAME_3, NAME_4, NAME_5.  The process would “roll” the data from one history column to the next until the data set is “rolled out” and lost forever.   So, that said, the following holds true:

  1. The Business Key and Surrogate Key have a 1 to 1 relationship
  2. The Structure contains numerical sequential surrogate keys which are continued on every load
  3. The structure contains “PIVOT” columns to house history, causing the structure to become very wide, very quickly.
  4. The Structure usually does NOT contain a “begin or end” date/time
  5. The structure contains descriptive data, with horizontal duplication of data.

The data in the fact tables attached to the dimension (children of this dimension) must undergo the following rules in order to maintain a consistent picture:

  1. The FACT table can ONLY REFLECT the CURRENT FACTS of history, as it has only 1 surrogate key to attach to.  The historical descriptive data may be available, but there cannot be any stored historical fact.

With this type of model, updates against raw data sets are still occurring.  This time, the update process IS updating user data sets (however it is not changing the data itself).  The possibility of error for moving the wrong data to the wrong field increases, but can be mitigated through proper unit testing.  The worst part?  HISTORY IN THE FACT TABLE IS LOST/OVERWRITTEN!!   The history in the fact table attaches to 1 and only 1 surrogate key, therefore a history of the facts cannot be kept.  This (to me) is a very serious oversight for anyone wanting to build an enterprise data warehouse that is accountable.
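
Here is a small, hypothetical Python sketch of the “rolling” behavior described above, assuming five NAME history columns (the column names follow the NAME_1..NAME_5 example; everything else is invented).  Note how the oldest value simply falls off the end:

```python
# Hypothetical type 3 "roll" of pivot history columns (all names invented).
HISTORY_DEPTH = 5   # NAME_1 (current) .. NAME_5 (oldest retained)

def roll_type3_name(dim_row, new_name):
    """Shift NAME_1..NAME_5 right by one slot and store the new current value."""
    if dim_row["NAME_1"] == new_name:
        return dim_row                                  # no change, no roll
    for slot in range(HISTORY_DEPTH, 1, -1):            # NAME_5 <- NAME_4, NAME_4 <- NAME_3, ...
        dim_row[f"NAME_{slot}"] = dim_row[f"NAME_{slot - 1}"]
    dim_row["NAME_1"] = new_name                        # NAME_5's old value is gone forever
    return dim_row

row = {"surrogate_key": 1, "business_key": "A",
       **{f"NAME_{i}": None for i in range(1, HISTORY_DEPTH + 1)}}
for name in ["Acme", "Acme Corp", "Acme Corporation"]:
    roll_type3_name(row, name)
print(row)   # NAME_1='Acme Corporation', NAME_2='Acme Corp', NAME_3='Acme'
```

The single surrogate key never changes, which is exactly why the attached fact rows can only ever reflect the current version.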

Fact Table Structure

The fact table structure basically says: keep every possible “fact” that is numerical or date-time driven (where calculations can be performed), for all history, for all time.  Marry the rows in the fact table to their corresponding dimensions through the dimensional keys.  Furthermore, the grain of the fact table is partly defined by the number of dimension keys that the fact table contains, and partly defined by any aggregation function used upstream of loading the fact table.  However, I get off topic…. when speaking of raw data, there should be no aggregation to speak of when loading a fact table.  Please note: just because you see two of the same defined dimension keys living in a fact table doesn’t always mean that you can simply join them together.  So, that said, the following holds true:

  1. Business Keys are NOT (usually) a part of any fact table structure
  2. Surrogates are defined as foreign key dependencies to their respective parent dimensions.
  3. Generally, there is one (sometimes more) dimension surrogate for EACH dimension used to categorize the facts
  4. The fact table structures can (but shouldn’t) contain descriptive data; this type of data is generally referred to as degenerate dimension data.
  5. The structure may or may not contain its own fact surrogate primary key.
  6. The structure should define, and hold true, that the combination of dimension keys is unique across all rows.

With this type of fact table (the only kind of fact table), the data should stay consistent, untouched, non-updated.  However, as indicated in the dimension type descriptions above, this isn’t always possible to hold true.  Facts are often updated, re-keyed, or altered.  When this happens, it is very likely that the definition (the context) of the raw data is also changed.  I typically don’t like these effects, which cause updates to run against a data warehouse.
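
For illustration only (the column and key names are invented), a raw fact load that honors rule 6 above – uniqueness of the dimension key combination – could be sketched like this:

```python
# Hypothetical raw fact load illustrating the grain rule: the combination of
# dimension surrogate keys is held unique across all rows (all names invented).
fact_rows = {}   # keyed by the full grain: (customer_sk, product_sk, date_sk)

def insert_fact(customer_sk, product_sk, date_sk, quantity, amount):
    grain = (customer_sk, product_sk, date_sk)
    if grain in fact_rows:
        raise ValueError(f"duplicate grain {grain}: fact already loaded")
    fact_rows[grain] = {"quantity": quantity, "amount": amount}

insert_fact(1, 10, 20110901, quantity=2, amount=19.98)
insert_fact(1, 11, 20110901, quantity=1, amount=5.00)
# insert_fact(1, 10, 20110901, 3, 29.97)  # would raise: same dimension key combination
print(len(fact_rows))   # 2 rows, one per unique combination of dimension keys
```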

Dimensional Modeling Best Practice

A typical dimensional model has several “best practices” that are applied:

  1. Type 2 dimensions should always be used
  2. Dummy dimension rows in type 2 dimensions should never be “replaced nor updated”
  3. Fact rows should always remain tied to the dimension row they were originally attached to
  4. Don’t let a dimension get “too wide”; as it becomes wider, the complexity of loading, querying, and performance tuning increases.  Note: I’ve seen raw data dimensions with 166+ fields and 3 to 5 hierarchies embedded – and they are a pain to maintain – even though they are raw data.

Data Vault Structures

In this section, I will discuss the simple structural rules for the Data Vault.  Where possible, I will point out the similarities and differences between the two types of data models.  Again, remember we are talking about raw data in the enterprise data warehouse here, so we can compare (somewhat) apples to apples…  The Data Vault has three physical table structure types: Hub, Link, and Satellite.

Hub Structure

The Hub Structure basically says: tie a single surrogate (for join performance) to a single copy of the business key (again for join performance).  Do not duplicate business keys, do not assign multiple surrogates to a single business key.  Isolate the business key (may be composite), and record when it arrived in the warehouse, and where (what source) it came from.  So that said, the following holds true:

  1. Business keys in the Hub structure have a 1-to-1 relationship with the surrogate keys
  2. Surrogate keys exist in the Hub merely for join purposes (because the RDBMS systems work faster on numerical joins than any other type of join)
  3. The Hub structure contains a SINGLE load date (per row), indicating WHEN the first time the key was loaded to the data warehouse
  4. The Hub structure contains a SINGLE record source (per row), indicating WHERE the data originated.

With this type of structure, facts or transactions that arrive in real-time don’t have to create a “dummy row” as they would in the dimension.  They create / insert a new key+surrogate when it doesn’t exist, then insert directly to a Link.  When the descriptive data arrives later, it is inserted to the Satellite structure hanging off the Hub.  This allows the loading process to avoid being forced into “updating” any rows, which can get expensive in volume situations.  The similarity lies with a type 1 dimension.  Like a type 1 dimension, the Hub table maintains surrogate keys 1-to-1 with the business key.  Like a type 1 dimension, the Hub is the “parent” table for both Satellites and Links (a type 1 dimension is parent to the fact table).  The difference is: the Hub does NOT store descriptive data (which a dimension does).
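
A minimal sketch of an insert-only Hub load follows, assuming hypothetical names (hub_customer, CUST-001, and so on are mine, not part of any standard):

```python
# Hypothetical insert-only Hub load (all names invented): one surrogate per
# business key, recorded once with its first load date and record source.
from datetime import datetime

hub_customer = {}        # business_key -> hub row

def load_hub_customer(business_key, record_source, next_sk):
    """Insert the key only if it has never been seen; never update existing rows."""
    if business_key in hub_customer:
        return hub_customer[business_key]["customer_sk"], next_sk
    hub_customer[business_key] = {
        "customer_sk": next_sk,                 # surrogate exists purely for join performance
        "customer_bk": business_key,
        "load_date": datetime.now(),            # WHEN the key first arrived in the warehouse
        "record_source": record_source,         # WHERE the key came from
    }
    return next_sk, next_sk + 1

next_sk = 1
sk_a, next_sk = load_hub_customer("CUST-001", "CRM", next_sk)
sk_a2, next_sk = load_hub_customer("CUST-001", "BILLING", next_sk)  # already known: no insert, no update
print(sk_a == sk_a2, len(hub_customer))   # True 1
```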

Link Structure

The Link Structure basically says: tie a combination of business keys (relationship, association) together as a fact for a single specific point in time.  Do not duplicate fact row data, do not deal with “changing facts, or updating facts” because dimensional data changed.  When a fact/transaction/relationship arrives, deal with it – insert it directly to the relationship (Link) table if it doesn’t yet exist.  So, that said, the following holds true:

  1. The combination of business keys (replaced by their respective surrogates) has a 1-to-1 relationship with the Link table surrogate key.
  2. The combination of the business key surrogates (composite) is to be held as unique within a Link structure
  3. The Link structure contains a surrogate key (again, defined for performance reasons by relational database systems)
  4. The Link structure contains a SINGLE load date (per row), indicating WHEN the first time the key combination was loaded to the data warehouse.
  5. The Link structure contains a SINGLE record source (per row), indicating WHERE the relationship was loaded from.
  6. The Link structure does not contain descriptive data ** exception: the transactional Link, which is 1-to-1 with a Fact Table definition **
  7. The Link structure does not contain BEGIN or END dates (I ask you this: would you add begin/end dates to your facts in a dimensional model?)

With this type of structure, facts or transactions that arrive in real-time don’t have to create a “dummy row” as they would in the dimension.  They create / insert new combinations of business key surrogates directly into a Link when the data appears on the feed.  When descriptive data arrives later, it is inserted to the Satellite structure hanging off the Link.  This allows the loading process to avoid being forced into “updating” any rows, which can get expensive in volume situations.  The similarity of the Link lies with a FACT table.  Like a FACTLESS FACT, the Link structure is a “join” combination, a set of business key relationships defined to be a specific grain.  The difference is: in the fact table we can and usually do store facts.  In the Link, the facts are generally moved to a child table called a Satellite.
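
A similar, hypothetical sketch of an insert-only Link load (all names invented); the unique combination of Hub surrogates drives whether anything is inserted at all:

```python
# Hypothetical insert-only Link load (all names invented): a unique combination
# of Hub surrogates gets its own Link surrogate, load date, and record source.
from datetime import datetime

link_order_customer = {}    # (order_sk, customer_sk) -> link row

def load_link(order_sk, customer_sk, record_source, next_sk):
    """Insert the relationship only if this key combination is new."""
    combo = (order_sk, customer_sk)
    if combo in link_order_customer:
        return link_order_customer[combo]["link_sk"], next_sk
    link_order_customer[combo] = {
        "link_sk": next_sk,                 # surrogate defined for join performance only
        "order_sk": order_sk,
        "customer_sk": customer_sk,
        "load_date": datetime.now(),        # WHEN the combination first arrived
        "record_source": record_source,     # WHERE the relationship was loaded from
    }
    return next_sk, next_sk + 1

next_sk = 1
lk1, next_sk = load_link(order_sk=5001, customer_sk=1, record_source="ORDERS", next_sk=next_sk)
lk2, next_sk = load_link(order_sk=5001, customer_sk=1, record_source="ORDERS", next_sk=next_sk)
print(lk1 == lk2, len(link_order_customer))   # True 1 -- no duplicates, no updates
```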

Satellite Structure

The Satellite Structure basically says: hold all the contextual data (the descriptive data, the data that changes over time) in a separate, child structure.  Make that descriptive data about the key or key combination (relationship), and nothing but the key (or key combination).  When the data changes, insert a delta record to the Satellite; never insert duplicates.  Furthermore, if certain columns are changing faster than others, then split the columns (vertically partition) into separate Satellites for better performance and less repetition of data.  Most importantly, the columns in the structure can be split or merged at any time without loss of fidelity (OK, there is minimal loss of fidelity across the system-generated columns) but with no loss of fidelity across the user-based data sets.  So, that said, the following holds true:

  1. Satellite structures inherit their “master key” from either a Hub OR a Link.
  2. A Satellite structure’s primary key also contains a Load-Date indicating WHEN the snapshot of this data was loaded to the data warehouse
  3. A Satellite structure’s attributes contain a Load-End-Date indicating when this copy of the data was superseded by a CHANGE to the data set.
  4. A Satellite structure’s attributes contain a Record Source, indicating WHERE the data set was loaded from
  5. Satellite structures may also contain a sub-sequence number (read about it in the book: Super Charge Your EDW @ http://learnDataVault.com)
  6. Satellite structures have 1 and only 1 parent table, leading to a B+ tree structure in the data model design

With this type of structure, the data sets can be loaded, split, and merged at any time in the life-cycle of the data warehouse, without disturbing history, and without needing to “update data” as is the case in the Dimensional Model.  Furthermore, when new systems are added, new Hubs & Links can be added without re-engineering existing loads.  In the dimensional model, adding columns to an existing load (remember: raw data) requires re-engineering.  The existing load (in a dimensional model) must be altered, which means that all the data in the dimension and facts must be re-tested in order to ensure that the change works properly.  This gets more complex with the more columns and the more hierarchies that are added to the dimension.  However, this structure is similar to a dimension structure in that it houses begin / end dates and descriptive data.  It is different from a dimension structure in that it a) is not forced into conformity, b) can be split and merged at any time without loss of history, c) does NOT originate/create the business key / surrogate key relationship (it is a dependent child).
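
Finally, a rough Python sketch of a Satellite delta load (all names invented); a new row is inserted only when the descriptive attributes actually change, so history accumulates without updating user data:

```python
# Hypothetical Satellite delta load (all names invented): a new row is inserted
# only when the descriptive attributes differ from the current row for the key.
from datetime import date

HIGH_DATE = date(9999, 12, 31)
sat_customer = []      # rows keyed by (customer_sk, load_date)

def load_satellite(customer_sk, attributes, load_date, record_source):
    """Compare against the current row for this key; insert a delta if it differs."""
    current = [r for r in sat_customer
               if r["customer_sk"] == customer_sk and r["load_end_date"] == HIGH_DATE]
    if current and current[0]["attributes"] == attributes:
        return False                              # true duplicate: do nothing
    if current:
        current[0]["load_end_date"] = load_date   # system column only; user data is never touched
    sat_customer.append({
        "customer_sk": customer_sk,               # parent key inherited from the Hub (or Link)
        "load_date": load_date,                   # part of the primary key
        "load_end_date": HIGH_DATE,
        "record_source": record_source,           # WHERE the data set was loaded from
        "attributes": attributes,                 # the descriptive, changing data
    })
    return True

load_satellite(1, {"name": "Acme", "city": "Denver"}, date(2011, 9, 1), "CRM")
load_satellite(1, {"name": "Acme", "city": "Denver"}, date(2011, 9, 2), "CRM")   # no change: skipped
load_satellite(1, {"name": "Acme", "city": "Boulder"}, date(2011, 9, 3), "CRM")  # delta: new row
print(len(sat_customer))   # 2 rows of history, no updates to user data
```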

Conclusion

I’m not here to tell you which model you should use; you need to make your own informed decisions about this.  But I am here to help define and deepen the understanding of the Data Vault Model, why I created it, what limitations even raw dimensional models have, and to further dispel any myths or false beliefs about the Data Vault modeling structures.

My conclusions are as follows: from 3rd normal form, I borrowed the many-to-many relationship physical structure.  That became the Link table.  It just so happens to also be known as a factless fact.  From dimensional modeling (the type 1 dimension) I borrowed the notion of 1-to-1 (business key to surrogate) tracking; you might also think of this as very close to 6th normal form – it’s not a complete representation of 6th normal form because there is redundancy at the data level for load dates and record sources.  From dimensional modeling (the type 2 dimension) I borrowed the notion of “data over time in a separate table/structure”; that’s where the Satellite began life – however, the Satellite is fundamentally different in that it is a dependent child table, whereas the dimension (any type) is a parent table to the facts.

The Data Vault is a hybrid approach, and is built / designed to overcome the specific problems/issues that I’ve listed here.  Again, this case is based on raw data warehousing, making the assumption that BOTH METHODOLOGIES have moved the business rules “downstream” or “post-warehouse”.

Please let me know what you think, I’m curious to hear from you.

Cheers,
Dan Linstedt


6 Responses to “Hardcore Table Comparisons: Dimensional and Data Vault”

  1. Marco Schreuder 2011/09/01 at 3:41 am #

    I don’t like all the emphasis on the differences.
    I would like to get the focus on co-existence within a data warehouse.
    There is a claim that 9 out of 10 data warehouses built in the Netherlands are Data Vault based. But the rest of the world?
    I think a claim that 8 to 9 out of 10 are based on dimensional modeling would be justified.

    From a Kimball perspective Data Vault is an implementation of a persistent staging area. Problems that people have with this view seem to boil down to semantics. I personally don’t have a problem with this view. A focus for further acceptance of Data Vault should be that it is the best implementation of a persistent staging area.

    As for the type 1-3 definitions: they are tied to the attributes in a dimension. So you can have several type 1 and type 2 attributes in one dimension. For type-1 attributes you would typically update the dimension. So you don’t drop and rebuild surrogate keys and there is no need to update the fact table.

  2. dlinstedt 2011/09/01 at 4:33 am #

    Hi Marco,

    Thank-you for your comments. The differences and the emphasis on the differences is important to point out. Some people have come forward believing that the Data Vault is not new, but just another method of putting together a star schema. If this were the case, then there wouldn’t be so many differences. Regarding co-existence…. that’s not the purpose of this post. This post is focused on structure comparisons only, not application thereof, and not how to load (although there is a bit of that buried here). However your suggestion is a good one, and I will see about posting another entry with the focus on co-existence.

    Regarding how many data warehouses (ratio / percentage) are built one way vs another, I cannot comment – as I have no mechanism for creating this type of poll, or justifying any sort of answer. This is the first time I’ve heard this ratio. However, in most of the rest of the world, data warehouses have been built on Star Schema / dimensional modeling techniques. Why? Because it’s been around and available to the public a lot longer (when compared to the Data Vault model). However, I am starting to see many more organizations (big ones) who have tried and reached a point of exasperation with their existing dimensional based warehouses – pick up the Data Vault. Although it’s still “slow adoption rates”.

    Now, if you want the Data Vault as a persistent staging area, be my guest… But remember this: I define (as does Bill Inmon) a Data Warehouse to be: time variant, non-volatile, integrated and so on… That means it really isn’t a PSA – but rather a true enterprise data warehouse with Raw Data. What you call it doesn’t matter to me, but I would argue that the label “PSA” does not do it justice, that it is in fact a full raw, accountable, EDW.

    Yes, you can mix attribute types in a dimension – however, this leads to an increase in complexity of the loading and management processes. Testing becomes harder (more test cases to be developed), and accuracy can come in to question. Moreover, as the complexity rises, the performance WILL drop.

    Thanks for the great suggestion, I will work on new posts soon.

    Cheers, Dan L

  3. Daan 2011/09/01 at 12:35 pm #

    I’m not an expert in Data Vault modelling, but if you allow purposefully incorrect data to be loaded or stored in your (enterprise) data warehouse, you allow managers to take wrong decisions. It looks to me as a very bad idea. The Data Vault theory says: business users are not allowed to query the Data Vault. My first response will be: who is paying us for building a data warehouse? Exactly, the business managers do pay us. So, the people who do pay us (often a lot of money), may not use it!? It sounds as a strange idea. I agree with Kimball that a Data Vault has a strong resemblance of an implementation of a Persistent Staging Area (PSA) or if you want a Persistent Data Store (PDS), but not an Enterprise Data Warehouse.

    Maybe people will argue, in the Data Vault architecture there are data marts on top of the Data Vault and business users can query the data marts. But, to fill each data mart with meaningful and correct data business users understand and can trust, you must every time apply the same business rules (because many data marts consist of multiple tables with a different level of granularity about the same subject, representing the same indicators). That is in my view not the way to go. It’s bad for performance and it means that business rules are scattered across the load process of the data marts. Data marts should in my view just aggregate the (transformed) data and nothing more. Okay, maybe to calculate some additional measures where data is needed from more than one fact table.

    I understand that a data vault can be a good solution (there are cheaper alternatives) when it comes to traceability and auditability, but not as a modelling technique for an (enterprise) data warehouse. I’m also curious if there is some research available exploring the costs and benefits of Data Vaults compared to Kimball’s approach.

  4. dlinstedt 2011/09/01 at 1:31 pm #

    [but if you allow purposefully incorrect data to be loaded or stored in your (enterprise) data warehouse, you allow managers to take wrong decisions]

    Clearly I need to define the value of bad-data, not just from an auditability perspective but also from a perspective of Gap analysis, and cleaning up the business. Thank-you for this insight, I will create more blog entries to define this space, and why “good, bad and ugly” data are all needed to make better business decisions.

    [The Data Vault theory says: business users are not allowed to query the Data Vault. My first response will be: who is paying us for building a data warehouse?]

    Ok I disagree with what you say: I believe they aren’t paying us to build a data warehouse, they are paying us to build a Business Intelligence Solution (hopefully lending return on investment). A Business intelligence solution offers more than just the data warehouse – it also includes the ability to find & fix data quality problems, business process problems, as well as provide better information for business decision making. The full solution INCLUDES a data warehouse component in order to store historical data. The business wants & needs BOTH the ability to audit the data sets, AND to make better business decisions. If the business did not need both, then the industry would have stuck with “star schema dimensional analysis” that did not include historical data sets from a warehousing perspective.

    There is gold in understanding and analyzing “bad data” – figuring out WHY it’s bad, what causes it to be bad, when & how often is it being produced? What percentage is “Bad?” Furthermore, what is “bad today” by today’s business rules isn’t necessarily bad tomorrow – especially if the business changes its mind. It’s at that point, when the business changes its mind, that if you DON’T have any of the raw historical data stored, that you have to go back and a) reconstruct it (difficult to impossible), b) try to find it on back-up tapes (very time consuming) and if this includes external data, could be next to impossible.

    [I agree with Kimball that a Data Vault has a strong resemblance of an implementation of a Persistent Staging Area (PSA) or if you want a Persistent Data Store (PDS), but not an Enterprise Data Warehouse.]

    Ok – you can use the Data Vault model as a PSA… But my definition of “data warehouse” is what is different here. My definition has the following characteristics: non-volatile, time-variant, integrated by business key, consistent. So, under this definition, my “PSA” IS a Data Warehouse… Don’t confuse the term “Data Warehouse” with “Business Intelligence Solution” or “Data Warehouse Solution” – these are very different things.

    In fact, on page 15 – figure 1.1 of Kimball’s book: Data Warehouse Lifecycle Toolkit, Kimball clearly shows that “The Data Warehouse” is a LOGICAL thing, not a PHYSICAL thing. His claim is that “The Data Warehouse” is a conglomeration of “Data Marts” – making up the enterprise data store. I completely disagree. My definition of “Data Warehouse” is that it IS physical. I would quote my good friend Bill Inmon here: “A school of minnows swimming around together, does not equal a whale.” In fact, if you read the rest of the paragraphs on pages 14, 15, 16, 17, and 18 – you would be hard pressed to actually FIND a true definition of DATA WAREHOUSE. He defines a staging area, presentation server, dimensional model, business process and data mart.

    Not until he reaches page 19 does he “define” his vision of a data warehouse: “The queryable source of data in the enterprise.” What? This means that OLTP systems could be considered “THE DATA WAREHOUSE”. He goes on to say: “The data warehouse is nothing more than the union of all the constituent data marts.” There is NOTHING in his definition that discusses or even mentions time, historical data, consistency, accuracy, or otherwise. He does not have a definition for a physical data warehouse. In his mind (according to his book), the Data Warehouse is a “conglomeration” of data marts….

    [But, to fill each data mart with meaningful and correct data business users understand and can trust, you must every time apply the same business rules (because many data marts consist of multiple tables with a different level of granularity about the same subject, representing the same indicators). ]

    I believe this is an incorrect assumption. First, Kimball’s definition of Data Warehouse surprisingly does not include any statements about accuracy, trust, or quality of data. Second, there are hundreds of ways to build and re-use logic, or business processes that are embedded in ETL. If the ETL tool can’t or doesn’t allow RE-USE of business logic, then it may be possible that the wrong tool is being utilized. These days, most ETL tools allow common definition of logic in a metadata store, to be re-used over and over again. Furthermore, IF you are conforming a dimension – why would you need to “reuse the logic that loads that one dimension?” Correct me if I’m wrong, but the purpose of a conformed dimension is to provide a SHARED data resource. This would mean that only a single ETL routine is needed with common business logic to fill and populate it.

    Kimball goes on to say: when you load a dimensional model, you should load the lowest possible granularity FIRST with as many possible conformed dimensions. THEN create aggregates and other point-solutions that feed from the conformed dimensions and lowest level of grain.

    IF you follow the Kimball rules, then loading a single conformed / federated star schema (dimensional model) is possible given a SINGLE set of ETL logic.

    [That is in my view not the way to go. It’s bad for performance]

    IF this were bad performance / bad technique, WHY would Kimball suggest a persistent staging area to be used in the first place? Having a PSA serves as part of the Business Intelligence Solution set – which means, you do exactly the same thing when loading conformed / federated star schemas from a PSA. Furthermore, please read today’s post about parallelism, mathematics, joins, and performance – it should answer some other questions.

    [I understand that a data vault can be a good solution (there are cheaper alternatives) ]

    Please define the cost in real numbers. Can you provide us with some examples of the Total Cost of Ownership, how you are calculating this, and why the Data Vault is “more expensive” than other alternatives?

    I am very interested to see hard-numbers on this, that would indicate that the Data Vault is indeed either cheaper, or more expensive according to your claims. The point is, there are so many different ways to measure total cost of ownership. There isn’t one single universally acceptable formula by which we can accurately compare (nor make such claims). In the end, what works for the customer is what’s important. In the end, if what the customer has IS WORKING, then don’t change it. In the end, people are naturally resistant to change – as well as resistant to things they don’t understand. They do their best to “knock things down” without a full basis of understanding, and without a full justification of the inner workings.

    What I will say is this: there are many companies who are using Data Vault models and methodology successfully and are happy with it. Likewise, there are many companies using the Kimball Star Schema and conformed dimension approach and are happy with it. But…. when a company ISN’T happy with either 3rd normal form, NOR dimensional modeling (for whatever reason it ISN’T working for them), where can they turn?

    They need something different to try – because they’ve exhausted the alternatives. The Data Vault Model and Methodology provides them with another approach, a different approach – suffice it to say, I know of companies who have succeeded with Data Vaults where they have failed (time and time again) with the Kimball Dimensional Star Schema approach.

    Don’t get me wrong, by no means am I suggesting that they “got it right” when they tried the Kimball approach, they may very well have missed some critical components. What I am saying is that they WANT something different… Why not offer them a viable and proven alternative? Where’s the harm in that?

    [but not as a modelling technique for an (enterprise) data warehouse. I’m also curious if there is some research available exploring the costs and benefits of Data Vaults compared to Kimball’s approach.]

    You and I clearly disagree on cost items, and until you can come to the table with hard-facts about cost, please refrain from making such bold claims about what is cheaper, and what is more expensive.

    Cheers,
    Dan Linstedt

  5. Ronald Damhof 2011/09/03 at 10:00 am #

    @Daan; why comment on this blog post if you do not know Data Vault, have no experience with it, and have clearly never architected/built one? I am so baffled by these kinds of ‘experts’. I welcome any expert with good, sound criticism, but to just throw in unsubstantiated blabla baffles me. You might want to research the topic a bit more before you engage in meaningful conversations…..

  6. Marco Schreuder 2011/09/05 at 4:25 am #

    Hi Dan,
    As a clarification to my earlier remarks I wrote a blogpost on blog.in2bi.com with the title: Dimensional modeling and Data Vault – a happy marriage?

    Feel free to use (parts of) my text. I’m very interested in your comments.
