
Is Star Schema modeling aligning to #datavault?

Published on 2013/02/07 by Dan Linstedt in Data Vault, News

You decide. I recently read a new “tip” on the Kimball site regarding “new types” of slowly changing dimensions. I received an email from my good friend Kent Graziano about the tip, reminding me that Bill Inmon wrote about this 20+ years ago in his original writings, and that I had added it to Data Vault modeling 15+ years ago when I published the standard.

If, in fact, Kimball and camp are aligning Star Schemas toward Data Vault modeling, I guess that’s a good thing, as copying is the sincerest form of flattery. On the other hand, I’d like to hear your opinions – what do you think Kimball’s post really says? Does it align with Data Vault Modeling or not?

Of course there are still fundamental questions and issues to deal with, like: how does Kimball truly define a “data warehouse”? And where exactly should the business rules live? I know my views on the subject, but I’d like to find out more from you. Please add your comments and thoughts to the end of this post.

DISCLAIMER: Please remember: 1) I am biased; 2) I am specifically referring to the use of star schema modeling as an EDW in this context – I am NOT referring to the use of star schema modeling for production and release to the business users. I do believe in star schemas for data release to business users, just not for use as a back-end EDW.

Where’s the post?

http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/

Overlap? What overlap?  Are these truly new innovations?

Kimball says:

With type 0, the dimension attribute value never changes, so facts are always grouped by this original value

In the original specification of the Data Vault methodology I clearly define the use of RAW data sets that are never changed by business rules. One of the impacts of following this method is that the data cannot be conformed, altered, or changed in any way. However, nowhere in his definition of a “type zero” does he even remotely suggest that the business rules should be moved downstream. I would tend to say that if you truly want a “type zero” kind of thing, you should really be considering Data Vault modeling in the first place. At the very least, you will now be required to move the business rules downstream; otherwise the descriptive data that is “changed/munged/altered and conformed” will change over time – making a type zero nearly impossible to achieve.
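To make that point concrete, here is a minimal Python sketch (the attribute, the mapping rule, and all names are hypothetical illustrations of mine, not Kimball’s definition or the Data Vault standard): if descriptive data is conformed on the way in and the conformance rule is ever revised, the stored “original value” drifts between loads; storing the raw value and applying the rules downstream is what keeps a type 0 attribute stable.

    # Minimal sketch: why upstream conformance undermines a "type 0" attribute.
    # All names and rules here are hypothetical illustration, not part of any standard.

    def conform_region(raw_region: str, rule_version: int) -> str:
        """A business rule that gets revised over time (v1 vs. v2)."""
        mapping_v1 = {"N.E.": "Northeast", "NE": "Northeast"}
        mapping_v2 = {"N.E.": "New England", "NE": "New England"}  # rule changed later
        mapping = mapping_v1 if rule_version == 1 else mapping_v2
        return mapping.get(raw_region, raw_region)

    source_value = "N.E."

    # Star schema loaded with upstream business rules:
    type0_first_load = conform_region(source_value, rule_version=1)         # "Northeast"
    type0_after_rule_change = conform_region(source_value, rule_version=2)  # "New England"
    print(type0_first_load == type0_after_rule_change)  # False -> "never changes" is broken

    # Raw (Data Vault style) storage, with rules applied downstream:
    raw_stored = source_value  # stored untouched, so the original value truly never changes
    print(raw_stored)          # "N.E." on every load; conformance happens in the mart layer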

Kimball says:

The type 4 technique is used when a group of dimension attributes are split off into a separate mini-dimension. This approach is useful when dimension attribute values are relatively volatile.

Did I miss something here?  This is the standard method of Data Vault modeling, and has been there from the beginning in 2000/2001 when I released the Data Vault modeling standards.  Splitting Satellites by type of data and rate of change has always been a good practice (a minimal sketch follows the list below).  Now he finally suggests it for Dimensional modeling.  Ok, no problem – it is a good advancement.  But yet again, I have to ask the questions:

  • IF you are changing data on the way IN to the Dimensional Warehouse, HOW then would you track changes back to a Mini Dimension?
  • And even harder to answer: WHERE and WHAT represents the business key in this “mini dimension”?  In other words, if you don’t have a business key identified, then it is nearly impossible to go back and “end-date” the old record and activate the new one.  In fact, in his example, he doesn’t even show the “temporal” aspects of data change.
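For contrast, here is a minimal sketch of the Satellite split referenced above (hypothetical structures and names, an illustration rather than production code): descriptive attributes hang off a single Hub keyed by the business key, split into Satellites by rate of change, and a change is absorbed by end-dating the prior Satellite row and inserting a new one – exactly the step that becomes guesswork without an identified business key.

    from datetime import datetime

    # Hypothetical in-memory sketch of a Hub with two Satellites split by rate of change.
    hub_customer = {}   # business key -> hub surrogate key
    sat_slow = []       # rarely changing attributes (e.g., name)
    sat_fast = []       # frequently changing attributes (e.g., segment)

    def ensure_hub(business_key: str) -> int:
        """Hubs only ever receive new business keys; existing keys are never altered."""
        if business_key not in hub_customer:
            hub_customer[business_key] = len(hub_customer) + 1
        return hub_customer[business_key]

    def load_satellite(sat: list, hub_key: int, attrs: dict, load_ts: datetime) -> None:
        """Delta-driven load: end-date the current row only when the attributes change."""
        current = next((r for r in sat if r["hub_key"] == hub_key and r["end_ts"] is None), None)
        if current and current["attrs"] == attrs:
            return                       # no change, nothing to load
        if current:
            current["end_ts"] = load_ts  # end-date the old version, keeping history
        sat.append({"hub_key": hub_key, "attrs": attrs, "load_ts": load_ts, "end_ts": None})

    hk = ensure_hub("CUST-1001")  # the business key drives end-dating and activation
    load_satellite(sat_slow, hk, {"name": "Acme Corp"}, datetime(2013, 1, 1))
    load_satellite(sat_fast, hk, {"segment": "Gold"}, datetime(2013, 1, 1))
    load_satellite(sat_fast, hk, {"segment": "Platinum"}, datetime(2013, 2, 1))  # history kept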

As if it weren’t hard enough to track changes, he continues by introducing Types 5, 6, and 7.

Kimball says:

The type 5 technique builds on the type 4 mini-dimension by embedding a “current profile” mini-dimension key in the base dimension that’s overwritten as a type 1 attribute. This approach, called type 5 because 4 + 1 equals 5,

What?  What kind of serious logic is this? You must be joking – using simple addition to justify changes to an architectural design?

Adding 4 + 1 = 5 is mundane.  What about the serious definition?  The architectural considerations?  The risk of snowflaking, which has been shown to be problematic to begin with (bad for performance, bad for design, bad for architecture, etc.)?  Now, to apply a band-aid to an ailing “modeling technique for data warehousing,” he simply says do a little simple addition, and voila – you have a new dimensional type.

He continues:

The ETL team must update/overwrite the type 1 mini-dimension reference whenever the current mini-dimension changes over time. If the outrigger approach does not deliver satisfactory query performance, then the mini-dimension attributes could be physically embedded (and updated) in the base dimension.

Really?  You’ve got to be joking…

He just contradicted himself on the purpose of a mini-dimension that is “type 5” – it cannot exist according to the statements he’s made.  Let me explain.  First, he said: use a Type 4 for the “faster changing information” (the mini-dimension), and use a type 1 for the main dimension.  Then he said: the outrigger is a type 1 mini-dimension and should be overwritten (meaning no history).  WELL FOLKS, WHICH IS IT?  He then goes on to suggest denormalizing (re-combining the type 4 + type 1) back together again to re-form a type 3 dimension if performance doesn’t work…  Wow, so in other words there is no new innovation here at all, and it would be nearly impossible to construct the correct “mini-dimension” given the multiple definitions.

Breaking it down:

  1. Type 4 mini-dimension is updatable, and is broken out because of rapidly changing attributes
  2. Type 4 mini-dimension should really be a type 1 mini-dimension with a type 4 main dimension – making this a type 5
  3. Type 5 is really a type 1 mini-dimension with a type 4 main dimension – that is de-normalized back together at the first sign of performance problems (ending up right back where you started with a huge dimension and copying attributes that change frequently).

The Data Vault Model, in contrast, is based on mathematical study of data, the rates of change, and the classifications of data.  This study also includes the study of horizontal and vertical partitioning of data for performance and parallelism reasons – hence its close alignment with MPP systems (think Big Data and NoSQL).  The Data Vault Modeling paradigm is not arbitrary and does not simply “append” styles just to meet the needs of true enterprise data warehousing; no, the Data Vault Modeling paradigm is built on a solid foundation of tried and true methods and best practices.

On the one hand it’s good that Dr. Kimball is trying to innovate – for that I applaud his efforts.  On the other hand, to show such a lack of concern for the nature of the data itself and merely suggest “new types” of design patterns for an already troubled architecture is quite disconcerting.  I fear that these “advancements” will only feed the fires of EDW troubles more quickly.

Kimball said:

Type 6 builds on the type 2 technique by also embedding current attributes in the dimension so that fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value. The type 6 moniker was suggested by an HP engineer in 2000 because it’s a type 2 row with a type 3 column that’s overwritten as a type 1; both 2 + 3 + 1 and 2 x 3 x 1 equal 6. With this approach, the current attributes are updated on all prior type 2 rows associated with a particular durable key, as illustrated by the following sample rows:

Wait a minute…  he just told you to put measurements or facts in a dimension.  If this is true, then what’s the point of assigning monikers like dimension and fact?  If this is true, then he’s just contradicted the standards and the definitions (not to mention destroying the patterns) for putting specific types of attributes under specific labels.

I don’t know if you caught it, but he states IN THIS PARAGRAPH that the attributes are updated on all prior type 2 rows…  Now, if I return to the base definition of an enterprise data warehouse, I believe that Inmon defines the data set as NON-VOLATILE.  That should mean that there are NO UPDATES to existing data sets of user-based data!!  He just suggested that an update (which changes data values of data sets that arrived from the source systems) be executed against user-based data.  At that point, if you follow this rule, you DO NOT have an enterprise warehouse.  You HAVE a Data Mart, where the information is subject to change.

He is updating the “Current Strategy” column, which is ok – he’s maintaining auditability in the historical column – but then again, from a big data perspective, how good will the performance be over time?  The speed of the updates will continue to degrade as the data set grows.

For this case, proper sense would dictate that you build a new “higher level type 3 dimension” and fact table combination on top of the low-level detail.  This higher level of grain would house the current assignments and overrides, eliminating the need for updates entirely, and also eliminating the need for “current row indicators.”
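As a rough illustration of that alternative (hypothetical rows and names only, a sketch rather than a prescribed design): leave the type 2 history untouched and derive the current attributes per durable key at read time, instead of updating every prior row.

    # Sketch: derive "current" attributes at read time instead of updating prior type 2 rows.
    # Hypothetical dimension rows; "durable_key" identifies the member across versions.
    dim_rows = [
        {"surrogate_key": 1, "durable_key": "C1", "strategy": "Basic",   "row_effective": "2012-01-01"},
        {"surrogate_key": 2, "durable_key": "C1", "strategy": "Premium", "row_effective": "2012-07-01"},
    ]

    def current_attributes(rows):
        """Latest row per durable key, computed when queried; no UPDATE touches history."""
        latest = {}
        for row in sorted(rows, key=lambda r: r["row_effective"]):
            latest[row["durable_key"]] = row
        return latest

    current = current_attributes(dim_rows)
    print(current["C1"]["strategy"])  # "Premium" -- without rewriting the 2012-01-01 row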

Finally, Kimball introduces:

With type 7, the fact table contains dual foreign keys for a given dimension: a surrogate key linked to the dimension table where type 2 attributes are tracked, plus the dimension’s durable supernatural key linked to the current row in the type 2 dimension to present current attribute values.

Type 7 delivers the same functionality as type 6, but it’s accomplished via dual keys instead of physically overwriting the current attributes with type 6. Like the other hybrid approaches, the current dimension attributes should be distinctively labeled to minimize confusion.

Ok, interesting.  So, you can query by “active label” versus “previously assigned label” or dimension row.  This one almost makes sense.
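Here is a small sketch of how the dual-key idea plays out (hypothetical data and names, assuming a simple type 2 history): each fact row carries both the as-of surrogate key and the durable key, so the same facts can be grouped by the value in effect at the time of the measurement or by today’s value, with no updates to history.

    # Sketch of the dual-key (type 7 style) idea with hypothetical data.
    dim_history = {   # surrogate key -> attributes in effect when the fact occurred
        1: {"durable_key": "C1", "label": "Bronze"},
        2: {"durable_key": "C1", "label": "Gold"},
    }
    dim_current = {"C1": {"label": "Gold"}}   # durable key -> current attributes

    facts = [
        {"amount": 100, "dim_sk": 1, "dim_durable": "C1"},  # occurred while the label was Bronze
        {"amount": 250, "dim_sk": 2, "dim_durable": "C1"},
    ]

    as_of   = [(dim_history[f["dim_sk"]]["label"], f["amount"]) for f in facts]
    current = [(dim_current[f["dim_durable"]]["label"], f["amount"]) for f in facts]
    print(as_of)    # [('Bronze', 100), ('Gold', 250)]  -- grouped by the value in effect
    print(current)  # [('Gold', 100), ('Gold', 250)]    -- grouped by today's value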

Conclusions and Summary

In my humble opinion Dr. Kimball is combining (unsuccessfully) two major functions that shouldn’t be combined:

  • Storage, History, Non-volatility (Data Warehouses)
  • Presentation, Preparation and Release (Data Marts)

He’s trying too hard to get the data modeling architecture of star schemas to “bend” and meet the needs of the enterprise data warehouse.  He has introduced massive complexity into his systems, and the ETL routines to keep all this straight will end up as spaghetti code within the second or third project iteration.

A life lesson…  I was always taught to take a big problem in life and break it into bite-sized chunks to solve it, and not to force-fit solutions or build band-aids as short-term stopgap measures.  This is why I hold fast to the belief that the data warehouse IS PHYSICAL and IS NOT the same as a data mart layer.  That view gives me the opportunity to separate the “jobs” and responsibilities of presentation from the jobs and responsibilities of historical storage and integration by business key.

In my opinion, again, I applaud Dr. Kimball for attempting to innovate; only in this case, I believe these innovations will prove to be ill-fated prescriptions for those who follow the advice and continue mixing data warehouses with data marts in the same architectural layers.

Please let me know what you think of all this.  Do you think it will work?  Do you see benefits to his approach? Do you see the overlap with the Data Vault Modeling Techniques?

Thanks,

Dan Linstedt

 

17 Responses

  1. sanjaypande

    Minor correction Dan …

    “Wait a minute…  he just told you to put measurements or facts in a dimension.”

    What they’re calling type 6 doesn’t suggest putting facts in the dim.

    What they’re saying is they use a type 2 row (copy and close previous row)
    with a type 3 (copy previous column value to new column).

    The ETL routine for this will be a bastard and if it ever gets corrupt,
    you may as well rebuild the damn thing from scratch.

    So a combination of 2 and 3 is apparently 6. That’s funny on many levels.
    2 and 3 have different purposes. One is row based, one is column based so
    it’s not really a great idea.

    What’s REALLY funny is their attempt to try and over-complicate things for a problem that has already been solved by the Data Vault. The star schema does have its place (in systems that use cubes and MDX and even on the front-end layers), but the generalized notion that a star schema can solve every problem in the DW space has caused all these creative monstrosities that are the prime reason for BI failures.

    I’m surprised people are still falling for it. Then again, we all know popular doesn’t necessarily mean good.

  2. The main thing that struck me about all these new SCD types is that if you are using Data Vault, which keeps all the history without change over time, you can project (multiple) virtual dimensions that show the data in whatever form or context your users want. There is no need to mix and match Type 0,1,2,3 to get 4,5,6,7 and convolute your ETL in the process. If the users want to see dimensional data with current values and original values, you can write a view to do that. If they want original values only, you can write a view for that.

    There is no need for a mini dimension to save space, which also adds an object to the reporting layer. You use the split satellite approach for storage reasons, and pre-join that data in a view so the user only has one dimension to deal with. They don’t care about the storage. That is the data warehouse’s problem! Why make them choose from two Customer dims?

    And for those that say the views will be too slow – try it first. Tune your SQL, tune your database. Then, if it is really too slow, create a materialized view or do some simple ETL from the view to a real table.
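    Here is a minimal sketch of that view idea using SQLite (the table and column names are just illustrative, not from any particular implementation):

        # Rough sketch: one "virtual dimension" as a view over a Hub and its split Satellites.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE hub_customer (hub_key INTEGER PRIMARY KEY, customer_bk TEXT);
        CREATE TABLE sat_cust_name   (hub_key INTEGER, name TEXT,   load_ts TEXT, end_ts TEXT);
        CREATE TABLE sat_cust_status (hub_key INTEGER, status TEXT, load_ts TEXT, end_ts TEXT);

        -- Pre-join the Hub and its split Satellites (current rows only),
        -- so the report author sees a single Customer dimension.
        CREATE VIEW dim_customer_current AS
        SELECT h.hub_key AS customer_key, h.customer_bk, n.name, s.status
        FROM hub_customer h
        LEFT JOIN sat_cust_name   n ON n.hub_key = h.hub_key AND n.end_ts IS NULL
        LEFT JOIN sat_cust_status s ON s.hub_key = h.hub_key AND s.end_ts IS NULL;
        """)

        conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-1001')")
        conn.execute("INSERT INTO sat_cust_name   VALUES (1, 'Acme Corp', '2013-01-01', NULL)")
        conn.execute("INSERT INTO sat_cust_status VALUES (1, 'Active',    '2013-02-01', NULL)")
        print(conn.execute("SELECT * FROM dim_customer_current").fetchall())
        # [(1, 'CUST-1001', 'Acme Corp', 'Active')]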

    I agree with Dan too, all that updating of the more complex dimension types is eventually going to slow down your load process. Can’t really fathom updating every row in a 1 billion row dimension.

  3. I found that those additions make full sense, and I have already used some of them. You completely misinterpreted type 0. It has business rules and will be reloaded at every business change, but it will not track data changes. I used it at least once for geolocation by IP. It was too expensive to do it all the time, so I did it once for every customer and then assumed no location change. Perfect? Of course not. It could have been far better with a bit of work, but that was by no means a priority.

    In the Kimball paradigm, the data warehouse is the set of data marts. Maybe it’s not the way you want to see it, but I’m sure you know that. It’s all about business and getting things done.

    On the contrary, you focus on the storage and archival part with a heavy technical view. Business rules and usability are second-class citizens (“downstream”).

    My business is to get things done and create value for customers. If I want to make a beautiful EDW, sure it would have a DV layer below a Kimball layer. That would be awesome. But that never was the best way to create value.

  4. Thank-you for your comments Sebastien, I do appreciate the candid feedback.

    I want to point out the following:

    You state: “that would be awesome, but that was never the best way to create value…” I would have to disagree with you there. If the Data Vault model and methodology did not create value for customers (on multiple levels), then it would not enjoy the success or the visibility that it sees today. Many of the customers who have built or are building Data Vault models (and following the methodology) are doing so BECAUSE their current “federated star schema” style warehouse has fallen apart, become too complex, or become too costly to maintain – especially in the area of adopting changes for the business, because those changes require continuous and additive re-engineering efforts.

    The cost of that old “cycle” becomes unbearable, and thus the customer ends up shutting down the project, re-naming it, and starting it again (because they need a new manageable cost structure, and they still need BI).

    Anyhow, I think you missed a couple of my points – some of my main points in this article are to show where Kimball is clearly contradicting himself in how he defines data warehouses. Another point I was trying to make (perhaps I did not do so well) was to show how Kimball is attempting to “adapt” and band-aid an architecture to make it fit (square peg in round hole) for Data Warehousing. Another point I was trying to make, is that these “adaptations” look and feel a LOT like what the Data Vault Model laid out (reasons and justification wise) in 2001. Although the Data Vault modeling constructs do not have so many complex “types or layers” to manage.

    The Data Vault is simple in that regard: Hub, Link, Satellite – and it seeks optimization of both the ETL and the modeling paradigms.

    Again, thanks for your feedback – I will take your thoughts into consideration.

    Cheers,
    Dan Linstedt

  5. Sanjay is right, I made a mistake – sorry about that folks. Thanks for pointing that out Sanjay. If anyone else sees a mistake, let me know, and I’ll take a closer look at it.

    Cheers,
    Dan L

  6. sanjaypande

    @Sebastian – The definition of type 0 is a contradiction as Dan has stated and you’ve even given an example of it in your comment. It states the dimension attribute value NEVER changes.

    This can ONLY be accomplished by not having business rules before it. If you’ve done a reload due to a business rule change, you’ve violated the definition itself AND broken audit.

    In my opinion (and experience) it’s impossible for any Kimball style DW to withstand the test of time. Maintenance is too expensive and eventually the system breaks and creates a “life cycle” for the DW and it eventually comes with an expiry date.

    In the DV paradigm only data expires; the DW never does. This can sort of be achieved with a persistent staging paradigm, but the majority of the hard work still has to be redone, and it really doesn’t provide much value when compared to a DV.

    “It’s all about business and getting things done.”

    And what makes you think that the DV is not about the business? In fact, at the core of the methodology it’s ALL about the business – much more than any Kimball-style implementation can ever be.

    “On the contrary, you focus on the storage and archival part with a heavy technical view. Business rules and usability are second-class citizens (“downstream”).”

    The DV has clearly defined boundaries for the DW and the presentation areas. The reasons for separating the storage and archival (and more) from the presentation areas are deliberate since they enable rapid delivery to the customer and reduce or eliminate the maintenance cycles in the DW which is the biggest portion.

    “My business is to get things done and create value for customers. If I want to make a beautiful EDW, sure it would have a DV layer below a Kimball layer. That would be awesome. But that never was the best way to create value.”

    A DV is an enabler for the business and for the technical team to deliver more predictable results faster. The beauty is in its patterns and simplicity, which have stood the test of time and enabled very rapid front-end deliveries, which can be star schemas and in most cases are.

    That is surely more valuable than the multiple iterations and all the maintenance and testing you have to do to extend a Kimball style model and even enable simplification.

    Because the storage, audit, CDC and integration points are already taken care of, you have the flexibility to simplify the front-end. That is extremely powerful because you can deliver to the business much faster even if they change the business rules as you never have to go back to the source to reload the set.

    Creating something quickly just to have it break very soon is not synonymous with “creating value”. It’s the opposite.

    Having a DV layer below your Kimball layer will create much more value (that actually lasts) for your customers than not having it. Just make sure you simplify the front-end since you’ve already done the work in the DV.

  7. One more little clarification for this thread – Margy Ross actually authored the article that Dan referred to. I assume she is the primary co-author for the next edition of the Toolkit book as well.

  8. “And what makes you think that the DV is not about the business. ”

    Let’s take a simple example. I’m a web game company, WebA, with a DV that has a client hub using an accountid (number) BK. I buy another web company, WebB, and I have to add another client hub (WebClient) for their clients’ BK. You can probably say it only takes 15 minutes with an automated tool because a DV can be so easily automated. Now what? Is it easy to do? Yes! Clean? Yes!

    But you have achieved nothing at all at this point for the business user. You did manage to get CDC in place (well, except if the CDC side is complicated *), but you now have a constellation of tables to join. Is any report now consolidating WebA and WebB? NO!

    You end up with one DV containing two disparate galaxies. Sure, you can add some same-as links between the WebA client and WebB client hubs. But the DV methodology will not make it happen by magic.

    Real data warehousing work starts now.

    A vault is important for a rich man, but it will not make a man richer.

    BTW, I posted a small description of the methodology I use on my blog, if you want to get more understanding of how I see things.

    * How do you do CDC with a big source table that doesn’t even have an update timestamp and no way to touch the database to put triggers or read the logs?

  9. Sebastien,

    you are missing a few key points:

    1) the Data Vault is for back-end data warehousing, not for production release to business users directly
    2) CDC has nothing to do with the Data Vault. Sure, the load processes to the Data Vault activate CDC on the way in (by definition, the nature of the Satellites provides this), BUT in ANY good data warehouse, CDC up-stream on sources is a must, especially as volumes grow. CDC arguments are irrelevant in this discussion.
    3) The Data Vault brings business value by lowering complexity, separating (de-coupling) the business changes from the sourcing changes, and lowering maintenance costs.
    4) A Data Mart is supposed to service the business user, NOT the data warehouse. Flexibility of changing, re-creating, dropping and building new data marts (based on a solid back end data warehouse) is what helps bring agility to the table.
    5) With a Data Vault you do NOT necessarily have disparate galaxies – I don’t know where you got your information, but this point you make is entirely incorrect. Data Vaults integrate data by business key as long as the business key is defined by the same semantic definition and the same grain. This is just good data modeling practice. It should be the SAME problem with Dimensions (if you are conforming them), so your argument here is a moot point – it cancels itself out. You MUST do the same thing to conform a dimension, otherwise you end up creating disparate dimensions. The same exact logic applies to both modeling paradigms.
    6) Real Data Warehousing work is ALWAYS in the cleansing, standardizing, and presentation of the data to the business, making it usable to the business is WHAT we are doing when we construct data marts. This is AGAIN why we separate the sourcing layers (in to the Data Vault) from the presentation layers (in to the data marts).
    7) Nothing ever happens by magic – and unless you can point to “magic” for the Federated Star Schema based data warehouse, I’m afraid this point is moot as well, your argument / comparison does not hold water.

    Thank-you kindly,
    Dan Linstedt

  10. sanjaypande

    In your example, if there is no common ground between the two companies in the way they address/ID their customers, how would you handle it in ANY architecture?

    With the DV, at least you have a way to start the process of getting the data out and archiving it, till the same-as links are built. They MUST be driven by the business user because the business owns that data.

    In any other architecture, you’d have to wait a long time before you could even get to that stage. Also, with the DV, you’d end up having a richer data set to mine variables from, to assist the business user in their determination of the same accounts across companies.

    Entity/Identity resolution tools have come a long way and are intended exactly for this purpose. For some reason you’re calling this task “Real Data Warehousing” and it’s not.

    The DV architecture doesn’t require the source to have a timestamp for CDC with very few exceptions. It’s built in to the load process architecture. Source timestamp driven CDCs lead to all sorts of errors anyway – especially during testing.

    “How do you do CDC with a big source table that doesn’t even have an update timestamp and no way to touch the database to put triggers or read the logs?”

    As long as you have read privileges on the source, it’s not an issue. Triggers and logs are not needed. In a batch scenario, you would usually stage the source data set and, depending on how large it is, you can use the “data aging” techniques as described in the methodology.
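    For illustration only (this is a generic sketch with hypothetical names, not the specific “data aging” technique from the methodology): one common way to derive deltas without source timestamps is to stage the full extract and compare a hash of each row’s descriptive payload against the most recent stored version.

        import hashlib

        # Generic sketch: deriving change data without source timestamps by hashing
        # each staged row's descriptive payload and comparing to the latest stored hash.
        def hashdiff(attrs: dict) -> str:
            payload = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
            return hashlib.sha1(payload.encode()).hexdigest()

        latest_hashes = {"CUST-1001": hashdiff({"name": "Acme", "status": "Active"})}

        staged = [
            {"bk": "CUST-1001", "attrs": {"name": "Acme", "status": "Closed"}},         # changed
            {"bk": "CUST-2002", "attrs": {"name": "Widgets Inc", "status": "Active"}},  # new
        ]

        deltas = [row for row in staged if latest_hashes.get(row["bk"]) != hashdiff(row["attrs"])]
        print([row["bk"] for row in deltas])  # ['CUST-1001', 'CUST-2002']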

    The CDC is trivial. The other situation you put up is non-trivial and mundane.

    If you have to merge two businesses, the DV is a better enabler than any other method I’ve seen. The fact that your load processes can remain unchanged and you just have to rebuild marts depending on requirements contributes to a ton of cost savings because it avoids a ton of re-engineering at the storage and archival layer and gives you audit, CDC and integration points for free (except in the situation you pointed out).

    I’ll wager, it’s still better than any other method even for that particular scenario. (And, yes I’ve been in the exact situation more than once in more complex scenarios than account numbers).

  11. “1) the Data Vault is for back-end data warehousing, not for production release to business users directly”

    I’m not missing it. I said I was using the Kimball definition where your “back-end data warehouse” is defined as “a persistent staging area”. I admit I see the DV like that as well. But that’s a definition.

    “2) CDC has nothing to do with the Data Vault”
    “The DV architecture doesn’t require the source to have a timestamp for CDC with very few exceptions. It’s built in to the load process architecture.”

    I’m probably misinterpreting one of them: either it’s built in, or it has nothing to do with it – but it can’t be both.

    “3) The Data Vault brings business value by lowering complexity, separating (de-coupling) the business changes from the sourcing changes, and lowering maintenance costs.”

    That’s technical value, not business value. For me it increases complexity, as the customer data is now spread across a possibly large number of tables. As the application ages, the source model changes, so the number of satellites per hub increases, and therefore so does the complexity (if I’m correct). Even for you, at some point, the DV will become complex. For mere mortals it’s already complex at the beginning.

    “5) With a Data Vault you do NOT necessarily have disparate galaxies – I don’t know where you got your information, but this point you make is entirely incorrect.”
    Again, sanjaypande thinks the opposite, while noting the advantage of the DV model. I’m not saying that there is a better alternative – just that the consolidation work is outside the scope of the DV model.

    “6) Real Data Warehousing work is ALWAYS in the cleansing, standardizing, and presentation of the data to the business, making it usable to the business is WHAT we are doing when we construct data marts.”

    If “presentation of the data to the business” == “making it usable to the business”, then “Real Data Warehousing work is […] WHAT we are doing when we construct data marts”, and then Real Data Warehousing == data marts, i.e. not DV. I’m being honest in saying that this is what I understand from your statement.

    I think we agree that the DV is good for back-end data warehousing (also called a persistent staging area) and that the star schema is good for front-end data warehousing (also known as the set of data marts). Kimball focuses on the business side (or presentation side if you prefer) while you focus on the back-end side. So there is no overlap at all. Therefore those new types of slowly changing dimensions have nothing to do with the back-end side and can help DV users as well for their data marts. Just like when Kimball says “All staged data should be archived” without stating how (which enables recreating star schema data if needed), the DV can be the answer. You both have a back side and a front side, but each of you is talking about only one.

  12. Sebastien,

    Thank you for your replies. The focus of this post was and is to discuss the advancements of the Dimensional Model that are being proposed by the Kimball camp, the focus of this post is not to discuss the purpose of the Data Vault.

    1) Comparing a Kimball “data warehouse,” which is a federated star schema, to a “data vault” data warehouse is comparing apples to oranges – especially since a federated star schema with conformed dimensions requires business rules and “conformance” of the data set on the way in. This is a fundamental difference between the two, as the Data Vault model stores raw data integrated by business key. Furthermore, Kimball federated star schemas DO allow direct user access, and therefore must be tuned for end-user queries and ad-hoc accessibility. The Data Vault model is NOT tuned this way. It is tuned for mechanical ETL routines (process layers) that occur the same way every time they are run. This also is a fundamental difference.
    2) The Data Vault comparison to the “Kimball warehouse method” is not accurate unless you break the “Data Vault” into its constituent parts: the Data Vault Model would have to be compared with a persistent staging area, the Data Vault Methodology would have to be compared with the Kimball “Approach,” and the Data Vault 3-tier Architecture would have to be compared with the Kimball 2-tier architecture. I’ve blogged on this before; any other comparisons (as you are suggesting) would be to compare apples to oranges, and it simply doesn’t add up.
    3) CDC is built into the methodology; the Satellites in a Data Vault are delta driven by nature. The Hubs and Links only load new records where none existed before. This creates audit trails where there were none. But to say the “Data Vault Model” is CDC driven is an incorrect statement. The model is just that: a model. It would be similar to making a silly statement that the “Dimensional Model” is CDC driven, which again is incorrect. The “Dimensional Model” contains a definition of Type 2 Dimensions in which the process that loads the Type 2 dimension is CDC driven. That is a far more accurate statement.
    IF you want to do “CDC” off the source systems up-stream of either warehousing method, then that is an option for either implementation style – but again, it has nothing to do with the data modeling components.
    4) I disagree. Business feels the impact of the total cost of ownership of the overall data warehouse, and the ability of the team to be agile and support changes. These are business impacts that directly affect the maintenance, the complexity, and the changes necessary to evolve the data warehouse and keep it current with the business. These ARE business impacts, like it or not – call a spade a spade. They impact the business users and the project sponsors. With the Data Vault we do 2-week deliverable sprints (just like any other agile method of delivery), and we can lower the overall TCO for most organizations (when compared with the “traditional data warehousing paradigms” that they’ve implemented in the past). With the Data Vault model being so flexible, we can also deliver, in rapid succession, changes as dictated by the business, without re-engineering the existing architecture (read: data model), and without re-engineering existing load processes. These are huge value propositions to the business. Don’t take my word for it – go talk to some of the customers who have implemented successful Data Vaults and ask them what they think.

    If you lower complexity of a system, you lower maintenance costs, and time to market (build/release), both of which affect business value of the overall data warehousing system.

    Regarding the DV becoming complex, I’m afraid I don’t understand your point. Are you referring to the Data Vault Model, Methodology, or Architecture? And what specifically in your mind would make it “become complex” as you say? And if it were complex at the beginning for mere mortals, then why is it so easy to construct Data Vault models, and auto-generate load processes and stand up a Data Vault solution inside of 2 weeks at new customer sites?

    5) No, you mis-read Sanjay’s post. Besides which, please define in exacting terms what you mean by a “disparate galaxy” – and give me a complete example. I think we might be discussing two very different ideas here. And no – you missed the point of consolidation; there IS consolidation within the DV model and DV methodology – the consolidation occurs around the business keys and the definition of business keys. It is subtle, as we do not change data on the way in, but the “construction” rules and standards that come from the methodology lend themselves to passive data integration (no changes to context or content); however, keys which do exist as master keys across multiple systems DO have their data sets consolidated around single hub entries. So again, I am not sure what you are referring to.

    If you are referring to “cleansing,” or standardization, or aggregation, well then, those are entirely different things. Those are business rules and affect the interpretation of the data – i.e., turning data into information usable by the business; THAT is left to the ETL processes (again, the DV methodology specifies this) that build the Data Marts or Business Data Vault downstream. So, what part of consolidation do we miss? None, if you look at it from a complete Data Vault viewpoint. We end up at the same answers as you would with a “federated star schema”; the difference is (as I described above) the reduction in TCO and the agility (time to market) of delivery.

    6) There is a difference in terminology here. The term “data warehousing,” or to “do data warehousing,” means something different than “a data warehouse.” Again, if you look at Kimball’s existing book, the DW Lifecycle Toolkit, you will get at least two different definitions of what a data warehouse is. I am merely referring to the same thing. To do “data warehousing” is to execute on a methodology and stand up an architecture – which comprises ONE piece called a Data Warehouse. Therein lies the difference.

    The DATA WAREHOUSE in the Data Vault methodology is a raw data store integrated by business key (yes, it’s different, albeit subtly) from a Kimball-defined PSA. The Data Warehousing methodology and architecture for Data Vault include the Data Marts as an output layer; this is where we present the end-user with access to the data sets.

    I disagree with your assessment that “constructing data marts == real data warehousing”. There are plenty of cases where we’ve built Data Vault models as an enterprise data warehouse, and part of the “data warehousing project” was to provide compliance, accountability, and auditability. And another part of the process was to combine like terms under master keys, and another part was to turn the data in to information (hence the data marts on the other end).

    In my 25+ years of experience in the industry (government, banking, telecom, manufacturing, insurance, etc…) we’ve always had to define “data warehousing” as a system, made up of three components: the model, the systems architecture, and the methodology.

    There have also been times when slowly changing dimensions become too costly, in maintenance and complexity, for the business to bear, and it is at those points that the idea of a “fully conformed enterprise data warehouse built on federated dimensional models” has broken down at customer sites. But again, don’t take my word for it – seek out the customers in the community who have built successful Data Vaults and put your business questions to them. Coming from me it sounds like evangelism; coming from them you will get the truth. Although I am speaking from customer experience here, this is what they are telling me.

    Hope this helps,
    Dan Linstedt

  13. sanjaypande

    “I’m not missing it. I said I was using the Kimball definition where your “back-end data warehouse” is defined as “a persistent staging area”. I admit I see the DV like that as well. But that’s a definition.”

    I think there is some confusion in terminology. The PSA, or Persistent Staging Area, is promoted by folks who use the Kimball architecture. In my opinion, it’s an improvement over volatile staging since you do get audit and CDC in this area. The ETL can be generated and re-generated for this, but there are a few issues with this approach:

    1. You don’t get business key integration across the enterprise. You still end up with a bunch of staging tables which are identical to the source.

    2. Handling change isn’t as convenient as the DV and impacts existing routines.

    3. You have much more complexity on the way out since you have to get the grain of the data correct and you still have to do a LOT of work in a single step when pulling the data out.

    4. Because the data is not already patterned, it’s impossible to use any auto-generator for pulling data out. (Conversely, with the DV, adding a satellite is just another table for the generators because of the patterns.)

    So, we in the DV world think of the Kimball style PSA as another stop-gap they’ve added when compliance and audit became an issue. These were already built into the DV from the ground up.

    “I think we agree that DV is good for back-end data warehousing (also called persistent staging area) and that the star schema is good for the front-end data warehousing (also known as the set of datamarts).”

    I would not recommend calling the DV a persistent staging area even if you can compare the two. It’s not staging. It’s a DW.

    Also, we don’t always use Kimball for the data marts. It’s one of the options albeit currently the most popular. We don’t recommend SCDs or conformed mart architecture (BUS) for the marts when you have an underlying DV otherwise you’ll end up with 2 data warehouses and affect agility.

    There is a volatile staging before the DV (for batch style loads) and the marts are ALSO potentially volatile because they’re easy to build. With an underlying DV you can break all the Kimball rules and still deliver to the business very quickly.

    Your interpretation and differentiation of technical value and business value is moot, as Dan pointed out, because TCO and time to mart delivery do affect business value quite dramatically.

    The DV model is for warehousing the integrated source data sets. The methodology and the systems architecture DO talk about delivery to the business user and the usage of star schemas as a front-end, as well as other alternatives like the business data vault.

    In the end I think we’re arguing on semantics.

    Is the DV perfect?

    No. But it’s the best alternative out there. It borrows things from both Inmon and Kimball, and it goes in a completely different direction from both of them when talking about business rules.

    Talking about where the value lies is the old Kimball/Inmon debate and the truth is there is value in both the back-end and the front-end.

    The DV is just an enabler for delivering information to business faster by using as much patterning (and if possible automation) as possible and by avoiding as much re-engineering as possible.

    It’s also extremely forgiving, allowing you to make mistakes, because it provides a buffer layer which in turn enables agile delivery.

    Re: Disparate galaxies

    I think Dan misunderstood what you’re saying because it’s not a normal situation in a DV since everything is integrated by business keys across the enterprise.

    However, if you have two completely different enterprises in the mix, you would indeed have disparate galaxies. Finding out the commonalities is called identity resolution or entity resolution.

    The point I was making was, with the DV, you can still start the process of warehousing the data even before the “Same-As” links are built. You can have sets of data in production prior to the integration of two enterprises and these sets of data can potentially provide enough business value themselves to warrant doing it.

  14. Thanks to both of you for this interesting discussion. I think we agree on the whole picture, with just some minor differences that probably come from our backgrounds and experience.

  15. Ruben

    The article incorrectly suggests that Kimball proposes new types of slowly changing dimensions to “innovate” the dimensional model.
    However, all these techniques have been around for many years. The only “new” thing is the assignment of SCD numbers to formally label these techniques.

    (In fact, even some of the new SCD numbers are “old” news. Kimball University already taught SCD type 0 and type 6 back in 2010.)

  16. Hi Ruben, thank you for the clarification. However, Dr. Kimball’s posting clearly indicates these are “NEW types”, and in fact, uses this very language.

    Thank-you kindly,
    Dan Linstedt

  17. Ruben

    Hi Dan, you might want to read the second paragraph of the article again:

    We have written about more advanced SCD patterns, such as the 2005 article entitled ‘Slowly Changing Dimensions are Not Always as Easy as 1, 2, 3.’ However, we’ve not consistently named the more advanced and hybrid techniques. With the upcoming third edition of The Data Warehouse Toolkit (Wiley, Jun/Jul 2013), we’ve decided to assign ‘type numbers’ to several techniques that have been described, but not precisely labeled in the past. Our hope is that more specific technique names will facilitate clearer communication between DW/BI team members.

    As said, the SCD numbering might be new but the techniques have been around for a long time now.

    Thank you,
    Ruben van der Linden
