
Updates in Data Vault are DEAD! Finally

Hello everyone, I bring you exciting news: if you want your Data Vault 2.0 to be 100% insert compliant (as it was in the beginning, way back in 2001), then REMOVE last seen dates and REMOVE load end dates.

I’m here to tell you that if you aren’t focused on an insert-only architecture for your Data Vault 2.0, then you aren’t focused in the right place. Due to the scalability (volume) and sheer arrival speed (velocity) of data, your Data Vault 2.0 model should be highly focused on a 100% insert-only architecture.

Last seen dates were always a kludge, a crutch, to help developers get around “full table dumps” and detect data that disappeared. Load end dates were always a kludge, a crutch, to help developers issue BETWEEN clauses on query access.

Well, I’m happy to say: there is a way forward without either of these two kludgy attributes.

  1. Implement Record Source Tracking as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, OR as defined in my CDVP2 class. What??? You’re not yet certified? Why? What’s stopping you? I am offering two additional courses this year; you can sign up for them at: http://DataVaultCertification.com
  2. Implement Point In Time and Bridge Tables – AND DO IT PROPERLY!!! Again, as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, OR in my CDVP2 class.
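As a rough sketch of what “insert only plus PIT” looks like in practice (SQLite as a stand-in engine; the table and column names here are invented for illustration, not taken from the book): the satellite carries only a load date, and the PIT snapshot gives end users a pure equi-join path with no load end dates anywhere.

```python
# Insert-only satellite + PIT table sketch (hypothetical names, SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Insert-only satellite: no load_end_dts column, only load_dts.
cur.executescript("""
CREATE TABLE sat_customer (
    customer_hk   TEXT NOT NULL,
    load_dts      TEXT NOT NULL,
    name          TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
CREATE TABLE pit_customer (
    snapshot_dts  TEXT NOT NULL,
    customer_hk   TEXT NOT NULL,
    sat_load_dts  TEXT NOT NULL,   -- pointer to the satellite row in effect
    PRIMARY KEY (snapshot_dts, customer_hk)
);
""")

# Two deltas for the same customer; no update ever touches the first row.
cur.executemany("INSERT INTO sat_customer VALUES (?,?,?)",
                [("HK1", "2016-01-01", "Acme Ltd"),
                 ("HK1", "2016-06-01", "Acme Inc")])

# Build the PIT snapshot: latest satellite load_dts per key as of the snapshot.
cur.execute("""
INSERT INTO pit_customer
SELECT '2016-07-01', customer_hk, MAX(load_dts)
FROM sat_customer
WHERE load_dts <= '2016-07-01'
GROUP BY customer_hk
""")

# End-user query: pure equi-join, no BETWEEN, no MAX() subquery.
row = cur.execute("""
SELECT s.name
FROM pit_customer p
JOIN sat_customer s
  ON s.customer_hk = p.customer_hk
 AND s.load_dts    = p.sat_load_dts
WHERE p.snapshot_dts = '2016-07-01'
""").fetchone()
print(row[0])  # Acme Inc
```

The cost of the update is paid nowhere: the satellite load is a pure append, and the query side pays only an indexed equi-join through the PIT.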

By the way, the benefits of properly implementing these two paradigms far outweigh the costs of “updates” against the Data Vault model. Furthermore, these two structures move your entire EDW DV2 into the 21st century and make it 100% insert-only compliant.

Moving to a 100% insert-only architecture is the best thing you can do for your EDW DV2, now and into the future.

If you have questions or comments, feel free to post them below.

Thank-you kindly,
(C) Dan Linstedt, 2016 all rights reserved


16 Responses to “Updates in Data Vault are DEAD! Finally”

  1. Kent Graziano 2016/08/09 at 9:17 am #

    Totally agree Dan! An “insert only” architecture is the only way to go these days if you are at all concerned about Volume & Velocity of inbound data streams. Glad to see this is now “official”

  2. Dan Linstedt 2016/08/09 at 9:49 am #

    Hi Kent, thanks for the feedback. Yep, it’s finally official. I will be removing all annotations to these items from the CDVP2 training class. Well, I will still teach end-dates, but only as a secondary option, and I will list all the risks (which are many if one uses compression in the database). What people need to now understand is: proper use of PIT and Bridge tables, which I will blog on shortly. Thanks for the support, Dan

  3. Andreas 2016/08/10 at 12:53 am #

    Totally agree.

    1.) End-dating satellites = updates, which are too slow for huge data volumes. These updates just make the Raw Data Vault load take unnecessarily longer.
    2.) Use max(loaddate) for each query on satellites instead? No! The SQL statements are complex, execution times are bad, and the explain plans are terrible! A performance-tuning nightmare!

  4. Dan Linstedt 2016/08/10 at 5:33 am #

    If you are using max(loaddate) to produce data for end-user queries, then you do not understand PIT and Bridge tables and their function. However, that said: you only have two choices:
    1) use Load and Load End Dates to build the PIT & Bridge Images
    2) use LEAD OLAP functions (as Kent Graziano stated in his blog entries) to BUILD the PIT and Bridge images.

    Once the PIT and Bridge images are built, the end-user ad-hoc queries should be equal-joins directly to the Satellite (no longer needing end-dates or nested select max(loaddate) queries). If you cannot get the equal-join to work against the PIT or Bridge, then: a) the PIT and Bridge have not been properly set up with ghost records, or b) the PIT and Bridge have not been properly designed.
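A minimal sketch of option 2 above, assuming SQLite as a stand-in engine and invented table/column names: the LEAD window function derives each satellite row's end of validity at PIT/Bridge build time, so the satellite itself never stores (or updates) a load end date.

```python
# Deriving load-end spans with LEAD at build time (hypothetical names).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE sat_product (
    product_hk TEXT, load_dts TEXT, price REAL)""")
cur.executemany("INSERT INTO sat_product VALUES (?,?,?)",
                [("HK9", "2016-01-01", 10.0),
                 ("HK9", "2016-03-01", 12.0),
                 ("HK9", "2016-09-01", 11.0)])

# Each row's span ends where the next delta begins; the current row stays
# open-ended with a '9999-12-31' high date (ghost-record style).
rows = cur.execute("""
SELECT product_hk, load_dts,
       COALESCE(LEAD(load_dts) OVER (PARTITION BY product_hk
                                     ORDER BY load_dts),
                '9999-12-31') AS load_end_dts,
       price
FROM sat_product
ORDER BY load_dts
""").fetchall()
for r in rows:
    print(r)
```

The computed load_end_dts values would then be materialized into physical PIT/Bridge columns, so downstream BETWEEN queries hit indexed fields instead of re-running the window function.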

    Hope this helps,

  5. Andreas 2016/08/10 at 11:09 pm #

    Hi Dan,

    sorry for the confusion.

    I wanted to say that using max(loaddate) on SAT tables is cumbersome: SQL statements are complex, execution times are bad and explain plans are terrible.

    Using PIT tables in joins to SAT tables instead is the faster approach. Easier join conditions and the possibility to avoid outer joins with ghost records.

  6. Dave 2016/09/13 at 8:12 am #

    Does this mean that handling of Driving Keys is now done in PIT tables (i.e. no need for the ‘Load End Date’ method described on pg. 121 of Building a Scalable Data Warehouse with Data Vault 2.0)?

  7. Eric 2016/10/03 at 6:06 am #

    Hey guys,

    I have been doing this for the last 1.5 years! Cool that it finally is official; that way I don’t have to argue to get my point across 😉



  8. Evgeniy 2016/11/13 at 9:29 pm #

    Hi Dan,

    Does this mean that we will not be able to establish in Raw Vault and beyond (Business Vault / Data Marts) that the record does not appear in the source system anymore?

    Are you saying that there are no legitimate business cases that require this information?


  9. Dan Linstedt 2016/11/14 at 3:52 am #

    Hi Evgeniy,
    I am not understanding your question. But I will attempt to guess. The point here is this: end-dates are not required as a part of the architecture in order to make your data auditable. End-dates were NEVER required. More specifically Load End Dates. I am not referring to “business driven end dates” that come in from the source feeds, I am however referring specifically to load end dates. Now, if you are stuffing a load-end date with a business driven source date, then your load pattern is flawed and needs to be corrected.

    Load-End-Dates are system driven values only, and should never be used or seen by business. That said, you can (if you want) calculate the span (between load dates and load end dates) and materialize that span inside of a point-in-time or bridge table (this is the appropriate / proper design technique).

    I cover all of this and more in my CDVP2 course.

    Dan Linstedt

  10. Evgeniy 2016/11/14 at 6:35 pm #

    Hi Dan,

    Thank you for the detailed response – let me explain in more detail…

    Unless you capture the load end date of a satellite record in some way in the Raw Vault, e.g. using a Load Effective From / Load Effective To model (example of a satellite table below)

    CREATE TABLE [Sat Customer Details] ([Customer Hash Key], Load_Effective_From, Load_Effective_To, Name, Address, … )

    …or some other format, the downstream layers (Business Vault & Data Marts) will not know that a satellite record no longer appears in the source system.

    E.g. a customer was soft- or hard-deleted in the source system, and therefore should not appear in the data mart customer dimension.

    Are you proposing to capture Load End Date (or Load Effective To) date as a separate insert in a satellite table in Raw Vault?

    Where will PIT table generation code find out that this satellite record does not exist in the source?

    Thank you,

  11. Dan Linstedt 2016/11/15 at 7:42 am #

    Unfortunately you have changed the meaning of the Load End Date. You appear to be using it as a deleted date marker. This is completely incorrect – it overloads the definition of Load End Date and defeats the primary purpose of what the Load End Date truly represents.

    If you want deleted dates or to track deleted satellite data, you should be using an added attribute called “deleted date” in the Sat itself, or leveraging Status Tracking Satellites.

    Load End Dates have one and only one purpose: to show that the satellite record has been superseded by a new delta. This can and should be calculated on the way out – to the PIT and Bridge tables, where the span of the record is captured and stored in a physical manner.

    Soft and hard deletes are handled completely differently (as I suggest here). I cover all of this in deep detail in my CDVP2 class; the book, Building a Scalable Data Warehouse with Data Vault 2.0, also covers these aspects, complete with query code.

    Unfortunately you’ve made a mistake in assigning a “deleted date” to be the Load End Date value – which is the wrong way to approach this problem.

    I am sorry about that, but this is the change that needs to be made / corrective action to be applied.

    Hope this helps a bit,
    Dan Linstedt
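A hypothetical sketch of the status-tracking idea above (SQLite, with invented table and column names): a deletion at the source becomes a new inserted status row, never an update, so the insert-only architecture is preserved while downstream layers can still see that a key vanished.

```python
# Status-tracking satellite sketch: deletes recorded as inserts.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE sts_customer (   -- status tracking satellite
    customer_hk TEXT, load_dts TEXT, status TEXT)""")

def track(load_dts, staged_keys):
    """Insert-only status rows: 'I' on first sight, 'D' when a key vanishes."""
    staged = set(staged_keys)
    # Latest known status per key (done Python-side for clarity).
    last = dict(cur.execute("""
        SELECT customer_hk, status FROM sts_customer t
        WHERE load_dts = (SELECT MAX(load_dts) FROM sts_customer x
                          WHERE x.customer_hk = t.customer_hk)"""))
    rows = [(k, load_dts, "I") for k in staged - last.keys()]
    rows += [(k, load_dts, "D") for k, st in last.items()
             if st != "D" and k not in staged]
    cur.executemany("INSERT INTO sts_customer VALUES (?,?,?)", rows)

track("2016-11-01", ["HK1", "HK2"])   # both keys appear in the feed
track("2016-11-02", ["HK1"])          # HK2 was deleted at the source

latest = cur.execute("""SELECT status FROM sts_customer
    WHERE customer_hk = 'HK2' ORDER BY load_dts DESC LIMIT 1""").fetchone()
print(latest[0])  # D
```

The downstream PIT or dimension build can then filter on the latest status per key, with no load end date overloaded to carry delete semantics.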

  12. Krzysztof Rychlik 2016/12/02 at 10:56 am #

    hi guys,

    I got a question on this and I am hoping someone can shed some light for me. I am new to DV and only recently did I start implementing it. So apologies up front if my question is silly. 🙂 I feel like I’m missing something very obvious.

    I have been doing updates to my Load End Date fields. I now want to perform inserts only, as recommended. Using the lead/lag function in SQL Server I am able to derive end dates whenever I need them, so I don’t need to perform an actual update to the end date field.

    However, when I do an insert to my satellite I still need to be able to select the most recent record from my satellite and compare it to my record in staging to see if any of the descriptive fields have changed. I was using this where clause before: WHERE STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD AND SAT.LOADENDDATE IS NULL

    Since I won’t be using the LOADENDDATE field anymore, how can I know which record is the active one? If I don’t use the most recent one, I will be adding the new record from staging over and over again each time the job runs, because there is always going to be a record in the satellite (the old record) where STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD

    So the solution is to do inserts with my left outer join, but using the window function: I do my left outer join to a view where I select the most recent record from the satellite using the lead/lag function. But wouldn’t that be as costly as an update? My gut tells me this is not the correct approach.

    What am I missing? what is the correct approach to this?
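    The insert-only delta check described above can be sketched as follows (SQLite as a stand-in engine, invented names; whether it beats an update in practice depends heavily on indexing of the key and load date columns). The comparison targets the latest satellite row per key via MAX(load_dts) rather than LOADENDDATE IS NULL:

```python
# Insert-only satellite delta detection against the latest row per key.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE sat_cust (hk TEXT, load_dts TEXT, addr TEXT);
CREATE TABLE stg_cust (hk TEXT, addr TEXT);
""")
cur.execute("INSERT INTO sat_cust VALUES ('HK1','2016-01-01','Old St')")
cur.execute("INSERT INTO stg_cust VALUES ('HK1','New St')")

# A staged row is a delta when its key is new, or the descriptive payload
# differs from the LATEST satellite row for that key.
DELTA_SQL = """
SELECT g.hk, g.addr
FROM stg_cust g
LEFT JOIN sat_cust s
  ON s.hk = g.hk
 AND s.load_dts = (SELECT MAX(load_dts) FROM sat_cust x WHERE x.hk = g.hk)
WHERE s.hk IS NULL OR s.addr <> g.addr
"""
deltas = cur.execute(DELTA_SQL).fetchall()
cur.executemany("INSERT INTO sat_cust VALUES (?, '2016-02-01', ?)", deltas)
n = cur.execute("SELECT COUNT(*) FROM sat_cust").fetchone()[0]
print(n)  # 2 rows: the original plus the new delta

# Re-running the same check finds nothing new: the load is idempotent.
print(len(cur.execute(DELTA_SQL).fetchall()))  # 0
```

    Because only the single latest row per key is compared, re-running the job does not re-insert unchanged staging records.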


  13. Piotr 2016/12/07 at 11:27 am #

    I’m working on a DV model at the moment and would like to get rid of load end dates, but I’m lacking a few things – I’m not sure how to implement the PIT refreshes. I understand that I need to refresh the PIT table after each load of the related Hub and its Satellites? Then a process of deleting old snapshots will also be needed – I guess as described in the book – so that the PIT table does not grow to infinity.
    What about models dedicated to real-time loads? I feel like it won’t be feasible to add a new snapshot to the PIT table after each (or every couple of) key inserts to the Hub and/or Satellites. How do you tackle this?

  14. Piotr 2016/12/12 at 10:20 am #

    I’ve got another question – regarding the last seen dates in the hubs and links: in order to get rid of them, I understand that you recommend using the record tracking satellites (chapter 5.3.5). I’ve got a link with 3 million records, and if I understand properly, that means 3 million records will be added to the record tracking satellite by each load. I don’t think this is acceptable. Do I understand the idea correctly at all?

  15. Steve 2017/03/27 at 8:37 am #

    Hi Dan, I’d like your input regarding Driving Keys as well. We’re actually realising the LoadEndDate serves little purpose for us, since we are “virtually” creating PIT tables, but we wonder how to manage it without updating the LoadEndDate during the load of links driven by LDK.

    Note that we have a particular use case here: we are loading and presenting data in near-real-time. Our “batches” run every 15 minutes and we’d like to bring this to a 5-minute refresh rate.
    I know the PIT should be a “snapshot”, but I fear this will be hard at a 15-minute refresh rate when we are speaking of a couple hundred million satellite and link rows.

    I’ll be attending your class in Vermont in April, but hints would be great since we are already rolling into production as I write these lines and the normal load just takes way too long because of the updates.

    Kind regards,

  16. Dan Linstedt 2017/04/13 at 7:43 am #

    Hi Steve,

    Surprise, surprise: Driving Keys are a modeling mistake… The reality is much simpler than written in the book. The problem is that the Satellite is in the wrong place. Move the Satellite to the HUB where the Driving Key “lives”, and the problems are all solved. I will be posting on this soon, very soon. Now, as for PIT and BRIDGE loads: use OLAP and CTE functions to calculate spans / begins and ends, etc., that are business driven, and put those into physical columns in the PIT and BRIDGE structures; then the business user queries can leverage between clauses with indexed fields.

    On a 5-minute or 15-minute refresh rate, you should build the load of the PIT into a process that detects only “deltas” to load to the PIT table. In other words, this is one of the advanced techniques for loading PITs that I’ve never documented or offered a class on; the idea is an incremental PIT instead of a full load for every cycle. There are a couple of techniques for that, and I will be discussing them soon. But we can talk about hundred-million-row tables and billion-row tables – I have customers using both successfully. I’d need to know more about your back-end RDBMS before I can comment.

    The class in April will touch on these things, but sadly, I don’t have enough time to dive deep into advanced concepts… that said, after class or before class we will have some time to do it.

    I think, perhaps, it may be best to secure some remote consulting time of mine, we can work this out over the web. Feel free to contact me directly: DanLinstedt ( @ ) gmail.com

    Dan Linstedt
