
Updates in Data Vault are DEAD! Finally

Hello everyone, I am bringing you exciting news: if you want your Data Vault 2.0 to be 100% insert-compliant (as it was in the beginning, way back in 2001), then remove last seen dates and remove load end dates.

I’m here to tell you that if you aren’t focused on an insert-only architecture for your Data Vault 2.0, then you aren’t focused in the right place. Due to the scalability (volume) and sheer arrival speed (velocity) of data, your Data Vault 2.0 model should be highly focused on a 100% insert-only architecture.

Last seen dates were always a kludge, a crutch, to help developers get around “full table dumps” and detect data that disappeared. Load end dates were always a kludge, a crutch, to help developers issue BETWEEN clauses on query access.

Well, I’m happy to say: there is a way forward without either of these two kludgy attributes.

  1. Implement Record Source Tracking as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, or as defined in my CDVP2 class. What??? You’re not yet certified? Why? What’s stopping you? I am offering two additional courses this year; you can sign up for them at: http://datavaultalliance.com
  2. Implement Point-in-Time and Bridge tables – and do it properly!!! Again, as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, or in my CDVP2 class.

By the way, the benefits of properly implementing these two paradigms far outweigh the costs of “updates” against the Data Vault model. Furthermore, these two structures move your entire EDW DV2 into the 21st century and make it 100% insert-only compliant.

Moving to a 100% insert-only architecture is, now and into the future, the best thing you can do for your EDW DV2.

If you have questions or comments, feel free to post them below.

Thank-you kindly,
© Dan Linstedt, 2016. All rights reserved.


34 Responses to “Updates in Data Vault are DEAD! Finally”

  1. Kent Graziano 2016/08/09 at 9:17 am #

    Totally agree, Dan! An “insert only” architecture is the only way to go these days if you are at all concerned about the Volume & Velocity of inbound data streams. Glad to see this is now “official”!

  2. Dan Linstedt 2016/08/09 at 9:49 am #

    Hi Kent, thanks for the feedback. Yep, it’s finally official. I will be removing all annotations to these items from the CDVP2 training class. Well, I will still teach end-dates, but only as a secondary option, and I will list all the risks (which are many if one uses compression in the database). What people need to now understand is: proper use of PIT and Bridge tables, which I will blog on shortly. Thanks for the support, Dan

  3. Andreas 2016/08/10 at 12:53 am #

    Totally agree.

    1.) End dating satellites = updates, which are too slow for huge data amounts. These updates just make the Raw Data Vault load take UNNECESSARILY LONGER.
    2.) Use max(loaddate) for each query on satellites instead? No! SQL statements are complex, execution times are bad and explain plans are terrible! A PERFORMANCE TUNING NIGHTMARE!

  4. Dan Linstedt 2016/08/10 at 5:33 am #

    If you are using max(loaddate) to produce data for end-user queries, then you do not understand PIT and Bridge tables and their function. However, that said: you only have two choices:
    1) use Load and Load End Dates to build the PIT & Bridge Images
    2) use LEAD OLAP functions (as Kent Graziano stated in his blog entries) to BUILD the PIT and Bridge images.

    Once the PIT and Bridge images are built, the end-user AD-HOC queries should be equal joins direct to the Satellite (not needing end-dates or nested select max(loaddate) queries anymore). If you cannot get the equal-join to work against the PIT or Bridge, then: a) the PIT and Bridge have not been properly set up with ghost records, or b) the PIT and Bridge have not been properly designed.
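    As a minimal sketch of option 2 (illustrative table and column names; SQLite's window functions, available from SQLite 3.25 on, stand in for the LEAD OLAP function of any RDBMS), a load end date can be derived virtually with LEAD() on the way out to the PIT build, with no UPDATE ever issued against the satellite:

```python
import sqlite3

# Hypothetical insert-only satellite: one row per delta, no load end date.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (hk TEXT, load_dts TEXT, name TEXT);
INSERT INTO sat_customer VALUES
  ('A', '2016-01-01', 'Acme'),
  ('A', '2016-03-01', 'Acme Corp'),
  ('B', '2016-02-01', 'Beta');
""")

# Each row "ends" where the next delta for the same key begins; the
# open delta gets a far-future date. This runs only while building the
# PIT/Bridge image, so the satellite itself stays 100% insert only.
rows = con.execute("""
SELECT hk, load_dts,
       LEAD(load_dts, 1, '9999-12-31')
         OVER (PARTITION BY hk ORDER BY load_dts) AS load_end_dts
FROM sat_customer
ORDER BY hk, load_dts
""").fetchall()

print(rows)
```

    The computed span would then be materialized as physical columns in the PIT or Bridge, so end-user queries never need the nested max(loaddate) sub-select.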

    Hope this helps,

  5. Andreas 2016/08/10 at 11:09 pm #

    Hi Dan,

    sorry for the confusion.

    I wanted to say that using max(loaddate) on SAT tables is cumbersome: SQL statements are complex, execution times are bad and explain plans are terrible.

    Using PIT tables in joins to SAT tables instead is the faster approach. Easier join conditions and the possibility to avoid outer joins with ghost records.
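    The equal-join being described can be sketched as follows (illustrative names; the ghost-record key and dates are placeholders). Because the ghost record guarantees every PIT pointer resolves to some satellite row, a plain inner join on equality suffices and no outer join is needed:

```python
import sqlite3

# Hypothetical satellite and PIT; the 'GHOST' row is the zero-key
# ghost record that PIT pointers fall back to when no delta exists.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat (hk TEXT, load_dts TEXT, name TEXT);
INSERT INTO sat VALUES
  ('GHOST', '1900-01-01', NULL),
  ('A', '2016-01-01', 'Acme');
CREATE TABLE pit (hk TEXT, snapshot_dts TEXT, sat_hk TEXT, sat_ldts TEXT);
INSERT INTO pit VALUES
  ('A', '2016-06-01', 'A', '2016-01-01'),
  ('B', '2016-06-01', 'GHOST', '1900-01-01');  -- B has no sat row yet
""")

# Equal join: key + load date, no BETWEEN, no outer join.
rows = con.execute("""
SELECT p.hk, s.name
FROM pit p
JOIN sat s ON s.hk = p.sat_hk AND s.load_dts = p.sat_ldts
WHERE p.snapshot_dts = '2016-06-01'
ORDER BY p.hk
""").fetchall()

print(rows)  # [('A', 'Acme'), ('B', None)]
```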

  6. Dave 2016/09/13 at 8:12 am #

    Does this mean that handling of Driving Keys is now done in PIT tables (i.e. no need for the ‘Load End Date’ method described on p. 121 of “Building a Scalable Data Warehouse with Data Vault 2.0”)?

  7. Eric 2016/10/03 at 6:06 am #

    Hey guys,

    I have been doing so for the last 1.5 years! Cool that it is finally official; that way I don’t have to argue to get my point across 😉



  8. Evgeniy 2016/11/13 at 9:29 pm #

    Hi Dan,

    Does this mean that we will not be able to establish in Raw Vault and beyond (Business Vault / Data Marts) that the record does not appear in the source system anymore?

    Are you saying that there are no legitimate business cases that require this information?


  9. Dan Linstedt 2016/11/14 at 3:52 am #

    Hi Evgeniy,
    I am not understanding your question. But I will attempt to guess. The point here is this: end-dates are not required as a part of the architecture in order to make your data auditable. End-dates were NEVER required. More specifically Load End Dates. I am not referring to “business driven end dates” that come in from the source feeds, I am however referring specifically to load end dates. Now, if you are stuffing a load-end date with a business driven source date, then your load pattern is flawed and needs to be corrected.

    Load-End-Dates are system driven values only, and should never be used or seen by business. That said, you can (if you want) calculate the span (between load dates and load end dates) and materialize that span inside of a point-in-time or bridge table (this is the appropriate / proper design technique).

    I cover all of this and more in my CDVP2 course.

    Dan Linstedt

  10. Evgeniy 2016/11/14 at 6:35 pm #

    Hi Dan,

    Thank you for the detailed response – let me explain in more detail…

    Unless you capture the load end date of a satellite record in some way in Raw Vault, e.g. using a Load Effective From / Load Effective To model (example of a satellite table below)

    CREATE TABLE [Sat Customer Details] ([Customer Hash Key], Load_Effective_From, Load_Effective_To, Name, Address, … )

    …or some other format, the downstream layers – Business Vault & Data Marts – will not know that a satellite record does not appear in the source system anymore.

    E.g. a customer was soft- or hard-deleted in the source system; therefore it should not appear in the data mart customer dimension.

    Are you proposing to capture the Load End Date (or Load Effective To) date as a separate insert in a satellite table in Raw Vault?

    How will the PIT table generation code find out that this satellite record does not exist in the source?

    Thank you,

  11. Dan Linstedt 2016/11/15 at 7:42 am #

    Unfortunately you have changed the meaning of the Load End Date. You appear to be using it as a deleted date marker. This is completely incorrect – it overloads the definition of Load End Date and defeats the primary purpose of what the Load End Date truly represents.

    If you want deleted dates or to track deleted satellite data, you should be using an added attribute called “deleted date” in the Sat itself, or leveraging Status Tracking Satellites.

    Load End Dates have one and only one purpose: to show that the satellite record has been superseded by a new delta. This can and should be calculated on the way out – to the PIT and Bridge tables, where the span of the record is captured and stored in a physical manner.

    Soft and hard deletes are handled completely differently (as I suggest here). I cover all of this in deep detail in my CDVP2 class; the book Building a Scalable Data Warehouse with Data Vault 2.0 also covers these aspects, complete with query code.

    Unfortunately, you’ve made a mistake in assigning a “deleted date” to be the Load End Date value – which is the wrong way to approach this problem.

    I am sorry about that, but this is the change that needs to be made / corrective action to be applied.

    Hope this helps a bit,
    Dan Linstedt

  12. Krzysztof Rychlik 2016/12/02 at 10:56 am #

    hi guys,

    I got a question on this and I am hoping someone can shed some light for me. I am new to DV and only recently did I start implementing it. So apologies up front if my question is silly. 🙂 I feel like I’m missing something very obvious.

    I have been doing updates to my Load End Date fields. I now want to perform inserts only, as recommended. Using the LEAD/LAG functions in SQL Server I am able to display end dates whenever I need them, so I don’t need to perform an actual update to the end date field.

    However, when I do an insert to my satellite I still need to be able to select the most recent record from my satellite and compare it to my record in staging, to see if any of the descriptive fields have changed. I was using the where clause before: WHERE STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD AND SAT.LOADENDDATE IS NULL

    Since I won’t be using the LOADENDDATE field anymore, how can I know which record is the most active one? If I don’t use the most recent one, I will be adding the new record from staging over and over again each time the job runs, because there is always going to be a record in the satellite (the old record) where STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD

    So the solution is to do inserts with my left outer join, but using the window function: I do my left outer join to a view where I select the most active record from the satellite using the LEAD/LAG function – but wouldn’t that be equally costly to an update? My gut tells me this is not the correct approach.

    What am I missing? What is the correct approach to this?


  13. Piotr 2016/12/07 at 11:27 am #

    I’m working on a DV model at the moment and would like to get rid of load end dates, but I’m lacking a few things – I’m not sure how to implement the PIT refreshes. I understand that I need to refresh the PIT table after each load of the related Hub and its Satellites? Then a process of deleting old snapshots will also be needed – I guess as described in the book – so that the PIT table does not grow to infinity?
    What about models dedicated to real-time loads? I feel like it won’t be feasible to add a new snapshot to the PIT table after each load, or after every couple of key inserts to the Hub and/or Satellites. How do I tackle this?

  14. Piotr 2016/12/12 at 10:20 am #

    I’ve got another question – about the last seen dates in the hubs and links: in order to get rid of them, I understand that you recommend using the record tracking satellites (chapter 5.3.5). I’ve got a link with 3 million records, and if I understand properly it means 3 million records will be added to the record tracking satellite with each load. I don’t think this is acceptable. Do I understand the idea correctly at all?

  15. Steve 2017/03/27 at 8:37 am #

    Hi Dan, I’d like your input regarding Driving Keys as well. We’re actually realising the LoadEndDate serves little purpose for us, since we are “virtually” creating PIT tables, but we wonder how to manage it without updating the LoadEndDate during the load of links driven by LDK.

    Note that we have a particular use case here: we are loading and presenting data in near-real-time. Our “batches” run every 15 minutes, and we’d like to bring this to a 5-minute refresh rate.
    I know the PIT should be a “snapshot”, but I fear this will be hard at a 15-minute refresh rate when we are speaking of a couple hundred million satellite and link rows.

    I’ll be attending your class in Vermont in April, but hints would be great, since we are already rolling into production as I write these lines and the normal load just takes way too long because of the updates.

    Kind regards,

  16. Dan Linstedt 2017/04/13 at 7:43 am #

    Hi Steve,

    Surprise, surprise: Driving Keys are a modeling mistake… The reality is much simpler than written in the book. The problem is that the Satellite is in the wrong place. Move the Satellite to the HUB where the Driving Key “lives”, and the problems are all solved. I will be posting on this soon, very soon. Now, as for PIT and BRIDGE loads: use OLAP and CTE functions to calculate spans / begins and ends, etc., that are business driven, and put those into physical columns in the PIT and BRIDGE structures; then the business user queries can leverage BETWEEN clauses with indexed fields.

    On a 5-minute or 15-minute refresh rate, you should build the load of the PIT into a process that detects only “deltas” to load to the PIT table. In other words, there are some advanced techniques for loading PITs that I’ve never documented or offered a class on; the idea is an incremental PIT instead of a full load for every cycle. There are a couple of techniques for that, and I will be discussing them soon. But we can talk about hundred-million-row tables and billion-row tables – I have customers using both successfully. I’d need to know more about your back-end RDBMS before I can comment.

    The class in April will touch on these things, but sadly, I don’t have enough time to dive deep into advanced concepts… that said, after class, or before class, we will have some time to do it.

    I think, perhaps, it may be best to secure some remote consulting time of mine, we can work this out over the web. Feel free to contact me directly: DanLinstedt ( @ ) gmail.com

    Dan Linstedt

  17. Dave 2017/06/20 at 9:35 am #

    Hi Dan,

    did you have an answer for Krzysztof Rychlik regarding how to identify the most recent Satellite record against which to compare the HashDiff when we no longer have a LoadEndDate?

  18. Dan Linstedt 2017/06/20 at 10:35 am #

    Hi Krzysztof,

    1) Load End Dates do not, nor have they ever, carried any business value or meaning.
    2) People misuse and incorrectly apply load end dates; they overload the meaning and cause problems for queries and understanding.
    3) You can & should use OLAP and CTE functions (LEAD & LAG) to calculate the load-end-dates on the way to the PIT and BRIDGE structures IF you truly need them, but in reality you should use the BUSINESS dates that arrive from the source system instead.
    4) Load End Dates CANNOT be maintained in real-time; they DO NOT WORK, because streamed transactions, most of the time, DO NOT arrive in the order in which they were created. For this reason alone, Load End Dates are DEAD. Do NOT use them.
    5) I teach ALL of this and more in my CDVP2 class (http://DataVaultCertification.com) you can check it out, possibly try to attend and get certified if you can, it will answer all your questions and more.
    6) To know which record is the most active, use the sub-select MAX query, or apply the query against the PIT table to find the most active / most recent record. This is described in detail in Chapter 6 of the book Building a Scalable Data Warehouse with Data Vault 2.0.

    No, Left outer joins are not always the proper technique. That depends on the database, the indexing, the partitioning, and so on…
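    For point 6, here is a minimal sketch of the sub-select MAX approach (illustrative table and column names, demonstrated in SQLite): a staged hashdiff is compared against the most recent satellite delta, with no load end date column anywhere:

```python
import sqlite3

# Hypothetical insert-only satellite: one row per delta per key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat (hk TEXT, load_dts TEXT, hashdiff TEXT);
INSERT INTO sat VALUES
  ('A', '2016-01-01', 'h1'),
  ('A', '2016-02-01', 'h2'),
  ('B', '2016-01-15', 'h9');
""")

def needs_insert(hk, staged_hashdiff):
    """Compare a staged row against the most recent satellite row,
    found via a correlated MAX(load_dts) sub-select -- no load end
    date and no UPDATE statement involved."""
    row = con.execute("""
        SELECT hashdiff FROM sat
        WHERE hk = ?
          AND load_dts = (SELECT MAX(load_dts) FROM sat WHERE hk = ?)
    """, (hk, hk)).fetchone()
    return row is None or row[0] != staged_hashdiff

print(needs_insert('A', 'h2'))   # False: matches the latest delta
print(needs_insert('A', 'h3'))   # True: changed, insert a new delta
print(needs_insert('C', 'h1'))   # True: brand new key
```

    On a real platform the same comparison would be done set-based in the loading pattern; whether a MAX sub-select or a window function performs better depends on the database, the indexing, and the partitioning, as noted above.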

    Hope this helps,

  19. Dan Linstedt 2017/06/20 at 10:38 am #

    Hi Piotr,

    This is all covered in the CDVP2 class, please plan on attending the Certification course for Data Vault 2.0.

    The short answer is: The PIT and Bridge tables should both be rolling windows of time. Not ALL history belongs in the PIT, and furthermore, you don’t need to keep JUST one PIT, sometimes we create multiple PIT and Bridge tables to meet different business needs.

    Real-time is covered during my CDVP2 course, it is a 2 hour lecture, and contains far too much information to describe the answer here. I am sorry about that. It is definitely feasible, and can be done – we’ve built it in many many many solutions over the past 10 years.

    Thank-you kindly,
    Dan Linstedt

  20. Dan Linstedt 2017/06/20 at 10:40 am #

    You understand the base idea, and Record Tracking is feasible as long as you turn on compression. 3 million records really is not that much at all, and with Record Tracking done properly, compression should buy you 85% savings or more. Yes, Last Seen Dates are also DEAD… Anything that introduces an UPDATE / physical DB update is dead, because it does not scale and it does not work properly in real-time. This is the nature of what we do, until we have access to new technology like “temporal tables” in Microsoft SQL Server 2016.

    Hope this helps,
    Dan Linstedt

  21. Dan Linstedt 2017/06/20 at 10:41 am #

    Hi Dave,

    Just posted an answer. Thanks.

    Hope this helps,

  22. Eduardo Grosso 2017/08/06 at 11:57 pm #

    Hi Dan,
    It is interesting to see how confusing these end dates are. I believe what most people need is a business version end date for a data entity instance, i.e. the version of the entity ceased to exist on this “end date”.

    For example, if a customer changed their address, and this is captured in a satellite table depending on the customer hub, and the satellite table contains only addresses, the version of the element in this particular satellite table needs to somehow indicate that it is part of the past. Now, sometimes there are no “business dates” that will tell you this. Some systems just change the address of the customer, overriding whatever address the customer previously had. Hence we need to fabricate an end date, as in the DW we want to capture the temporal aspect of these changes, even though this is not captured in the source.

    This implies we need to introduce a new data element, the end date of the address where this customer ceased to live, introduced either as a “version start date” and/or a “version end date”. We also need to consider that the “version start date” alone is not enough, as it implies there will be a new version with a new “version start date”, but that may not be true. The customer ceased to live at an address, but we may not have his/her new address. How can we resolve this business problem, then, without start and end dates?

  23. Dan Linstedt 2017/08/07 at 8:22 am #

    Thank you for your comment. You are correct in stating that business needs BUSINESS driven end-dates / start dates. What I will say is this: you are focusing (once again) on the wrong thing. Load Dates and Load End Dates are SYSTEM driven, for the data warehouse only. They are MEANINGLESS to the business. (Sorry for the emphasis.)

    Load End Dates are DEAD, and will remain this way. In fact, NO “physical database update” can survive in the ever-growing data sets we live in, and any attempt to “issue an update statement” in the database will *require* re-engineering and re-design / elimination as the data set grows beyond the platform’s ability to execute in a timely manner. Furthermore, again: load end dates are system driven. Once you load data from multiple time zones to a single warehouse, “load end dates” get confused and become useless. Once you load data to a cloud driven data warehouse, Load Dates AND Load End Dates become confused and useless… So this begs the question: how and when can you / should you track the status of records disappearing off source systems?

    Valid question: the answer is two-part:
    1) ALL ROADS LEAD TO CDC (CHANGE DATA CAPTURE). This is the ONLY sure-fire auditable method for executing full compliance with GDPR laws and regulations. Don’t have CDC? Then the EDW WILL FAIL AN AUDIT, AND WILL NOT BE IN COMPLIANCE WITH THE NEW GDPR LAWS and regulations!

    2) Best guess. OK, so reality says it will take years to comply fully with GDPR or achieve complete auditability; all right, so best guess comes in. This means calculations for rows disappearing “have to happen” – we do this using Record Source Tracking Satellites, and we compute physical End Date columns on the way to the Business Vault or the Information Marts.
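    As a minimal sketch of the best-guess approach (illustrative names, demonstrated in SQLite): a record tracking satellite receives one insert per key per full extract, so a key whose last appearance predates the latest load cycle can be flagged as having disappeared from the source, and the computed end date materialized downstream:

```python
import sqlite3

# Hypothetical record tracking satellite: one row per key per load
# cycle in which the key appeared on a full-extract source feed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rts_customer (hk TEXT, load_dts TEXT);
INSERT INTO rts_customer VALUES
  ('A', '2016-01-01'), ('B', '2016-01-01'),
  ('A', '2016-02-01');  -- B is absent from the 2016-02-01 extract
""")

# A key whose last appearance predates the latest load cycle has (by
# best guess, not CDC) disappeared from the source. The derived end
# date goes downstream, never back into the Raw Vault as an UPDATE.
gone = con.execute("""
SELECT hk, MAX(load_dts) AS last_seen
FROM rts_customer
GROUP BY hk
HAVING MAX(load_dts) < (SELECT MAX(load_dts) FROM rts_customer)
""").fetchall()

print(gone)  # keys missing from the most recent full extract
```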

    I teach all of this in the CDVP2 class, happy to discuss with you – engage in a few hours of consulting if you like.

    Thank you kindly,
    Dan Linstedt

  24. J.P. 2017/09/13 at 3:22 am #

    “If you want your Data Vault 2.0 to be 100% insert compliant REMOVE last seen dates, and REMOVE load end dates.”

    “2. Implement Point In Time and Bridge Tables…”

    The book “Building a scalable data warehouse with Data Vault 2.0” defines [load end date] as a required attribute in satellites (p.116, p.117). SQL examples in the book, including the example for loading a PIT table (p. 599) , make use of load end dates. Is the book I just bought out-dated? Is an erratum available?

  25. Dan Linstedt 2017/09/13 at 4:46 am #

    Hi JP,

    Thanks for buying our book. It was published over 3 years ago. 90% of what is stated is still valid. There are only minor changes along the way. The changes are taught in class (CDVP2 – Certified Data Vault 2.0 Practitioner). Unfortunately the publisher has not given us any sort of erratum pages to provide the changes to the public. It is a good idea, so I will try and start one here on this site.

    That said, anything I publish on this blog that contradicts the book should be considered correct and valid going forward. You can use Load End Dates – but they are officially dead and no longer applied, because of volume constraint issues and real-time loading issues.

    Hope this helps,
    Dan Linstedt

  26. Matt E 2017/10/18 at 6:30 pm #

    Hi Dan

    I would be grateful if you could just clarify something for me in practical terms. I think it’s in a similar vein to Steve’s question previously.

    Obviously the goal is to create an INSERT ONLY architecture, and I’m fully on-board with no end dates for satellites.

    However, the bit I’m slightly confused about is with regards to PIT tables, as there seems to be a conflict in some of the replies above. On the one hand you say to use LEAD/LAG functions to generate the appropriate end dates virtually when required, as this would allow inserts only into the PIT table after each load (I presume for ease of use in a suitable view over the PIT table); but then you also state about materialising these end dates in the PIT tables, which would require UPDATES to the PIT tables on the addition of new records, going against the INSERT ONLY principle.

    “Now as for PITS – and BRIDGE loads: use OLAP and CTE functions to calculate spans / begin and ends, etc.. that are Business Driven, and put those in to physical columns in the PIT and BRIDGE structures”

    I just wondered whether you could clarify what and how is the correct approach

    Many Thanks

  27. Dan Linstedt 2017/10/29 at 6:22 pm #

    No. If you materialize the PIT and BRIDGE, they still remain insert only. PIT and Bridge house read-only, insert-only snapshots of data AS OF a given date (the snapshot date). There will never be updates against the PIT or Bridge tables. The PITs and Bridges CAN be made to be incremental, BUT then there is a need to balance their usefulness against the size of data, against the power of the platform, and against the performance of updates. Even then, a delete / insert combination can be faster than an update – especially if compression is turned on.

    Happy to help you out if you think it would be useful.

    Hope this helps,
    Dan Linstedt

  28. Madina 2018/02/07 at 3:54 pm #

    When you refer to “Record Source Tracking” in your book, do you mean the “Record Tracking Satellites” from Chapter 5?

  29. Dan Linstedt 2018/02/12 at 9:50 am #


  30. Joe 2018/05/02 at 2:26 pm #


    I’ve taken the advice on this thread to heart and have abandoned the last_seen/load_end_dts in Satellite tables.

    I have a design question with regard to effectivity sats, though. Simply: can record-tracking tables serve in place of effectivity sats? After removing load_end_dates, I realize now that a candidate effectivity sat has only two columns (link_key and load_dtm), which is very unsettling.

    Use case: there is an order table with a FK to salesperson. A salesperson can have many orders, but each order can belong to only 1 salesperson. After an audit, the sales manager changes the salesperson on the order. This results in two link records (LinkSalesPersonOrder). However, let’s say the original salesperson appeals the manager’s decision, which results in the order being assigned back to the original salesperson. The link still has two records.

    In the past, I would use an effectivity sat (with load_end_dt) that would hang off LinkSalesPersonOrder. However, if I take away load_end_dt then I have an effectivity sat with only the link key and load_dtm, which will work, but sure doesn’t look elegant.

    Should a record tracking sat be created on a link? I suppose it would have a structure like:

    LinkSalesOrderKey char(32)
    LoadDtm datetime2
    Appearance bit


  31. Dan Linstedt 2018/05/03 at 9:40 am #

    Hi Joe,

    Thanks for the comments.. just a few notes…

    (for others): We teach all these concepts and more in CDVP2 class (Certified Data Vault 2.0 Practitioner course)

    1) You *can* compute load end dates into PITs and Bridges if you really want them. But PITs and Bridges are also 100% insert driven.
    2) You *can* leave “end-dates” in the Record Source Tracking, BUT you always INSERT a new record when the “end-date” is populated (this way you have two records: one with a NULL (or future) end-date, and another with an actual end-date). You have to adjust your queries to work with this, unfortunately – work with the “latest” record.
    3) Load dates and load end dates are not the right date combination for this type of data span. In fact, there should be applied dates, or business begin / business end, or transaction begin / transaction end dates (all generated by source systems). This can only be explained in an hour-long lecture on real-time load dates and resolution.

    Record Source Tracking can be created on a Link; it would function as you have defined it. BUT I only insert a record in Record Source Tracking WHEN the key appears on the source. So the appearance bit flag would work, but it would require additional records and additional processing to get “right”. I will attempt to create a micro-learning lesson on this topic going forward.

    Record Source Tracking is truly all about the key, or group of keys. Nothing else – usually doesn’t have any relation to effectivity… although I suppose you *could* combine the two if you really wanted.

    Hope this helps a bit,
    Dan L

  32. Jeroen Kuiper 2018/07/10 at 1:01 pm #

    Hi Dan,

    On 2017/04/13 you answered a question about driving keys and mentioned that the satellite should be moved to the Hub. You mentioned that you would post on this soon, but if you did, I missed your post.
    I did it a bit differently, but the structure of the resulting satellite is the same, if I understand your answer above correctly. I started with the regular (effectivity) satellite on the link table. The next step was adding the hash key for the driving key to the (effectivity) satellite, to avoid a lookup to the link table. The satellite now contains the link hash key, the hub hash key of the driving key, and the start date of the relationship.
    In the logical data model you can define unique constraints on both the link key + start date and the hub key + start date. The load pattern does become a hub load pattern, but instead of the hash diff comparison you compare the link hash key between stage and satellite. Correct?

    Jeroen Kuiper

  33. Dan Linstedt 2018/07/10 at 1:34 pm #

    Hi Jeroen,
    It is not allowed to have a double parent table for a single satellite. This is a bad modeling practice – whether it’s done in 3rd normal form, dimensional modeling, or Data Vault. There is no data modeling standard that allows a Satellite to have both the Hub and the Link hash key, and no – this is not an acceptable practice. There are other ways, but I haven’t posted on this.

    Sorry about that. This model will fail a DV model assessment. To “avoid lookups”, use the Point-in-Time and Bridge tables on the way out of the Raw Data Vault. Somewhere you have to execute the joins – there is no way around it.


  34. Jeroen Kuiper 2018/07/11 at 3:48 am #

    Hi Dan,

    I understand your comment on the double parent for a single satellite. So instead of adding the hub key to the (effectivity) satellite it is going to replace the link key, causing the satellite to move to the hub table.

