
Updates in Data Vault are DEAD! Finally

Hello everyone, I am bringing you exciting news: if you want your Data Vault 2.0 to be 100% insert compliant (as it was in the beginning, way back in 2001), then REMOVE last seen dates and REMOVE load end dates.

I’m here to tell you that if you aren’t focused on an insert-only architecture for your Data Vault 2.0, then you aren’t focused in the right place. Due to the sheer scale (volume) and arrival speed (velocity) of data, your Data Vault 2.0 model should be highly focused on a 100% insert-only architecture.

Last seen dates were always a kludge, a crutch, to help developers deal with “full table dumps” and detect data that had disappeared.  Load end dates were always a kludge, a crutch, to help developers issue BETWEEN clauses on query access.

Well, I’m happy to say: there is a way forward without either of these two kludgy attributes.

  1. Implement Record Source Tracking as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, OR as defined in my CDVP2 class.   What???  You’re not yet certified?  Why?  What’s stopping you?  I am offering two additional courses this year; you can sign up for them at: http://DataVaultCertification.com
  2. Implement Point In Time and Bridge Tables – AND DO IT PROPERLY!!!  Again, as defined in my book, Building a Scalable Data Warehouse with Data Vault 2.0, OR in my CDVP2 class (see the sketch below).
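
To make the two structures above concrete, here is a minimal sketch of what an insert-only Satellite (note: no Load End Date column) and a simple Point In Time table can look like. All table and column names are hypothetical illustrations, ANSI-style SQL is assumed, and types and syntax should be adjusted to your platform; the book and the CDVP2 class remain the definitive references.

    CREATE TABLE sat_customer_details (
        customer_hk    CHAR(32)      NOT NULL,  -- hash key of hub_customer
        load_date      TIMESTAMP     NOT NULL,  -- system load date, insert only
        record_source  VARCHAR(100)  NOT NULL,
        hash_diff      CHAR(32)      NOT NULL,  -- change-detection hash of the payload
        customer_name  VARCHAR(200),
        address        VARCHAR(500),
        PRIMARY KEY (customer_hk, load_date)    -- note: no load_end_date column
    );

    CREATE TABLE pit_customer (
        customer_hk       CHAR(32)   NOT NULL,  -- hub hash key
        snapshot_date     DATE       NOT NULL,  -- one row per key per snapshot
        sat_details_ldts  TIMESTAMP  NOT NULL,  -- load_date of the Satellite row current
                                                -- at the snapshot (ghost date if none)
        PRIMARY KEY (customer_hk, snapshot_date)
    );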

By the way, the benefits of properly implementing these two paradigms far outweigh the costs of issuing “updates” against the Data Vault model.  Furthermore, these two structures move your entire EDW DV2 into the 21st century and make it 100% insert-only compliant.

Moving to a 100% insert-only architecture is currently (and into the future) the best thing you can do for your EDW DV2.

If you have questions or comments, feel free to post them below.

Thank-you kindly,
(C) Dan Linstedt, 2016 all rights reserved


25 Responses to “Updates in Data Vault are DEAD! Finally”

  1. Kent Graziano 2016/08/09 at 9:17 am #

    Totally agree Dan! An “insert only” architecture is the only way to go these days if you are at all concerned about Volume & Velocity of inbound data streams. Glad to see this is now “official”

  2. Dan Linstedt 2016/08/09 at 9:49 am #

    Hi Kent, thanks for the feedback. Yep, it’s finally official. I will be removing all annotations to these items from the CDVP2 training class. Well, I will still teach end-dates, but only as a secondary option, and I will list all the risks (which are many if one uses compression in the database). What people need to now understand is: proper use of PIT and Bridge tables, which I will blog on shortly. Thanks for the support, Dan

  3. Andreas 2016/08/10 at 12:53 am #

    Totally agree.

    1.) End dating satellites = updates are too slow for huge data volumes. These updates make the Raw Data Vault LOAD JUST UNNECESSARILY LONGER
    2.) Use max(loaddate) for each query on satellites instead? No! SQL statements are complex, execution times are bad and explain plans are terrible! PERFORMANCE TUNING NIGHTMARE!

  4. Dan Linstedt 2016/08/10 at 5:33 am #

    Andreas,
    If you are using max(loaddate) to produce data for end-user queries, then you do not understand PIT and Bridge tables and their function. However, that said: you only have two choices:
    1) use Load and Load End Dates to build the PIT & Bridge Images
    OR
    2) use LEAD OLAP functions (as Kent Graziano stated in his blog entries) to BUILD the PIT and Bridge images.

    Once the PIT and Bridge images are built, the end-user AD-HOC queries should be equal joins direct to the Satellite (not needing end dates or nested select max(loaddate) queries anymore). If you cannot get the equal join to work against the PIT or Bridge, then either a) the PIT and Bridge have not been properly set up with ghost records, or b) the PIT and Bridge have not been properly designed.

    Hope this helps,
    Dan
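
To illustrate option 2 above, here is a minimal sketch of deriving end-of-validity with a LEAD window function at PIT/Bridge build time, so nothing ever needs to be updated in the Satellite itself. Table and column names are hypothetical (matching the earlier sketch) and ANSI-style window-function SQL is assumed.

    SELECT s.customer_hk,
           s.load_date,
           LEAD(s.load_date) OVER (PARTITION BY s.customer_hk
                                   ORDER BY s.load_date) AS derived_load_end_date
    FROM   sat_customer_details s;
    -- A NULL derived_load_end_date marks the currently active row; COALESCE it to a
    -- high date (e.g. 9999-12-31) when materializing the spans into the PIT or Bridge.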

  5. Andreas 2016/08/10 at 11:09 pm #

    Hi Dan,

    sorry for the confusion.

    I wanted to say that using max(loaddate) on SAT tables is cumbersome: SQL statements are complex, execution times are bad and explain plans are terrible.

    Using PIT tables in joins to SAT tables instead is the faster approach: the join conditions are easier, and ghost records make it possible to avoid outer joins.
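
To illustrate the equal-join point, here is a minimal sketch of an end-user query against a PIT table like the one sketched earlier. With ghost records in place (one dummy row per Satellite at the ghost load date), this stays an inner join with no MAX() sub-selects and no BETWEEN clauses. Names are hypothetical and ANSI-style SQL is assumed.

    SELECT p.customer_hk,
           p.snapshot_date,
           s.customer_name,
           s.address
    FROM   pit_customer          p
    JOIN   sat_customer_details  s
           ON  s.customer_hk = p.customer_hk
           AND s.load_date   = p.sat_details_ldts   -- equal join, no range predicate
    WHERE  p.snapshot_date = '2016-08-10';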

  6. Dave 2016/09/13 at 8:12 am #

    Dan,
    does this mean that the handling of Driving Keys is now done in PIT tables (i.e. no need for the ‘Load End Date’ method described on pg. 121 of “Building a Scalable Data Warehouse with Data Vault 2.0”)?

  7. Eric 2016/10/03 at 6:06 am #

    Hey guys,

    I have been doing so for the last 1.5 years! Cool that it is finally official; that way I don’t have to argue to get my point across 😉

    Cheerz,

    E

  8. Evgeniy 2016/11/13 at 9:29 pm #

    Hi Dan,

    Does this mean that we will not be able to establish in Raw Vault and beyond (Business Vault / Data Marts) that the record does not appear in the source system anymore?

    Are you saying that there are no legitimate business cases that require this information?

    Regards,
    Evgeniy

  9. Dan Linstedt 2016/11/14 at 3:52 am #

    Hi Evgeniy,
    I am not sure I understand your question, but I will attempt to guess. The point here is this: end dates are not required as a part of the architecture in order to make your data auditable. End dates were NEVER required; more specifically, Load End Dates. I am not referring to “business driven end dates” that come in from the source feeds; I am, however, referring specifically to load end dates. Now, if you are stuffing a load end date with a business-driven source date, then your load pattern is flawed and needs to be corrected.

    Load-End-Dates are system driven values only, and should never be used or seen by business. That said, you can (if you want) calculate the span (between load dates and load end dates) and materialize that span inside of a point-in-time or bridge table (this is the appropriate / proper design technique).

    I cover all of this and more in my CDVP2 course.

    Thanks,
    Dan Linstedt

  10. Evgeniy 2016/11/14 at 6:35 pm #

    Hi Dan,

    Thank you for the detailed response – let me explain in more detail…

    Unless you capture the load end date of a satellite record in some way in the Raw Vault, e.g. using a Load Effective From / Load Effective To model (example of a satellite table below)

    CREATE TABLE [Sat Customer Details] ([Customer Hash Key], Load_Effective_From, Load_Effective_To, Name, Address, … )

    …or some other format, the downstream layers – Business Vault & Data Marts – will not know that a satellite record does not appear in the source system anymore.

    E.g. a customer was soft- or hard-deleted in the source system; therefore it should not appear in the data mart customer dimension.

    Are you proposing to capture the Load End Date (or Load Effective To) as a separate insert in a satellite table in the Raw Vault?

    Where will PIT table generation code find out that this satellite record does not exist in the source?

    Thank you,
    Evgeniy

  11. Dan Linstedt 2016/11/15 at 7:42 am #

    Unfortunately you have changed the meaning of the Load End Date. You appear to be using it as a deleted date marker. This is completely incorrect – it overloads the definition of Load End Date and defeats the primary purpose of what the Load End Date truly represents.

    If you want deleted dates or to track deleted satellite data, you should be using an added attribute called “deleted date” in the Sat itself, or leveraging Status Tracking Satellites.

    Load End Dates have one and only one purpose: to show that the satellite record has been superseded by a new delta. This can and should be calculated on the way out – to the PIT and Bridge tables, where the span of the record is captured and stored in a physical manner.

    Soft and hard deletes are handled completely differently (as I suggest here). I cover all of this in deep detail in my CDVP2 class; the book Building a Scalable Data Warehouse with Data Vault 2.0 also covers these aspects, complete with query code.

    Unfortunately you’ve made a mistake in assigning a “deleted date” to be the Load End Date value – which is the wrong way to approach this problem.

    I am sorry about that, but this is the change that needs to be made / corrective action to be applied.

    Hope this helps a bit,
    Dan Linstedt
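
To illustrate the alternative described in the comment above, here is a minimal sketch of a Status Tracking Satellite that records disappearances as inserts instead of overloading the Load End Date. It assumes a complete (full) staged feed in a hypothetical stg_customer table; all names, types, and the record source value are illustrative only, with ANSI-style SQL assumed.

    CREATE TABLE sts_customer (
        customer_hk    CHAR(32)      NOT NULL,
        load_date      TIMESTAMP     NOT NULL,
        record_source  VARCHAR(100)  NOT NULL,
        status         CHAR(1)       NOT NULL,   -- 'I' = appeared, 'D' = disappeared
        PRIMARY KEY (customer_hk, load_date)
    );

    -- Insert a 'D' status row for keys that exist in the Hub but are missing from
    -- today's full feed, unless their latest status row already says 'D'.
    INSERT INTO sts_customer (customer_hk, load_date, record_source, status)
    SELECT h.customer_hk, CURRENT_TIMESTAMP, 'CRM', 'D'        -- 'CRM' is an illustrative source
    FROM   hub_customer h
    LEFT JOIN (SELECT customer_hk, status,
                      ROW_NUMBER() OVER (PARTITION BY customer_hk
                                         ORDER BY load_date DESC) AS rn
               FROM   sts_customer) ls
           ON  ls.customer_hk = h.customer_hk
           AND ls.rn = 1
    WHERE  NOT EXISTS (SELECT 1
                       FROM   stg_customer stg
                       WHERE  stg.customer_hk = h.customer_hk)  -- key gone from the full feed
      AND  COALESCE(ls.status, 'I') <> 'D';                     -- not already flagged as deleted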

  12. Krzysztof Rychlik 2016/12/02 at 10:56 am #

    hi guys,

    I have a question on this and I am hoping someone can shed some light for me. I am new to DV and only recently started implementing it, so apologies up front if my question is silly. 🙂 I feel like I’m missing something very obvious.

    I have been doing updates to my Load End Date fields. I now want to perform inserts only, as recommended. Using the lead/lag functions in SQL Server I am able to display end dates whenever I need them, so I don’t need to perform an actual update to the end date field.

    However, when I do an insert to my satellite I still need to be able to select the most recent record from my satellite and compare it to my record in staging to see if any of the descriptive fields have changed. Before, I was using the where clause WHERE STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD AND SAT.LOADENDDATE IS NULL

    Since I won’t be using the LOADENDDATE field anymore, how can I know which record is the most active one? If I don’t use the most recent one I will be adding the new record from staging over and over again each time the job runs, because there is always going to be a record in the satellite (the old record) where STG.DESCRIPTIVEFIELD <> SAT.DESCRIPTIVEFIELD.

    So the solution is to do inserts with my left outer join, but using the window function. I do my left outer join to a view where I select the most active record from the satellite using the lead/lag function – but wouldn’t that be just as costly as an update? My gut tells me this is not the correct approach.

    What am I missing? what is the correct approach to this?

    KR

  13. Piotr 2016/12/07 at 11:27 am #

    Hello,
    I’m working on a DV model at the moment and would like to get rid of load end dates, but I’m missing a few things – I’m not sure how to implement the PIT refreshes. I understand that I need to refresh the PIT table after each load of the related Hub and its Satellites? Then a process of deleting old snapshots will also be needed – I guess as described in the book – so that the PIT table does not grow to infinity?
    What about models dedicated to real-time loads? I feel it won’t be feasible to add a new snapshot to the PIT table after each insert (or every couple of inserts) to the Hub and/or Satellites. How do I tackle this?
    Cheers,
    Piotr.

  14. Piotr 2016/12/12 at 10:20 am #

    Hello,
    I’ve got another question – as for the last seen dates in the hubs and links – in order to get rid of them, I understand that you recommend using record tracking satellites (chapter 5.3.5). I’ve got a link with 3 million records, and if I understand properly this means 3 million records will be added to the record tracking satellite with each load. I don’t think this is acceptable. Do I understand the idea correctly at all?
    Cheers,
    Piotr.

  15. Steve 2017/03/27 at 8:37 am #

    Hi Dan, I’d like your input regarding Driving Keys as well. We’re actually realising the LoadEndDate serves little purpose for us, since we are “virtually” creating PIT tables, but we wonder how to manage it without updating the loadenddate during the load of links driven by LDK.

    Note that we have a particular use case here, we are loading and presenting data in near-real-time. Our “batches” are run every 15 minutes and we’d like to bring this to a 5 minute refresh rate.
    I know the PIT should be a “snapshot”, but I fear this will be hard at a 15-minute refresh rate when we are speaking of a couple hundred million satellite and link rows.

    I’ll be attending your class in Vermont in April, but hints would be great since we are already rolling into production as I write these lines and the normal load just takes way too long because of the updates.

    Kind regards,
    Steve

  16. Dan Linstedt 2017/04/13 at 7:43 am #

    Hi Steve,

    Surprise, surprise: Driving Keys are a modeling mistake… The reality is much simpler than written into the book. The problem is that the Satellite is in the wrong place. Move the Satellite to the HUB where the Driving Key “lives”, and the problems are all solved. I will be posting on this soon, very soon. Now, as for PIT and BRIDGE loads: use OLAP and CTE functions to calculate spans / begins and ends, etc. that are business driven, and put those into physical columns in the PIT and BRIDGE structures; then the business-user queries can leverage BETWEEN clauses with indexed fields.

    On a 5-minute or 15-minute refresh rate, you should build the load of the PIT into a process that detects only “deltas” to load to the PIT table. In other words, these are some of the advanced techniques for loading PITs that I’ve never documented or offered a class on; the idea is an incremental PIT instead of a full load for every cycle. There are a couple of techniques for that, and I will be discussing them soon. But we can talk about hundred-million-row tables and billion-row tables – I have customers using both successfully. I’d need to know more about your back-end RDBMS before I can comment.

    The class in April will touch on these things, but sadly I don’t have enough time to dive deep into advanced concepts… That said, before or after class we will have some time to do it.

    I think, perhaps, it may be best to secure some of my remote consulting time; we can work this out over the web. Feel free to contact me directly: DanLinstedt ( @ ) gmail.com

    Thanks,
    Dan Linstedt

  17. Dave 2017/06/20 at 9:35 am #

    Hi Dan,

    did you have an answer for Krzysztof Rychlik regarding how to identify the most recent Satellite record against which to compare the HashDiff when we no longer have a LoadEndDate?

  18. Dan Linstedt 2017/06/20 at 10:35 am #

    Hi Krzysztof,

    1) Load End Dates do not, nor have they ever, carried any business value or meaning.
    2) People misuse and incorrectly apply load end dates; they overload the meaning and cause problems for queries and understanding.
    3) You can & should use OLAP and CTE functions (LEAD & LAG) to calculate the load end dates on the way to the PIT and BRIDGE structures IF you truly need them, but in reality you should use the BUSINESS dates that arrive from the source system instead.
    4) Load End Dates CANNOT be maintained in real time; they DO NOT WORK, because streamed transactions, most of the time, DO NOT arrive in the order in which they were created. For this reason alone, Load End Dates are DEAD. Do NOT use them.
    5) I teach ALL of this and more in my CDVP2 class (http://DataVaultCertification.com); you can check it out and possibly attend and get certified if you can. It will answer all your questions and more.
    6) To find the most active record, use the sub-select MAX query, or apply the query against the PIT table to find the most active / most recent record. This is described in detail in Chapter 6 of the book Building a Scalable Data Warehouse with Data Vault 2.0.

    No, left outer joins are not always the proper technique. That depends on the database, the indexing, the partitioning, and so on…

    Hope this helps,
    Dan
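
Tying point 6 back to Krzysztof’s question, here is a minimal sketch of an insert-only delta load that compares staging against the most recent Satellite row via a MAX(load_date) sub-select, with no Load End Date involved. Names are hypothetical (matching the earlier sketch), ANSI-style SQL is assumed, and the staged rows are assumed to arrive with the hash key, load date, and hash diff already computed.

    INSERT INTO sat_customer_details
           (customer_hk, load_date, record_source, hash_diff, customer_name, address)
    SELECT stg.customer_hk, stg.load_date, stg.record_source, stg.hash_diff,
           stg.customer_name, stg.address
    FROM   stg_customer stg
    LEFT JOIN (SELECT customer_hk, MAX(load_date) AS max_load_date   -- latest row per key
               FROM   sat_customer_details
               GROUP BY customer_hk) latest
           ON  latest.customer_hk = stg.customer_hk
    LEFT JOIN sat_customer_details cur
           ON  cur.customer_hk = latest.customer_hk
           AND cur.load_date   = latest.max_load_date
    WHERE  cur.customer_hk IS NULL            -- brand-new key, no Satellite rows yet
       OR  cur.hash_diff  <> stg.hash_diff;   -- payload changed since the latest row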

  19. Dan Linstedt 2017/06/20 at 10:38 am #

    Hi Piotr,

    This is all covered in the CDVP2 class, please plan on attending the Certification course for Data Vault 2.0.

    The short answer is: the PIT and Bridge tables should both be rolling windows of time. Not ALL history belongs in the PIT, and furthermore, you don’t need to keep JUST one PIT; sometimes we create multiple PIT and Bridge tables to meet different business needs.

    Real-time is covered during my CDVP2 course, it is a 2 hour lecture, and contains far too much information to describe the answer here. I am sorry about that. It is definitely feasible, and can be done – we’ve built it in many many many solutions over the past 10 years.

    Thank-you kindly,
    Dan Linstedt
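
As one possible reading of the rolling-window idea: because PIT and Bridge tables are disposable query-assist structures rather than part of the auditable Raw Vault, keeping them to a window can be as simple as purging snapshots that fall outside it. A minimal sketch with hypothetical names and ANSI-style date arithmetic (adjust to your platform):

    -- Keep, for example, the most recent 90 daily snapshots in the PIT.
    DELETE FROM pit_customer
    WHERE  snapshot_date < CURRENT_DATE - INTERVAL '90' DAY;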

  20. Dan Linstedt 2017/06/20 at 10:40 am #

    You understand the base idea, and Record Tracking is feasible as long as you turn on compression. 3 million records really is not that much at all, and with Record Tracking done properly, compression should buy you 85% savings or more. Yes, Last Seen Dates are also DEAD… Anything that introduces an UPDATE / physical DB update is dead, because it does not scale and it does not work properly in real time. This is the nature of what we do, until we have access to new technology like “temporal tables” from Microsoft SQL Server 2016.

    Hope this helps,
    Dan Linstedt
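
To illustrate the Record Tracking idea with a sketch: one narrow row per Link key per load replaces the last-seen-date update, and “last seen” becomes MAX(appearance_date) computed on the way out (for example while building a Bridge). The highly repetitive values are exactly what database compression handles well. All names are hypothetical and ANSI-style SQL is assumed.

    CREATE TABLE rts_customer_order (
        customer_order_hk  CHAR(32)      NOT NULL,  -- link hash key
        appearance_date    DATE          NOT NULL,  -- load in which the key was present
        record_source      VARCHAR(100)  NOT NULL,
        PRIMARY KEY (customer_order_hk, appearance_date)
    );

    INSERT INTO rts_customer_order (customer_order_hk, appearance_date, record_source)
    SELECT DISTINCT stg.customer_order_hk, CURRENT_DATE, 'ORDERS'   -- illustrative source
    FROM   stg_customer_order stg
    WHERE  NOT EXISTS (SELECT 1
                       FROM   rts_customer_order r
                       WHERE  r.customer_order_hk = stg.customer_order_hk
                         AND  r.appearance_date   = CURRENT_DATE);  -- idempotent re-runs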

  21. Dan Linstedt 2017/06/20 at 10:41 am #

    Hi Dave,

    Just posted an answer. Thanks.

    Hope this helps,
    Dan

  22. Eduardo Grosso 2017/08/06 at 11:57 pm #

    Hi Dan,
    It is interesting to see how confusing these end dates are. I believe what most people need is a business version end date for a data entity instance, i.e. the version of the entity ceased to exist at this “end date”. For example, if a customer changed their address, and this is captured in a satellite table depending on the customer hub, and the satellite table contains only addresses, the version of the element of this particular satellite table needs to somehow indicate that it is part of the past.

    Now sometimes there are no “business dates” that will tell you this. Some systems just change the address of the customer, overriding whatever address the customer previously had. Hence we need to fabricate an end date, as in the DW we want to capture the temporal aspect of these changes, although this is not captured on the source. This implies we need to introduce a new data element: the date on which the customer ceased to live at that address. This date, as mentioned, is not coming from the source, but we need to introduce it, either as a “version start date” and/or a “version end date”.

    Also, we need to consider that the “version start date” alone is not enough, as this implies there will be a new version with a new “version start date”, but this may not be true. The customer ceased to live at an address, but we may not have his/her new address; how can we resolve this business problem then, without start and end dates?

  23. Dan Linstedt 2017/08/07 at 8:22 am #

    Thank you for your comment. You are correct in stating that the business needs BUSINESS-driven end dates / start dates. What I will say is this: you are focusing (once again) on the wrong thing. Load Dates and Load End Dates are SYSTEM driven, for the data warehouse only; they are MEANINGLESS to the business (sorry for the emphasis).

    Load End Dates are DEAD, and will remain this way. In fact, NO “physical database update” can survive in the ever-growing data sets we live with, and any attempt to “issue an update statement” in the database will *require* re-engineering and re-design / elimination as the data set grows beyond the platform’s ability to execute in a timely manner. Furthermore, again: load end dates are system driven. Once you load data from multiple time zones to a single warehouse, “load end dates” get confused and become useless. Once you load data to a cloud-driven data warehouse, Load Dates AND load end dates get confused and become useless… So this raises the question: how and when can you / should you track the status of records disappearing off source systems?

    Valid question: the answer is two-part:
    1) ALL ROADS LEAD TO CDC (CHANGE DATA CAPTURE): this is the ONLY sure-fire auditable method for achieving full compliance with GDPR laws and regulations. Don’t have CDC???? The EDW WILL FAIL AN AUDIT, AND WILL NOT BE IN COMPLIANCE WITH THE NEW GDPR LAWS and regulations!!!

    2) Best guess – OK, so reality says it will take years to comply fully with GDPR or to reach complete auditability; alright – so best guess comes in. This means calculations for rows disappearing “have to happen” – we do this using Record Source Tracking Satellites, and we compute physical End Date columns on the way to the Business Vault or the Information Marts.

    I teach all of this in the CDVP2 class; happy to discuss it with you or engage in a few hours of consulting if you like.

    Thank you kindly,
    Dan Linstedt

  24. J.P. 2017/09/13 at 3:22 am #

    “If you want your Data Vault 2.0 to be 100% insert compliant REMOVE last seen dates, and REMOVE load end dates.”

    “2. Implement Point In Time and Bridge Tables…”

    The book “Building a Scalable Data Warehouse with Data Vault 2.0” defines [load end date] as a required attribute in satellites (p. 116, p. 117). SQL examples in the book, including the example for loading a PIT table (p. 599), make use of load end dates. Is the book I just bought outdated? Is an erratum available?

  25. Dan Linstedt 2017/09/13 at 4:46 am #

    Hi JP,

    Thanks for buying our book. It was published over 3 years ago. 90% of what is stated is still valid. There are only minor changes along the way. The changes are taught in class (CDVP2 – Certified Data Vault 2.0 Practitioner). Unfortunately the publisher has not given us any sort of erratum pages to provide the changes to the public. It is a good idea, so I will try and start one here on this site.

    That said, anything I publish on this blog that contradicts the book should be considered correct and valid going forward. You can use Load End Dates – but they are officially dead and no longer applied, because of volume constraint issues and real-time loading issues.

    Hope this helps,
    Dan Linstedt
