#teradata and #datavault 2.0 – thoughts on end dating

recently i taught an on-site class for a customer, all about data vault 2.0.  when i got to the point where i shared the template / process for end-dating (updating end dates in place, using a characteristic function), i was point-blank told: “teradata does not use or have indexes other than the primary index”, and that it would consequently do a full table scan every time we wanted to select from a satellite where the end date is null and 2 or more “active” rows exist for the same key.

the pseudo-code design for the update is:

  1. select sat_hashkey, sat_load_date from the satellite where the load end date is null (or the high date), keeping only hash keys that have at least 2 active rows (i.e. 2 or more rows with a null or high-date load end date).
  2. run a row-over-row characteristic function to compute the end-date
  3. update the sat in place.
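the three steps above can be sketched end-to-end like this, using sqlite3 purely as a stand-in engine (teradata's ordered analytic and update syntax differs, and every table and column name here is invented for illustration):

```python
# illustrative sketch only: sqlite3 stands in for teradata, and all
# table / column names are invented for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE sat_customer (
        hash_key      TEXT,
        load_date     TEXT,
        load_end_date TEXT,   -- NULL means "active"
        payload       TEXT,
        PRIMARY KEY (hash_key, load_date)
    )
""")
cur.executemany("INSERT INTO sat_customer VALUES (?,?,?,?)", [
    ("HK1", "2014-01-01", None, "v1"),   # two active rows for HK1
    ("HK1", "2014-02-01", None, "v2"),
    ("HK2", "2014-01-15", None, "v1"),   # only one active row for HK2
])

# steps 1 + 2: for hash keys with 2 or more active rows, compute each
# row's end date as the load date of the next row for the same key
# (the row-over-row "characteristic function", expressed here with the
# LEAD window function).
updates = cur.execute("""
    WITH multi AS (
        SELECT hash_key FROM sat_customer
        WHERE load_end_date IS NULL
        GROUP BY hash_key HAVING COUNT(*) >= 2
    ),
    dated AS (
        SELECT hash_key, load_date,
               LEAD(load_date) OVER (PARTITION BY hash_key
                                     ORDER BY load_date) AS new_end
        FROM sat_customer
        WHERE hash_key IN (SELECT hash_key FROM multi)
    )
    SELECT hash_key, load_date, new_end FROM dated
    WHERE new_end IS NOT NULL
""").fetchall()

# step 3: update the satellite in place.
cur.executemany(
    "UPDATE sat_customer SET load_end_date = ? "
    "WHERE hash_key = ? AND load_date = ?",
    [(new_end, hk, ld) for hk, ld, new_end in updates])
conn.commit()
```

note that only the older duplicates are touched: HK2’s single active row and HK1’s newest row keep a null end date.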

well, i felt this statement that teradata does not use or have indexes might be a little off-base, so i did some research.  here is what i found:

now, that said: it is very clear to me that teradata does in fact support additional indexes beyond the ppi, upi, and nupi (which are all primary indexes).  this construct is called a secondary index.

so, in fact: to perform an in-place update against the same satellite i am selecting rows from “for update”, it is possible to avoid complete, full table scans by building a secondary index.
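here is a tiny illustration of the access-path difference a secondary index makes, again with sqlite3 as a stand-in engine (on teradata this would be a secondary index via CREATE INDEX plus COLLECT STATISTICS, and the plan output looks different; all names are invented):

```python
# illustrative only: sqlite3 as a stand-in; names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE sat_customer (
    hash_key TEXT, load_date TEXT, load_end_date TEXT,
    PRIMARY KEY (hash_key, load_date))""")

# mostly end-dated history with a few "active" (high date) rows --
# the typical shape of a satellite.
HIGH_DATE = "9999-12-31"
cur.executemany("INSERT INTO sat_customer VALUES (?,?,?)",
    [(f"HK{i}", "2014-01-01", HIGH_DATE if i % 50 == 0 else "2014-02-01")
     for i in range(1000)])

def active_row_plan():
    # returns the plan detail for the "find active rows" select
    return cur.execute("""EXPLAIN QUERY PLAN
        SELECT hash_key FROM sat_customer
        WHERE load_end_date = ?""", (HIGH_DATE,)).fetchall()[-1][-1]

print(active_row_plan())   # full scan, e.g. "SCAN sat_customer"

# a secondary index on the end-date column (with fresh statistics)
# changes the access path for the very same select.
cur.execute("CREATE INDEX sat_end_ix ON sat_customer (load_end_date)")
cur.execute("ANALYZE")     # rough analogue of teradata's COLLECT STATISTICS
print(active_row_plan())   # e.g. "SEARCH sat_customer USING INDEX sat_end_ix ..."
```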

does that mean it’s the right thing to do?  maybe, maybe not.  it depends on the size of the data set in the table, how many rows fall into which partitions, and the velocity (latency) of the incoming data set.   just like every other rdbms in the world, secondary indexes are only as useful as their statistics, and those statistics must be kept up to date; otherwise a full table scan will result anyhow.

so, if the cost of carrying a secondary index is too high, or the cost of updating its statistics outweighs the benefit of the end-date update process, then granted, another solution must be found.

an alternative (yet somewhat riskier) solution is:

  1. identify all the rows in the stage which have deltas and will be inserted as new rows
  2. end-date (match by hash key) old rows in satellite
  3. insert all new rows & delta rows in one shot to the satellite
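this alternative flow can be sketched in the same sqlite3 stand-in (invented names again; on a real platform steps 1–3 belong in one restartable transaction, which is exactly where the risk comes from):

```python
# illustrative only: sqlite3 stands in for teradata; names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE sat   (hash_key TEXT, load_date TEXT,
                        load_end_date TEXT, payload TEXT);
    CREATE TABLE stage (hash_key TEXT, load_date TEXT, payload TEXT);
    INSERT INTO sat   VALUES ('HK1','2014-01-01',NULL,'v1');
    INSERT INTO stage VALUES ('HK1','2014-02-01','v2'),  -- delta for HK1
                             ('HK2','2014-02-01','v1');  -- brand-new key
""")

# step 1: identify stage rows that differ from the current active sat row
cur.execute("""
    CREATE TEMP TABLE deltas AS
    SELECT st.hash_key, st.load_date, st.payload
    FROM stage st
    LEFT JOIN sat s
      ON s.hash_key = st.hash_key AND s.load_end_date IS NULL
    WHERE s.hash_key IS NULL OR s.payload <> st.payload
""")

# step 2: end-date (match by hash key) the old active rows
cur.execute("""
    UPDATE sat SET load_end_date =
        (SELECT d.load_date FROM deltas d WHERE d.hash_key = sat.hash_key)
    WHERE load_end_date IS NULL
      AND hash_key IN (SELECT hash_key FROM deltas)
""")

# step 3: insert all new + delta rows in one shot
cur.execute("""
    INSERT INTO sat (hash_key, load_date, load_end_date, payload)
    SELECT hash_key, load_date, NULL, payload FROM deltas
""")
conn.commit()   # steps 1-3 commit together; a failure after step 2
                # without this atomicity is the rollback risk
```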

the risks with this approach are restartability (potentially loading duplicates), and the fact that if the process breaks after step 2, a “rollback” needs to be issued to remove the newly end-dated rows before step 3 can insert the non-duplicate delta rows.

anyhow, there are additional risks to approach #2, all of which approach #1 (as the data vault 2.0 design standards dictate) has already solved.

it is therefore my humble opinion that, even on teradata, even with massive data sets in your satellites, secondary indexes and single-pass characteristic functions for “end-dating” rows will continue to be faster and more resilient than the other approach outlined.  i do, however, welcome your thoughts.  do you work on teradata?  have you used data vaults on teradata in a high-volume solution?  let me know what you think.  what are the pros and cons of secondary indexes and the update logic i propose?

hope this helps,

dan linstedt


4 Responses to “#teradata and #datavault 2.0 – thoughts on end dating”

  1. Kent Graziano 2014/05/20 at 11:24 pm #

    What about a 3rd option of no updates and create a PIT table to track the current rows?

  2. Dan Linstedt 2014/05/21 at 1:23 am #

    It is quite possible as you suggest Kent, to execute a model design with just a PIT table. However, if all you are doing is tracking the current rows, then anything using the PIT will only get current rows during query time (perhaps I misunderstood your statement). Are you historizing PIT rows?
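    As a hedged sketch of Kent's option (sqlite3 again as a stand-in engine, all names invented): a PIT table can historize "which satellite row was current" per snapshot date, with no updates on the satellite at all.

```python
# Illustrative only: sqlite3 as a stand-in engine; all names invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE sat (hash_key TEXT, load_date TEXT, payload TEXT);
    INSERT INTO sat VALUES ('HK1','2014-01-01','v1'),
                           ('HK1','2014-02-01','v2');
    CREATE TABLE pit (hash_key TEXT, snapshot_date TEXT, sat_load_date TEXT);
""")

def build_pit(snapshot_date):
    # For each hash key, point at the latest satellite row loaded on or
    # before the snapshot date -- no end dates on the satellite needed.
    cur.execute("""
        INSERT INTO pit
        SELECT hash_key, ?, MAX(load_date)
        FROM sat WHERE load_date <= ?
        GROUP BY hash_key
    """, (snapshot_date, snapshot_date))

build_pit("2014-01-15")
build_pit("2014-02-15")

# An equi-join from PIT to SAT retrieves the row current as of a snapshot:
row = cur.execute("""
    SELECT s.payload FROM pit p
    JOIN sat s ON s.hash_key = p.hash_key AND s.load_date = p.sat_load_date
    WHERE p.snapshot_date = '2014-01-15'
""").fetchone()
print(row[0])   # v1
```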

    Anyhow, the point is: in order to execute an update in-place to end-date records, you need a secondary index, and yes, Teradata DOES support this; you just need to be aware of the consequences of implementing one (the same prices are paid on every relational DB).

    Hope this helps,

  3. Linus Hjorth 2015/07/12 at 10:39 pm #

    Wouldn’t use of TD bi-temporality remove most of the headache? It would make step 2 in your alternative solution obsolete.

  4. Dan Linstedt 2015/07/15 at 9:29 am #

    Perhaps. It might solve PART of the problem, but not all of it. I published this post in 2013, before the technology evolved. Remember this too: any time I work with standards I think about applying the architecture and design across multiple platforms. Before I can declare “temporality in DB engines can be utilized”, it has to be available on all platforms. This simply isn’t the case. It’s TERADATA specific. That doesn’t make it bad, it just makes it so it cannot be declared a standard.

    Now that said, an ALL Teradata Shop that wants to LOCK itself in to a technological base, can feel free to use the special features that the RDBMS vendor offers… That’s the danger.

    The other problem? Data is NOT simply bi-temporal. Many data are multi-temporal, and humans just have a hard time thinking in multi-temporal dimensions.

    Hope this helps,
