Recently I taught an on-site class for a customer, all about Data Vault 2.0. When I got to the point where I shared the template / process for end-dating (updating end dates in place, using a characteristic function), I was told point-blank: “Teradata does not use or have indexes other than the Primary Index.” The claim was that it would therefore do a full table scan every time we wanted to select the rows from a satellite where the end date is NULL and the key has 2 or more “active” rows.
The pseudo-code design for updating is:
- select SAT_HASHKEY, SAT_LOAD_DATE from SAT where SAT_LOAD_END_DATE is NULL (or high date), keeping only hash keys that have at least 2 active rows (that is, 2 or more rows with a NULL or HIGH DATE load end date).
- run a row-over-row characteristic function to compute the end-date
- update the SAT in place.
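To make the row-over-row step concrete, here is a minimal sketch in plain Python (not Teradata SQL). It assumes the active rows have already been selected as (hash key, load date) pairs, that load dates are ISO-formatted strings so they sort chronologically, and that a row's end date is simply the load date of the next row for the same hash key. The function name `compute_end_dates` is hypothetical, made up for illustration.

```python
from collections import defaultdict


def compute_end_dates(active_rows):
    """Return (hash_key, load_date, load_end_date) for rows that need closing.

    active_rows: iterable of (hash_key, load_date) pairs whose load end date
    is currently NULL / high date. Assumption: load_date values sort
    chronologically (e.g. ISO-8601 strings).
    """
    by_key = defaultdict(list)
    for hash_key, load_date in active_rows:
        by_key[hash_key].append(load_date)

    updates = []
    for hash_key, load_dates in by_key.items():
        if len(load_dates) < 2:
            continue  # a single active row is already correct; nothing to end-date
        load_dates.sort()
        # Row over row: each row ends when the next row for the same key begins.
        # The newest row is never paired as "current", so it stays open.
        for current, following in zip(load_dates, load_dates[1:]):
            updates.append((hash_key, current, following))
    return updates
```

The returned tuples are what the final "update the SAT in place" step would apply, matched on hash key and load date. In a database this would more likely be a single windowed query than a client-side loop; the loop is only here to show the logic.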
Well, I felt this statement that Teradata does not use or have indexes might be a little off-base, so I did some research. Here is what I found:
Now, that said: it is very clear to me that Teradata does in fact support additional indexes beyond the primary index (PPI, UPI, and NUPI are all primary indexes). This notion is called a secondary index.
So: in fact, to perform an update in-place against the same satellite I am selecting rows from “for update”, it is possible to avoid complete and full table scans by building a secondary index.
Does that mean it’s the right thing to do? Maybe, maybe not. It depends on the size of the table, how the rows are split across partitions, and the velocity (latency) of the incoming data set. Just like EVERY other RDBMS in the world, secondary indexes are only as useful as their statistics, and their statistics must be kept up to date; otherwise a full table scan WILL result anyhow.
So, if the cost of having a secondary index is too much, or the cost of updating the secondary index stats outweighs the benefit of the END-DATE UPDATE process, then granted, another solution must be found.
An alternative solution (though a bit riskier) is:
- Identify all the rows in the stage which have deltas and will be inserted NEW
- END-DATE (match by Hash Key) old rows in Satellite
- Insert ALL new rows & Delta rows in one shot to the Satellite
The RISK with this approach is restartability: a rerun can potentially load duplicates, and if the process breaks after step 2, a “rollback” may need to be issued to undo the END-DATING of the old rows before Step 3 can insert non-duplicates that have a delta.
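One way to picture that restartability concern is to run steps 2 and 3 as a single unit of work, so a failure in between leaves the satellite untouched. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since it is easy to run anywhere; it says nothing about Teradata's own transaction or locking behavior. The table names `sat` and `stage`, the `hash_diff` column used to detect deltas, and the `fail_before_insert` switch are all assumptions made up for this illustration.

```python
import sqlite3


def load_satellite(conn, load_date, fail_before_insert=False):
    """End-date superseded rows and insert stage deltas as one unit of work.

    Returns True if the load committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Step 1: stage rows that are brand new or whose hash_diff changed.
            deltas = conn.execute(
                """SELECT s.hash_key, s.hash_diff
                   FROM stage s
                   LEFT JOIN sat t
                     ON t.hash_key = s.hash_key AND t.load_end_date IS NULL
                   WHERE t.hash_key IS NULL OR t.hash_diff <> s.hash_diff"""
            ).fetchall()

            # Step 2: end-date the old active rows, matched by hash key.
            conn.executemany(
                "UPDATE sat SET load_end_date = ? "
                "WHERE hash_key = ? AND load_end_date IS NULL",
                [(load_date, key) for key, _ in deltas],
            )

            if fail_before_insert:  # hypothetical switch to simulate a crash
                raise RuntimeError("simulated failure after step 2")

            # Step 3: insert all new and delta rows in one shot.
            conn.executemany(
                "INSERT INTO sat (hash_key, hash_diff, load_date, load_end_date) "
                "VALUES (?, ?, ?, NULL)",
                [(key, diff, load_date) for key, diff in deltas],
            )
        return True
    except RuntimeError:
        return False
```

Because the end-dating and the insert commit together, a rerun after a failure sees the same deltas and cannot leave end-dated rows without their replacements. Without that wrapping, the manual cleanup described above is what remains.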
Anyhow, there are additional risks to approach #2, all of which approach #1 (as the Data Vault 2.0 design standards dictate) has solved.
It is therefore my humble opinion that even on Teradata, even with MASSIVE data sets in your Satellites, secondary indexes and single-pass characteristic functions for “end-dating” rows will continue to be faster and more resilient than any other approach outlined. I do, however, welcome your thoughts. Do you work on Teradata? Have you used Data Vaults on Teradata in a high-volume solution? Let me know what you think. What are the pros and cons of secondary indexes and the update logic I propose?
Hope this helps,