
Joins, Natural Keys, Hashing MPP and #datavault

i’ve written about all of this on my blog numerous times over the past 6 years.  you can see and read a compendium or collection of the articles here.  that said, there are still confusing statements being made about how all of this actually works.  in this entry, i will take another look at this space – specifically for mpp systems, and even more specifically for teradata.  i hope to clear up any misconceptions about how this works, why it works, when it works, and when it doesn’t work in this particular environment.  this will be a highly focused article on the teradata relational database engine as it is available today (as of november 2016).

before i get going, i need to acknowledge the folks who have helped, inspired, and taught me most of these components along the way.  you see, with data vault 2.0 i truly do stand on the shoulders of giants, and have leveraged many best practices that others (who came before me) developed and proved successful.

i first want to thank my friends at teradata for a) supporting me all these years, and b) providing me with training in the early days so that i understand how teradata works.  i also want to thank all my customers that work with teradata, and all the wonderful dba’s who have shared their knowledge with me.  i’ve met and chatted with folks who worked for walmart, coca-cola, aer lingus, and more – all teradata dba’s at the time.  i also want to thank folks like stephen brobst, for sharing his vast and extensive knowledge with me on big data solutions, mpp environments, and parallel query processing.  i want to thank sanjay pande for assisting me with my understanding of hadoop, as well as my customer, commonwealth bank, for implementing dv2.0 with hashing on teradata and utilizing the cloudera hadoop platform along the way.  i want to thank my original team members at the department of defense, who helped me understand the nature of true real-time and i.o.t. (machine-generated data, as it was known back then) in the late 1990’s.

undoubtedly i have made a few errors in the statements / claims below.  if you wish, and have accurate citations, please feel free to correct me (include the citation), and i’ll be happy to take the correction on board.

data vault 2.0 …

  • data vault 2.0 states: must use hashing as a primary key
  • data vault 2.0 does not state which hashing function you are required to use
  • data vault 2.0 does not tell you how to solve hashing collisions, only that you must decide on a collision strategy that’s best for your environment
  • data vault 2.0 states: hashing solves parallel heterogeneous system loads in large scale-out operations, spread out over multiple servers and quite possibly across the globe.  the hashing then allows the data to be “late-bound” (i.e., joined) after being inserted, without any inter-dependency on “looking up parent records first.”
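to make the “late-bound” idea concrete, here is a minimal sketch in python (my illustration language of choice here – nothing about these particular names or keys is part of the standard) showing two completely independent load processes computing the same hub key from the same business key, with no parent lookup:

```python
import hashlib

def hash_key(business_key: str) -> str:
    """derive a deterministic hub key from a business key (md5 hex here)."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# process a (say, loading a hub on server 1) and process b (loading a
# satellite on server 2) never talk to each other, yet they produce the
# exact same key from the row data each one already has in hand:
hub_key = hash_key("CUST-1001")
sat_key = hash_key("CUST-1001")
assert hub_key == sat_key  # rows join later ("late-bound") on this value
```

because the key is computed from the row itself, neither process waits on the other – which is the whole point of removing the parent-lookup dependency.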


  • mpp states: the best way to leverage an mpp scale platform is to evenly distribute data across all available compute resources
  • mpp and mathematics state: to solve different queries over mpp and make them performant, the architect or designer must properly co-locate data on the shared-nothing nodes.  this, in turn, dictates the following: it is near impossible to achieve maximum performance for 100% coverage of all queries all the time without full and complete replication of every “joined object in the query” – meaning: a cartesian product applies to making copies of this data, replicating it across each of the nodes and laying it out in full multiple times, in order to reach 100% query coverage with maximum performance benefit.
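as an illustrative sketch (python, with a hypothetical node count – real mpp engines do this internally), hash-based distribution both spreads rows evenly and co-locates rows that share a distribution key, which is exactly why a join on any *other* key forces redistribution or replication:

```python
import hashlib

NODES = 4  # hypothetical shared-nothing node count

def node_for(key: str) -> int:
    """map a distribution key to a node by hashing, as mpp engines do."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES

# rows sharing a customer key always land on the same node (co-location),
# so a join on the customer key never has to cross nodes:
assert node_for("CUST-1001") == node_for("CUST-1001")

# but a join on a *different* column (say, an order id) has no such
# guarantee, so the engine must redistribute or replicate one side of
# the join -- the trade-off described above.
```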


the goals of data vault 2.0

  • the model goals: to be a cross-platform, ubiquitous, vendor-independent design that provides performant, scalable, high-speed queries and agile build cycles, and that works based on proven mathematical formulas for scale-out distributed data systems.
  • in addition, the design’s goals are to be repeatable, consistent, redundant, and fault tolerant.
  • the architecture’s goals are to be de-coupled, so that as pieces of the architecture become obsolete, they can be shut down (such as a staging area no longer needed by a true real-time system).
  • the process goals: to allow high speed, parallel load and query operations to be constructed that will not need re-engineering based on relative size of data, nor latency of arrival changes (this is one part where other data warehousing methodologies actually fall down).

that said:

  • in theory, only a perfect hash can actually produce unique values without collisions.  however, the “proofs” that these hashes are actually “perfect” are still running trials, looking for collisions throughout their data sets (these trials may be complete by now).  regardless, perfect hashes are not yet ubiquitously available across all platforms (oracle, teradata, mysql, sqlserver, aster, hadoop, hbase, informatica, data stage, ab-initio, perl, php, c++, c#, .net, assembly, etc…)
  • can a perfect hash be used in data vault 2.0? yes – if you are willing to implement the function on your own according to its respective rfc (some downloadable code works; other code is riddled with bugs).  even sha-1, which is not a perfect hash, has been shown to exhibit buggy behavior in oracle 11g (ask kent graziano about this one).
  • why then md5 for dv2.0?  because it is the most ubiquitous, most well-rounded, most tested, and most widely available hashing algorithm in nearly every environment.  is it the best solution? no.  it is a compromise – but again, i urge you to re-read the statements at the top: i do not specify which hash function you have to use for dv2.0 compliance, just that you must use a hash as a primary key.
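one practical note worth sketching: for the hash to match across heterogeneous platforms, every platform must apply the same normalization rules before hashing.  the specific rules below (trim, uppercase, fixed delimiter between multi-part keys) are illustrative choices of mine for this example, not something the standard mandates – what the standard requires is only that you pick one set of rules and apply it everywhere:

```python
import hashlib

def dv_hash_key(*business_key_parts: str, delimiter: str = ";") -> str:
    """compute an md5 hub/link key from one or more business key parts.

    the normalization here (strip whitespace, uppercase, join with a fixed
    delimiter) is an example choice -- what matters is that every platform
    computing the key applies exactly the same rules and character set.
    """
    normalized = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# the same logical key, arriving with different casing / padding from two
# different source systems, still hashes to a single value:
assert dv_hash_key(" cust-1001 ") == dv_hash_key("CUST-1001")
```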

the data vault 2.0 claim:

  • the only way to achieve true parallelism across multiple heterogeneous platforms during load, that will scale out without a “lookup bottleneck” is to compute a hash based on the available row data at the time the row is being processed.
  • i welcome any mathematician (or otherwise) to find a proof that shows this is an incorrect assumption, and thereby arrive at a mathematical formula which provides the functionality i speak of – without using any sort of hashing.  at that point, i will say: the mathematician is right, we have a new standard; let’s test it and vet it, and see if it qualifies to become the new standard for the data vault solution.  any mathematician who can solve this problem can also solve the mpp co-location problem in an algorithmic fashion – and would hold an incredible piece of potentially valuable intellectual property.

more facts:

  • sequence numbers and sequences – no matter what the platform – require lookups to parent records in order to establish relationships to children.  this very simple fact disqualifies sequences from ever being the proper choice to meet the demands of big-scale systems and/or high-speed real-time ingestion without bottlenecks.
  • that said: teradata never said sequences solve these problems.  the magic that seems to be escaping everyone (if there is any magic) is that teradata hashes the primary index value.  that primary index value might as well be a natural key.  why?  because teradata does not use the natural key value in joins (if and when it is specified as the primary index) – it hashes the values first, then finds the amps, distributes the logic, and goes straight to the disk blocks.  don’t believe me?  read the teradata documentation about how primary indexing, hashing, and natural keys work.  what i specifically mean is: when teradata goes to disk to get a row to join, it uses the row hash id – not the physical value.
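the contrast between the two bullets above can be sketched in a few lines of python (hypothetical table structures, purely for illustration): a sequence-keyed child load must ask the parent table for the surrogate, while a hash-keyed child load computes its key locally from the row it is already holding:

```python
import hashlib

# with sequences, the child load must first look up the parent's surrogate
# key -- a lookup that serializes the child load behind the parent load:
parent_table = {"CUST-1001": 42}  # business key -> sequence, already loaded

def load_child_with_sequence(business_key: str) -> int:
    # raises KeyError (i.e., the load stalls) if the parent isn't there yet
    return parent_table[business_key]

# with a hash key, the child load derives the key from the data in the row
# being processed -- no parent dependency, no lookup, no ordering constraint:
def load_child_with_hash(business_key: str) -> str:
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()
```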

the big question…

why then does teradata say: use sequences or natural keys instead of the dv2 suggested hashing?

  • the answer? sequences in teradata function only as unique row identifiers.  that’s right – they do not represent the primary index, just the primary key – the unique value of a row.

which leads to the next big question:

why then do all the teradata models i’ve seen (from teradata) have parent child relationships expressed with sequence numbers as primary keys?

  • great question – why don’t you ask teradata why they do that.  doing so does indeed break the fundamental principle of data vault 2.0 that i stated earlier: massively parallel, heterogeneous, completely independent load cycles direct to the data vault model.  yes, that’s right – without a staging area.
  • that said, these very same models are currently being sunset by the clients that i visit, due to their high levels of parent-child dependencies – which end up forcing the organization to incur ever-increasing costs and delayed maintenance cycles due to cascading parent-child impacts.
  • my opinion on this matter is this: sequences in teradata work for loads, for several reasons.
    1. sequences are assigned in blocks to the nodes / amps / compute units
    2. sequences in teradata are only guaranteed to be unique, not to be in order, and not guaranteed to be fully sequential (they are allowed to have “holes”)
    3. loads to teradata (for most big data solutions) require fastload (sometimes multi-load will suffice).  once the data is in teradata, sequences are assigned as parallel loading to the target tables takes place using teradata sql procedural language, typically in block-style execution.  however, it all happens within the teradata environment.  this is neither cross-platform parallelism nor heterogeneous parallelism (it breaks the design goal of data vault 2.0).
  • or the alternative: why do some teradata models work perfectly well with natural or business key values?  again – because the hashing of the value is done internally, and the hash itself is used for the join, not the natural key, not the value of the data.
  • the list continues from here, but i have already bored you to tears with the technical details of how teradata works.

i’ve spent a good portion of this blog talking about the loading process. now before i leave this subject, let me just say this:

the last thing you want when loading a hadoop solution is to a) wait for some relational database to first insert the parent record, b) have to look up that parent key, and c) assign that parent key to a “child document”.  this would defeat the entire purpose of ingesting data into hadoop in the first place.

time to move on, let’s talk about querying, hashing and sequences for a minute.

  • myth: hashing rows slows down query engines.  fact: databases (outside of teradata) simply aren’t properly optimized or configured to handle it; their optimizers don’t understand mpp, big data, and data co-location problems (or simply aren’t tuned yet for these situations).  fact: if the actual hashing process and hash data were truly to blame for the slowness of the query, then engines like teradata (and yes, even hadoop) would never have utilized such a system in the first place.
  • myth: sequences are always faster in joins than hashes.  outside teradata, this is most likely true, because the hash value is longer than a sequence in bits/bytes.  fact: teradata never actually joins on the sequence number (see my statements above about how teradata actually constructs its join logic).  read here:  – it covers the join techniques, but always refers back to the primary index.
  • but but but… in my <oracle, sqlserver, db2 udb, mysql> environment, my results actually show a slowdown when using hashes to join… yet you say it’s a myth; where’s the disconnect?  the issue, again: these database engines are not equipped / tuned for mpp.  their optimizers use literal values stored in index trees to perform the join matching, where teradata uses a computed hash value – so of course it will be slower in your environment.  we need to start asking a different question!

what is the different question?

how can i get dv2 to be performance driven with hashing and joins on these other platforms?

  • answer: point-in-time tables, link tables, and bridge tables.  if you do not understand these structures, their purpose, how they work, and why – then you will not have much success creating performant joins and getting data out of the data vault in a hash-based model.  these structures provide equi-joins, allow for virtualization, and let query optimizers in the “traditional platforms” utilize star-join techniques developed for star-schema-based models.  these structures (when designed properly) offer table elimination, index optimization, and data co-location (similar to teradata’s idea of a join index, but much less restrictive) on mpp environments.
  • bottom line?  hashes let you join from within a relational database out to a nosql engine (as long as the hashes are computed the same way, with the same character sets and same rules), on-demand or at-will.  sequences cannot achieve this without “looking up” the sequence and “copying it to” the child record living in the nosql engine.
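that cross-engine join can be sketched in python (the dict and document list below are stand-ins i invented for a relational extract and a nosql collection – any real system would supply its own): both sides key their records with the same computed hash, so the join works with no sequence lookup and no key copying between systems:

```python
import hashlib

def hash_key(bk: str) -> str:
    """same hash, same character set, same normalization rules on both sides."""
    return hashlib.md5(bk.strip().upper().encode("utf-8")).hexdigest()

# relational side: hub rows keyed by the computed hash (hypothetical extract)
hub_rows = {hash_key("CUST-1001"): {"bk": "CUST-1001"}}

# nosql side: documents whose keys were computed independently, at ingest
# time, from the same business key arriving in a different casing:
documents = [{"_id": hash_key("cust-1001"), "order_total": 99.50}]

# the at-will join across engines -- a simple key match, no lookups:
joined = [(hub_rows[d["_id"]]["bk"], d["order_total"])
          for d in documents if d["_id"] in hub_rows]
```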

wait, should i use hashes in teradata for dv2 or not?

  • for teradata, the answer sadly is: it depends.  my question to you is: is a heterogeneous, highly scalable, global solution what you are trying to build – without inter-dependencies?  do you have a business need to join teradata data to another platform on-demand / at-will (again, without the dependency of having to copy or move a sequence number to the other environment)?  if the answer to these questions is yes, then you have no choice but to use dv2-style hashing as your primary key (not necessarily your primary index).

if the answer to the above questions is no, then i ask you this: do you have massive amounts of data bottlenecking your batch load process today?  can you afford to have your loading processes stay this way?  do your lookups take too long?  if the answer is yes, then you have two possible solutions:

  1. choose to use sequences, but move to a 100% real-time feed and get rid of all the batch processing entirely.  that said, this solution will only scale for a while – until you have too many parallel real-time feeds loading data to the same targets at the same time, all needing that “dependent lookup”.
  2. switch to dv2.0 hashing and call it a day.  increase your hardware if you need added performance, or switch platforms to one that deals with this easily, or push the vendor of said rdbms to innovate their optimizers and finally “get with mpp properly” (which i’m happy to say microsoft has just accomplished with their latest release of sqlserver data warehouse edition).

so at the end of the day (i’ve blogged on this before): can you use sequences in teradata for data vault 2.0 and be ok?  technically, yes.  methodologically speaking: no.  it breaks the standards of scalability and heterogeneous parallel systems.  the same goes for “natural keys” in teradata.  why only in teradata?  why not in oracle or sqlserver with sequences?  because neither of these database engines (exception: sqlserver 2016 data warehouse mpp edition) actually implements an mpp shared-nothing environment, leading to bottlenecked loading processes over big data “in a single cycle” and/or massive amounts of low-latency data arrivals.

take your pick.  consider yourself warned of the consequences of sequences in any system.  end of the day?  today you don’t have big data during your load cycle, so you choose sequences.  tomorrow – when big data arrives – you will be forced to re-engineer; it is at that point that sequences will fail.  or the alternative: today you don’t have a heterogeneous or hybrid solution to deal with, so you choose sequences; tomorrow – voila – you have to make the data “connect” to hadoop, or hive, or hbase… again, at this point sequences bottleneck and fail, and you will be forced to re-engineer your solution.

please remember that most of the customers i help are building global, enterprise bi solutions that handle (on average) 250 terabytes all the way to 3 petabytes in size.  i am trying my best to provide you with a road-map so that the system you build today will work tomorrow without re-engineering!

thank-you, and i hope this helps,

(c) copyright 2016, dan linstedt, all rights reserved.


4 Responses to “Joins, Natural Keys, Hashing MPP and #datavault”

  1. Kent Graziano 2016/11/18 at 12:02 pm #

    Thanks for the post Dan. I know a lot of folks keep bumping up against this question.

    For the non-Teradata folks I will add that I have used MD5 hashes as PKs for Hubs and Links (and stage tables!) on both Oracle and SQL Server. Yes, the join performance is a bit slower than using numeric sequences (we did actually run tests and timings), BUT the benefit of being able to load multiple objects in parallel, using calculated MD5 PKs, outweighed the join performance hit because of the ability to load faster. Regardless, in both cases we were able to virtualize the reporting layer (mostly facts and dimensions) using simple views to no ill effect. Meaning the reports were plenty fast even though all the joins used char(32) columns. So both systems are positioned to expand, when/if needed, to utilize big data/Hadoop/NoSQL.

    In other words, we future-proofed the architecture by sticking to the DV 2.0 standards.

  2. Dan Linstedt 2016/11/18 at 12:29 pm #

    Hi Kent, thanks for the feedback… Actually, to assist performance further in Oracle and SQL Server, you can now switch from a CHAR(32) to a BINARY(16) in SQL Server and a RAW(16) in Oracle – yes, each platform allows this, and yes, it can be made a Primary Key, and it works so much better than a CHAR(32). This is documented in my Hashing Document that is only available to students in my CDVP2 course.
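    [Editor's note: the CHAR(32)-to-BINARY(16) switch Dan describes amounts to storing the digest's 16 raw bytes instead of its 32-character hex spelling. A quick Python sketch of the same conversion, purely for illustration:]

    ```python
    import hashlib

    hex_key = hashlib.md5(b"CUST-1001").hexdigest()  # 32-char hex string
    bin_key = bytes.fromhex(hex_key)                 # same value in 16 raw bytes

    assert len(hex_key) == 32
    assert len(bin_key) == 16
    # in sql terms this corresponds to binary(16) / raw(16) instead of
    # char(32) -- half the width on every key column and every join.
    ```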

  3. Max 2018/02/06 at 7:55 pm #

    I hope Dan is still reading the comments on old posts 🙂

    There is almost nothing on the internet about implementing DV 2.0 on Hadoop and Hive.
    What about this? Has anyone tried to build a DV on Hive and what is the result?

  4. Dan Linstedt 2018/02/07 at 7:57 am #

    Hi Max,

    This is taught in my CDVP2 Classes, we do talk about this, and I have a number of customers building Data Vault on Hive and on a number of Hadoop Platform Offerings. Come to the class, and find out how we make this work.

    Dan Linstedt
