#datavault 2.0 Hashing and Sequences Round 2

once again, the debate rages on.  i’ve now been inundated with requests, demands, and issues around hashing, so i wish to clear the air.  in this entry i’ll dive into sequences (again) and the problems they pose for big data and high-velocity systems.

first, let me get this off my chest:

i never claimed that hashes cause zero collisions.  to my knowledge and understanding, there is currently no such thing as a “perfect hash” that will produce a guaranteed-unique identifier for every distinct input.  people in the community at large seem to forget this, or have amnesia when it comes to this statement, and insist on continuing to argue that i “somewhere / somehow” made this claim.  let me be perfectly clear:  i never said that hashes have zero collisions!!

so please, don’t bring that up again.  i will not spend any further cycles or thought on that particular issue.
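to put the collision risk in perspective, the standard birthday-bound approximation estimates the probability of at least one collision among n random b-bit hashes as roughly n² / 2^(b+1).  a quick sketch (the figures are illustrative estimates, not a guarantee of zero collisions):

```python
# Birthday-bound approximation: probability of at least one collision
# among n_keys uniformly random hash_bits-bit values is roughly
# n^2 / 2^(bits + 1). Illustrative only -- collisions are never impossible.

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Rough birthday-bound estimate of at least one collision."""
    return (n_keys ** 2) / (2 ** (hash_bits + 1))

# Even at 10 billion keys, a 128-bit hash keeps the estimate tiny:
p = collision_probability(10_000_000_000, 128)
print(f"~{p:.3e}")  # on the order of 1e-19
```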

second: you need a collision handling strategy baked into your dv2 implementation paradigm.

i am currently working on a few different alternatives that i will announce at next year’s wwdvc conference.  if you haven’t signed up, you should – it’s the place to be if you are thinking about data vault, or using data vault in your enterprise data warehousing efforts.

third: in order to “build” dv2 compliant models, you must use a hash

no, sequencing is not compliant with data vault 2.0; hashing is required.  in fact, replacing sequences with hash results is _required_ for a model to be data vault 2.0 compliant.

fourth: what are my hash function choices?

there is a list of hash functions here on wikipedia:  http://en.wikipedia.org/wiki/list_of_hash_functions  it is not complete, but it has some decent functions listed.  what i recommend is md5 – why? because it is the most ubiquitously available across platforms, and it has a decently low chance of duplicate collision – especially when the data is salted on the way in.  in reality, what i am suggesting is the use of an algorithm that produces a 128-bit or larger result.   why?  not for cryptographic reasons – no, we are not trying to protect the data from attack; we are merely trying to compute a known value “in-stream during load”.
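a minimal sketch of computing such a hash key on the way in.  the standardization rules here (trim, upper-case, “;” delimiter between key parts) are illustrative conventions of my own choosing, not part of any specification – the important point is that they are applied consistently before hashing:

```python
import hashlib

# Sketch: derive a hash key from one or more business-key parts.
# The trim / upper-case / ";" delimiter rules are assumed conventions.

def hash_key(*business_key_parts: str, algo: str = "md5") -> str:
    """Standardize the business key parts, then hash them."""
    standardized = ";".join(p.strip().upper() for p in business_key_parts)
    h = hashlib.new(algo)  # swap in "sha256" etc. if md5 is not acceptable
    h.update(standardized.encode("utf-8"))
    return h.hexdigest().upper()

print(hash_key("Cust-1001 "))   # whitespace/case variations of the same key...
print(hash_key("cust-1001"))    # ...produce the identical digest
print(hash_key("ORD-77", "2024-01-15", algo="sha256"))  # a 256-bit alternative
```

because the standardization runs before the hash, “Cust-1001 ” and “cust-1001” land on the same key – which is exactly what you want for a business key arriving from multiple source systems.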

if you don’t like or don’t approve of md5, then choose another function – one that you are comfortable with: sha-1, md6, sha-256, spectral, murmur, city, or spooky.

just remember: if you choose a function that is not currently available in your tool set, you have to write the code / extension yourself (which is fine – most of these functions can be downloaded in source form).  but the maintenance is then up to you, not the vendor.

fifth: assign the hash value on the way in to the “staging area”

remember: the staging area might be a hadoop or nosql system, or it might be a relational database.  at this point you have the freedom and the time to check for “duplicate collisions”, and to handle hashing issues before continuing with the loading cycle.
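one way to run that check in staging is to look for the same hash key attached to two *different* standardized business keys – that is the duplicate collision you care about.  a sketch (the row shape and key values here are hypothetical):

```python
from collections import defaultdict

# Sketch: flag "duplicate collisions" in staged data -- the same hash key
# produced by two different standardized business keys. Row layout is assumed.

def find_collisions(staged_rows):
    """staged_rows: iterable of (hash_key, business_key) tuples."""
    seen = defaultdict(set)
    for hk, bk in staged_rows:
        seen[hk].add(bk)
    # Same hash, same business key is a normal re-delivery; same hash with
    # MULTIPLE business keys is a true collision that needs handling.
    return {hk: bks for hk, bks in seen.items() if len(bks) > 1}

rows = [("ABC123", "CUST-1"), ("ABC123", "CUST-1"), ("ABC123", "CUST-2")]
print(find_collisions(rows))  # {'ABC123': {'CUST-1', 'CUST-2'}}
```

in sql terms this is simply a group-by on the hash key with a count(distinct business_key) > 1 filter – cheap enough to run on every staging load.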

sixth: why don’t range partitioned surrogate keys work?

well, the jury is still out on this one, but here is the conundrum: nearly every “data model”, whether it’s a data vault or not, expresses a parent-child relationship.  even with referential integrity shut off / shut down, the process (if using surrogates) still requires the creation of the parent surrogate key before the child row can be inserted.  it also still requires the child row to “look up” / depend on the parent value before the insert can take place.

this is a problem in cross-platform (heterogeneous) environments, and it is also a problem with large volumes arriving at high velocity.  anything that introduces dependencies into the process will slow it down (sometimes significantly).  the fact remains that surrogate key lookups introduce parent-child dependencies into the chain – not to mention increased caching, use of temp areas, sorting and clustering needs, increased cpu load, etc…

hashing is currently the only known technique to allow a child “key” to be computed in parallel to the parent “key”, and be loaded independently of each other.
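to make the independence concrete, here is a sketch: the child (link) key is computed purely from the business keys, so it requires no lookup against the parent hubs – all three keys can be computed and loaded in parallel.  the entity names and the “;” delimiter are illustrative assumptions:

```python
import hashlib

# Sketch: parent (hub) and child (link) hash keys are each derived from
# business keys alone -- no lookup of the parent's key is needed.

def md5_key(*parts: str) -> str:
    """Standardize (trim, upper-case, ';' delimiter) and md5-hash key parts."""
    standardized = ";".join(p.strip().upper() for p in parts)
    return hashlib.md5(standardized.encode("utf-8")).hexdigest().upper()

customer_bk, order_bk = "CUST-1001", "ORD-77"

hub_customer_hk = md5_key(customer_bk)            # parent hub key
hub_order_hk    = md5_key(order_bk)               # parent hub key
link_hk         = md5_key(customer_bk, order_bk)  # child link key, no lookup

# Contrast with sequences: the link row would have to wait for both hub
# sequences to exist, then look them up, before it could be inserted.
print(hub_customer_hk, hub_order_hk, link_hk)
```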

seventh: want to argue these points?  do your homework – look up the mathematics for yourself.

the point is, until the “conundrum” i posed above is solved, surrogates remain (and will continue to remain) a bottleneck in the loading processes of massive data sets arriving at high speed.  if you have a mathematical solution to the conundrum, i imagine it would be very valuable as a mathematical proof made available to the world, and i would encourage you to publish it and patent your solution.  until then, hashing is the only way to truly achieve near-linear scalability.  by the way, range-partitioned surrogate keys (sk’s, sequences) do work – as long as all the data lives on a homogeneous platform (a single database environment).

if you are happy with surrogates and don’t have performance problems, then perhaps you only need the methodology and other components of dv2 – and not the modeling pieces.


hope this helps,

dan linstedt



One Response to “#datavault 2.0 Hashing and Sequences Round 2”

  1. Albert Garcia Diaz 2017/01/04 at 12:58 am #

So most relational databases perform better with INT keys because they can find a value using a binary search (check whether the value is in the first half of all the rows; if so, check again inside the half of that half, etc.), or even a TS following a similar procedure. A hash is an extremely random string of characters, not sorted at all, so the database has to go row by row, through all of them, until it finds the one you are looking for. If instead of hashing we just link through the natural key, then at least when that string is sorted you’ll get to the row much faster, because you skip everything that doesn’t start with x, then everything that doesn’t start with y, and so forth.

So my question is: if we don’t have a 100+ character BK (which is not very common), is there any gain at all in hashing? Shouldn’t we use the normal natural key, so the database will find the value much faster, since it can sort it and eliminate the many rows that don’t correspond to the value you want when scanning?

