Once again, the debate rages on. I've been inundated with requests, demands, and issues around hashing, so I wish to clear the air. In this entry I'll dive into sequences (again) and the problems they pose for big data and high-velocity systems.
First, let me get this off my chest:
I never claimed that hashes cause zero collisions. To my knowledge, there is currently no such thing as a "perfect hash" that will produce a guaranteed unique identifier for arbitrary input. People in the community seem to forget this, and insist on arguing that I "somewhere / somehow" made that claim. Let me be perfectly clear: I never said that hashes have zero collisions!
So please, don't bring that up again. I will not spend any further cycles on that particular issue.
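To put the collision point above in rough numbers, here is a back-of-the-envelope sketch (my own addition, using the standard birthday-bound approximation, not anything claimed in this post) of the collision probability for an n-bit hash:

```python
import math

def collision_probability(n_bits: int, n_keys: float) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-k^2 / 2^(n+1))."""
    # expm1 keeps floating-point precision when the probability is tiny
    return -math.expm1(-(n_keys ** 2) / (2.0 * (2.0 ** n_bits)))

# A 128-bit hash (e.g. MD5) over a billion keys: vanishingly small, but never zero.
p = collision_probability(128, 1e9)
```

With those numbers, p comes out on the order of 10^-21: small enough to plan around, never small enough to ignore, which is why a collision handling strategy still matters.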
Second: you need a collision handling strategy baked into your DV2 implementation.
I am currently working on a few different alternatives that I will announce at next year's WWDVC conference. If you haven't signed up, you should: it's the place to be if you are thinking about Data Vault, or using Data Vault in your enterprise data warehousing efforts.
Third: in order to build DV2-compliant models, you must use a hash.
No, sequencing is not compliant with Data Vault 2.0; hashing is required. In fact, replacing sequences with hash results is _required_ for a model to be Data Vault 2.0 compliant.
Fourth: what are my hash function choices?
There is a list of hash functions on Wikipedia: http://en.wikipedia.org/wiki/list_of_hash_functions. It is not complete, but it has some decent functions listed. What I recommend is MD5. Why? Because it is the most ubiquitously available across platforms, and has a decently low chance of collision, especially when the data is salted on the way in. In reality, what I am suggesting is the use of any algorithm that produces a 128-bit or larger result. Why? Not for cryptographic reasons; we are not trying to protect the data from attack. We are merely attempting to compute a known value "in-stream during the load".
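A minimal sketch of that in-stream computation using Python's hashlib; the salt value, delimiter, and normalization rules here are my own illustrative choices, not prescribed by this post:

```python
import hashlib

SALT = "enterprise-wide-salt"  # hypothetical salt; pick one and keep it stable

def dv2_hash_key(*business_key_parts: str) -> str:
    """Build a 128-bit (MD5) hash key from one or more business key parts."""
    # Standardize case/whitespace, then join with a delimiter so that
    # ("AB", "C") and ("A", "BC") cannot produce the same input string.
    normalized = ";".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5((SALT + ";" + normalized).encode("utf-8")).hexdigest()

# The same business key always yields the same 32-hex-character (128-bit) value,
# regardless of incoming case or padding:
key = dv2_hash_key(" cust-42 ")
```

The normalization step matters as much as the algorithm: if one feed sends "cust-42" and another sends "CUST-42 ", they must still land on the same hash key.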
If you don't like, or don't approve of, MD5, then choose another function, one that you are comfortable with: SHA-1, MD6, SHA-256, Spectral, Murmur, City, or Spooky.
Just remember: if you choose a function that is not currently available in your tool set, you have to write the code / extension yourself (which is fine; most of these functions are downloadable in source form). But the maintenance is up to you at that point, not the vendor.
Fifth: assign the hash value on the way into the "staging area".
Remember: the staging area might be a Hadoop or NoSQL system, or it might be a relational database. You have freedom and time at this point to check for duplicate collisions and handle hashing issues before continuing with the loading cycle.
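As a sketch of what that staging-time check can look like (the row layout is hypothetical): a true collision shows up as one hash value mapped to more than one *distinct* business key, while the same business key hashing to the same value repeatedly is just a harmless duplicate:

```python
def find_collisions(staged_rows):
    """staged_rows: iterable of (hash_key, business_key) pairs.

    Returns {hash_key: set_of_business_keys} for any hash produced by
    two or more different business keys, i.e. a real collision.
    """
    by_hash: dict[str, set[str]] = {}
    for hash_key, business_key in staged_rows:
        by_hash.setdefault(hash_key, set()).add(business_key)
    return {h: bks for h, bks in by_hash.items() if len(bks) > 1}

rows = [("aa11", "CUST-1"), ("aa11", "CUST-1"), ("bb22", "CUST-2")]
# No collision here: "aa11" repeats, but always for the same business key.
```

Running this before the downstream loads begin is what buys you the "freedom and time" described above: a collision caught in staging can be handled before it poisons the hubs and links.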
Sixth: why don't range-partitioned surrogate keys work?
Well, the jury is still out on this one, but here is the conundrum: nearly every data model, whether it's Data Vault or not, expresses a parent-child relationship. Even with referential integrity shut off, the process (if using surrogates) still requires the parent surrogate key to be created before the child row can be inserted. It also requires the child row to look up, and thus depend on, the parent value before the insert can take place.
This is a problem in cross-platform (heterogeneous) environments, and it is also a problem with large volumes arriving at high velocity. Anything that introduces dependencies into the process will slow it down, sometimes significantly. The fact remains that surrogate key lookups introduce parent-child dependencies into the chain, not to mention increased caching, use of temp areas, sorting and clustering needs, increased CPU consumption, and so on.
Hashing is currently the only known technique that allows a child key to be computed in parallel with the parent key, so the two rows can be loaded independently of each other.
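To illustrate that independence, here is a minimal sketch (table and key names are mine, not from this post) where a link's key is derived purely from the same business keys the hubs use, so no lookup against the parent tables is ever needed:

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hub keys: each derived only from its own business key.
customer_hk = md5_hex("CUST-42")
product_hk = md5_hex("PROD-7")

# Link key: derived from the same business keys, joined with a delimiter.
# No lookup against the hub tables is required, so this row can be built
# before, after, or in parallel with the hub rows.
order_link_hk = md5_hex("CUST-42" + ";" + "PROD-7")
```

With sequences, `order_link_hk` could only be built after both hub rows existed and their surrogate values had been looked up; with hashes, all three keys are pure functions of the source data.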
Seventh: want to argue these points? Do your homework; look up the mathematics for yourself.
The point is, until the conundrum I posed above is solved, surrogates remain (and will continue to remain) a bottleneck in loading massive data sets at high arrival speed. If you have a mathematical solution to the conundrum, I gather it would be very valuable as a mathematical proof made available to the world, and I would encourage you to publish it and patent your solution. Until then, hashing is the only way to truly achieve near-linear scalability. By the way, range-partitioned surrogate keys (SKs, sequences) do work as long as all the data lives on a homogeneous platform (a single database environment).
If you are happy with surrogates and don't have performance problems, then perhaps you only need the methodology and other components of DV2, and not the modeling pieces.
Hope this helps,