
# Data Vault 2.0 and Hash Keys, Again?

Once again, unfortunately, people in the market have chosen to lambast hash keys and Data Vault 2.0 for its use of them. I will now attempt (again) to clear the air on this issue.

## What Is a Hash?

Learn, read, understand: you can search Google for this information. Many, many blogs, mathematicians, and even Wikipedia define it for you.

## What Is a Hash Key?

Simply put? A surrogate key, consistently generated and assigned for a given combination of business key field(s), with the expectation that it is mostly unique. In Data Vault 2.0 we replaced sequence numbers with hash keys. Why?

  • Sequence numbers break under volume, forcing re-design and re-engineering of the architecture.
  • Sequence numbers don't allow cross-system joins without looking up the value and copying it across to other systems.
  • Sequence numbers bottleneck on high-speed, high-volume systems (see point #1 above).
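To make the idea concrete, here is a minimal sketch in Python of deriving a hash key from business key fields. The MD5 choice, the `||` delimiter, and the trim/upper-case cleansing rules here are illustrative assumptions; real implementations pick their own conventions:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Deterministically derive a surrogate hash key from one or more
    business key fields: trim, upper-case, join with a delimiter, hash.
    (The cleansing rules and delimiter are illustrative choices.)"""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always produces the same key on any system,
# with no lookup against a parent table and no sequence generator:
print(hash_key("  cust-1001 ") == hash_key("CUST-1001"))  # True
```

Because the key is computed rather than looked up, two systems on opposite sides of the world can assign the same key independently.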

## Wait, I Thought Primary Keys Should Always Be Unique?

Yes, they should. However, there are mathematical properties of hashes that render them only mostly unique as primary keys. This introduces something called a collision. If you don't understand what a hash collision is, you can read more about it here.
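For intuition, the birthday bound gives the approximate odds of at least one collision among n keys drawn from a b-bit hash space: p ≈ 1 − e^(−n(n−1)/2·2^b). A quick illustrative sketch, using MD5's 128-bit output size:

```python
import math

def birthday_collision_probability(n_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision among n_keys
    values drawn uniformly from a hash_bits-bit space (birthday bound)."""
    exponent = -n_keys * (n_keys - 1) / (2.0 * 2.0 ** hash_bits)
    return -math.expm1(exponent)  # 1 - e**exponent, computed precisely

# Ten billion keys hashed into a 128-bit (MD5-sized) space:
print(birthday_collision_probability(10_000_000_000, 128))  # ~1.5e-19
```

Mostly unique, then, but never guaranteed unique: the probability is tiny yet nonzero, which is exactly why a collision strategy is needed.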

## So Why Did DV2 Choose Hashes as Keys?

Really? Again? The list above wasn't enough for you, eh? OK… take a read on the different types of hashing functions, including something called a perfect hash. The different types perform differently, but all provide mostly unique numbers. That said, if you choose to use a hash function, then you must understand that you need a collision strategy. If you refuse to know or acknowledge a collision strategy, then what is left for you to use as a primary key? (More on this in a bit.)

DV2 chose hash keys for the following reasons:

  • Loading your Data Vault in parallel over massive volumes.
  • Loading a geographically distributed (or heterogeneous-server distributed) Data Vault EDW in parallel, allowing lazy joins later.
  • Semi-encrypting data sets in order to meet country regulations (where clear text cannot be copied across country boundaries).
  • And #1: not all relational database engines work on business key joins!

We already know sequences are dead for large-scale systems. Don't believe me? Why, then, do MPP solutions (like Teradata) hash your data before executing joins? How about SAP HANA? It does the same thing. In order to scale, the "look up the parent record before inserting the child" pattern needs to go; otherwise, the system will ultimately reach its maximum processing capability and bottleneck.

## So Just Add More Hardware, You Say…

I wish it were that simple. It isn't: over 200+ TB of data in your warehouse, lookups simply don't work. Oh, and what about multi-system distributed data warehouses? What do you do then? Look up over the ocean from one country to another? How do you process terabytes of data with a cross-country lookup? Will it sustain performance as your data sets grow? No, it won't.

Don't take my word for it; try it yourself. Put 360 billion records in one parent table and 200 billion records in the child, split them across the ocean, and try to perform lookups. Come back and tell me your performance results.

But wait, there's more: now do this for 20 loads in parallel and tell me it performs. It won't; it doesn't. More hardware reaches a point of diminishing returns. You won't see scalable gains from scale-up; that's why MPP is scale-out.

## But What About Natural Keys?

Natural keys are good if you have them, your platform works with them to join data under the covers, your platform is housed in a single data center, and you don't have to ship clear-text data over country boundaries.

Systems like SAP HANA and Teradata both work with natural keys / original key values. Why? Because they hash the values under the covers before the join is actually executed. That's why.

The difference is that the DV2-proposed hash targets uniqueness, whereas the hashing of business/natural keys on Teradata is bucketizing (the general use of hashing).

Yes, these two systems are fast and can work wonders on natural keys.

## Wait, I Have Oracle or SQL Server… What About Them?

Well, these two database engines don't hash character columns underneath for distribution or join processing. Here, whether a join on a natural/business key is faster than a join on a hash key is simple science: the natural/business key wins when it is shorter than the hash key, while the hash key wins when the business/natural key join spans two or more fields and the hash key join is on a single field.

Now, that said: if you are using DV2 hash keys, then you *should* be aware of this: storing the hash key in fixed binary format is acceptable on both platforms (saving them in binary cuts the storage in half).
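A quick illustration of where the halving comes from (Python; the same principle maps to storing the key in a fixed binary column type rather than a 32-character hex text column):

```python
import hashlib

digest = hashlib.md5(b"CUST-1001")
raw = digest.digest()          # fixed binary form: 16 bytes for MD5
hex_text = digest.hexdigest()  # hex text form: 32 characters

# The hex string spends two characters per byte, so storing the raw
# bytes instead of the hex text halves the key's footprint:
print(len(raw), len(hex_text))  # 16 32
```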

## Collisions: What About Them? What If I Can't Afford Them?

Data loss in a data warehouse that is supposed to be a raw, auditable system of record is unacceptable. To that end, if you are going to use a hash key, then you should be smart enough to read about hash collision strategies. Unfortunately, the naysayers out there in the marketplace put too much focus on asking the question (what about collisions?) and don't bother making any statements about resolutions or strategies that solve the problem. Interesting, eh?

OK, so it turns out I do teach the best possible solution, and one of the only ones that works: a) you must watch for collisions, and b) the solution is called a reverse hash: hash the reversed original string, and store both digests.
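A minimal sketch of that dual-digest idea (Python; the function name is mine, not a fixed standard): hash both the original business key and its reversed string. Two distinct keys that happen to collide on the forward hash will, with overwhelming probability, differ on the reverse hash, so the load process can detect the collision and flag it:

```python
import hashlib

def forward_and_reverse_hash(business_key: str) -> tuple[str, str]:
    """Return (hash of key, hash of reversed key). Storing both lets a
    load process detect a forward-hash collision: two different keys
    with the same forward hash will almost surely differ on the
    reverse hash, flagging the row for resolution."""
    forward = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    reverse = hashlib.md5(business_key[::-1].encode("utf-8")).hexdigest()
    return forward, reverse

f1, r1 = forward_and_reverse_hash("CUST-1001")
f2, r2 = forward_and_reverse_hash("CUST-1002")
print(f1 != f2, r1 != r2)  # True True — distinct keys, distinct digests
```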

Now, I will say this: there are customers with 350+ TB of raw data in a DV2 landscape who have been watching for hash collisions using both MD5 and SHA-1 for the past six years, across billions of records, and have found none so far.

Also remember: the collision rate is per hub (because the business keys are split across hubs), so it is not the total of all business keys in your business that drives the collision ratio; it is the number of unique keys in each hub.
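The per-hub point can be sketched with the same birthday-bound arithmetic (the numbers below are illustrative): the collision odds for each hub depend only on that hub's unique key count, not the enterprise-wide total.

```python
import math

def birthday_p(n_keys: int, hash_bits: int) -> float:
    """Birthday-bound collision probability for n_keys values
    drawn uniformly from a 2**hash_bits space."""
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * 2.0 ** hash_bits))

# 10 billion keys total, split evenly across 100 hubs (128-bit hash):
pooled = birthday_p(10_000_000_000, 128)
per_hub = birthday_p(100_000_000, 128)
print(per_hub < pooled)  # True — each hub's odds are far below the pool's
```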

## In Conclusion…

Hashing is a replacement for sequences. Hashing should not be used on platforms where natural keys work, as long as joins to other environments also work and perform adequately. Hashing your business keys removes the dependency on parent-child lookups, simplifying the loading processes and allowing them to scale across hundreds of terabytes per process without forcing re-engineering.

Again: use natural business keys where platforms support them and heterogeneous platform joins work and perform in accordance with SLAs (i.e., Teradata / HANA).

I've said it before and I'll say it again: sequences are dead! In big data, if they weren't, we would see them attached to everything Hadoop or big-data related. Sorry, Data Vault 1.0 followers: Data Vault 1.0 is a standard published over 16 years ago, long before big data NoSQL solutions were around. Data Vault 2.0 is the only way to scale enterprise data warehouses properly.

Want to know more? I teach all of this and more (including security, privacy, and division of data sets over global enterprise data warehouses) in my class: CDVP2 (Certified Data Vault 2.0 Practitioner).
