
#datavault 2.0, Hashes, one more time

to all: i must apologize for the public display of conflict that has occurred on twitter. it was unprofessional of me. that said, i cannot stand by and let someone take my statements out of context. let's talk brass tacks here (let's get down to the facts, shall we)?

note: all of these statements have been and are taught in the cdvp2 (certified data vault 2.0 practitioner) class since the beginning of the course when hashes were introduced.   you can find out more about the class and where to take it here:

if you want to know more about hashes, please read the following links (this is not my personal math; the mathematics of hashes and collisions should be taken from mathematicians):

fact 1: the core definition of a hub is: a unique list of business keys (nothing more, nothing less). no sequences, no hashes, no load dates, no record sources. (since 1995, before any general public publication)

fact 2: sequences were added to dv1.0 in 2001, when i first published, to allow effective joins on all platforms except teradata.

fact 3: teradata uses hashes to bucket data across nodes, amps, modules, disks, segments, sectors, etc. read more here:

fact 4: teradata joins on the internal hashes created from the column(s) selected as the pi (primary index). teradata does not join on literal values, even if sequences are in place. read more here:

fact 5: sap hana now does the same thing: it hashes the distribution and bucketing columns, and joins internally on the hashes.

fact 6: hashes and hash collisions are mathematically defined. i did not create any of the hash algorithms, nor did i define the probabilities of collisions. read more here:

fact 7: yes, keys can collide if using a hash. if a hash is chosen, then a collision mitigation strategy must be applied in order not to lose data. yes, the mitigation strategy may include a reverse hash of the key columns. that said, hash collision strategies must be consistent in their resolution approach; otherwise the data will not match on the next load, and queries could pull back unwanted row sets.
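to make fact 7 concrete, here is a minimal sketch in python of a deterministic collision mitigation strategy (the function names and the specific re-hash rule are hypothetical illustrations, not the certified dv2 procedure): when two different business keys hash to the same value, the loader re-hashes by a fixed rule, so the same input resolves to the same final key on every load.

```python
import hashlib

hub = {}  # in-memory stand-in for a hub: hash key -> business key

def md5_key(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def load_business_key(business_key: str) -> str:
    """Insert a business key, resolving collisions deterministically."""
    hk = md5_key(business_key)
    # collision: same hash, but a *different* business key already stored
    while hk in hub and hub[hk] != business_key:
        # fixed, repeatable rule (here: re-hash the reversed key plus the
        # colliding hash); the same input always lands on the same final
        # hash, so the next load and downstream queries still match
        hk = md5_key(business_key[::-1] + hk)
    hub[hk] = business_key
    return hk
```

the exact rule matters less than its consistency: an inconsistent rule would assign the same business key different hash keys across loads, which is exactly the mismatch fact 7 warns about.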

fact 8: sequences bottleneck load processes at certain levels of volume, requiring re-engineering and re-design of your entire solution in order to scale.  volume levels vary depending on system size, hardware size, and tuning abilities of the dba(s) and system administrator(s).

fact 9: mpp systems like sap hana, teradata, and kudu require hashes to distribute data across the nodes effectively.

fact 10: the hash algorithms these engines apply (teradata, kudu, sap hana) are not available on other platforms; they are not heterogeneous. therefore, if selected or applied, they are limited to internal use on that platform alone.

fact 11: sequences were replaced with hash keys in dv2 as a way to execute joins in large-scale and globally distributed systems, as well as cross-platform heterogeneous joins. please note: hashes are not a replacement for encryption.

fact 12: you can *always* join on natural or business key columns; this will work in any database. it may not perform, but it will work.

fact 13: natural / business key joins will be faster than joins on hash keys if the business key field is smaller (in bytes) than the hash and it is not a multi-column predicate.

fact 14: joins on sequences will always be faster if (again) the byte length holding the sequence is smaller than the byte length holding the hash key. (see fact 8 above for why sequences are deprecated in large-scale solutions)

fact 15: in case it was missed: hashes and sequences are not part of the core architecture, and never were; they are true surrogate values (see fact 1).

fact 16: load dates and record sources are system-driven fields that exist solely for the mechanical purpose of tracing data when something goes wrong, because data lineage at a row level is important.

fact 17: hashes and sequences are system-driven fields that hold no business value and no meaning to the business. once generated, they should never leave the data vault or be shown to business users. hashes are there to overcome scalability and distributed-data issues (when platforms don't support joins on natural / business keys through hashing underneath).

sequences were there in data vault 1.0 to support sql select joins (see fact 11). neither was ever part of the core architecture, as joins on natural keys always work (though they may not always perform). (see fact 1)

fact 18: hash storage in oracle and sqlserver can be converted to a fixed binary, resulting in faster joins. performance gains will vary depending on the tuning and hardware of the platform.
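the storage saving behind fact 18 is easy to see outside the database as well; a short python sketch (the actual sql conversion syntax varies by platform, and the key value here is made up):

```python
import hashlib

# a sha-1 hash rendered as text is 40 characters wide;
# the same value stored as raw (fixed) binary is only 20 bytes
hex_key = hashlib.sha1(b"CUST-001").hexdigest()
bin_key = bytes.fromhex(hex_key)

print(len(hex_key))  # 40 characters if stored as char/varchar
print(len(bin_key))  # 20 bytes if stored as fixed binary
```

halving the key width narrows every join column and index built on it, which is where the performance gain comes from.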

fact 19: md5 has been a suggested algorithm because it is well-rounded, well-tested, and available on most platforms. that said, md5 is being deprecated by database vendors due to its lack of security (not because of its use as a hash key). therefore, it is suggested that sha-1 or sha-2 be applied instead.
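for illustration, hash keys can be computed with any of these algorithms via python's hashlib (the trim / uppercase / delimiter standardization shown is one common convention, assumed here for the sketch, not the only valid one):

```python
import hashlib

def dv_hash_key(*business_keys: str, algo: str = "sha256") -> str:
    # standardize before hashing: trim, uppercase, and join with a
    # delimiter so that ("AB", "C") and ("A", "BC") hash differently
    normalized = ";".join(k.strip().upper() for k in business_keys)
    return hashlib.new(algo, normalized.encode("utf-8")).hexdigest()

print(len(dv_hash_key("cust-001", algo="md5")))     # 32 hex chars (128 bits)
print(len(dv_hash_key("cust-001", algo="sha1")))    # 40 hex chars (160 bits)
print(len(dv_hash_key("cust-001", algo="sha256")))  # 64 hex chars (256 bits)
```

note the trade-off fact 13 and fact 14 describe: the stronger algorithms produce wider keys, which cost more to store and join on.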

fact 20: each hub is its own unique set of business keys. therefore, each hub must be treated as its own collision probability space if hash keys are chosen. it is not mathematically correct to consider the entire set of all business keys across the business as a single set (because they are split into separate hubs in the first place).
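fact 20 can be checked with the standard birthday-problem approximation, p ≈ 1 - exp(-n^2 / 2^(b+1)), applied per hub; a sketch in python (the hub names and row counts are made-up examples):

```python
import math

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Birthday-problem approximation of at least one collision."""
    # -expm1(-x) == 1 - exp(-x), but keeps precision for tiny x
    return -math.expm1(-(n_keys ** 2) / 2 ** (hash_bits + 1))

# each hub is its own set of keys, so the probability is computed per
# hub, never over the union of all business keys in the enterprise
for hub_name, n in [("hub_customer", 10_000_000), ("hub_product", 500_000)]:
    print(hub_name, collision_probability(n, 128))
```

for 10 million keys under a 128-bit hash the per-hub probability is astronomically small (on the order of 10^-25); with a truncated 64-bit value it grows by many orders of magnitude, which is relevant to fact 7.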

fact 21: the data vault standard must work on unstructured and multi-structured data. elements such as images, video files, audio files, and documents usually do not contain a "business key". documents may be the exception here, and may contain many business keys.

the question then becomes: can an independent, unchanging business key be created and assigned so that deltas can be detected downstream? hashes work here, where surrogates do not. hashes can be calculated from specific filtered content (for example: from an image of a face, points of interest can be measured and a hash created that equates to a business key).
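a minimal sketch of that idea in python (the byte strings below are stand-ins for real file content, not actual images): hashing the content itself yields a stable, independent key for data that carries no business key of its own, and a changed hash signals a delta downstream.

```python
import hashlib

def content_key(payload: bytes) -> str:
    """Derive an unchanging key from the content itself."""
    return hashlib.sha256(payload).hexdigest()

image_v1 = b"\x89PNG...raw image bytes..."           # placeholder content
image_v2 = b"\x89PNG...raw image bytes (edited)..."  # changed content

# the same content always yields the same key;
# changed content yields a new key, i.e. a detectable delta
assert content_key(image_v1) == content_key(image_v1)
assert content_key(image_v1) != content_key(image_v2)
```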

the dv2 standard is designed to handle multiple “big data” problems, including unstructured and semi-structured information, and be able to tie them back directly to the relational and structured systems so that ad-hoc query and analysis may take place.

conclusion / commentary:

data vault 2.0 standards accommodate: identification of unstructured and multi-structured data, "big data" huge-scale solutions, petabyte systems, and globally distributed data computation / identification so that distributed joins can work.

yes, hash algorithms may have collisions, and yes, this requires the architect to design a hash collision mitigation strategy.

in the end, creating a unique list of business keys is easy (except when it comes to unstructured data sets). it can be done in parallel, it can be distributed, it can be joined on. that's not the problem. the problem is: how to join on natural / business key fields when they are "large" (50+ characters, or stored across two or more fields).

what if there are no business keys or no keys at all?  without a hash, the data cannot be identified.

it is a technical platform issue and has nothing to do with the business application of the data set. because of this, sequences have long been assigned to compensate, when in fact the platform should be providing better join optimizers and better access to data under the covers. two platforms (and i'm sure there are others) already do this: sap hana and teradata, by hashing the natural / business keys under the covers for you.

relatively large byte structures for natural and business keys force string comparisons, which are slower than numeric comparisons; hence the "default choice" of applying surrogates. the issue is setting the surrogates in place for "child records" so that the joins and dependencies work.

this causes a mathematical issue during load. to be more precise, it causes a dependency that bottlenecks at orders of scale. of course, the "bottleneck" won't appear until the limits of processing power are overwhelmed by the amount of data being sought on a row-by-row basis (the lookup of the parent key).

the other restriction of sequences is that they lock the process design down to a single instance to generate the "parent key". this means that in distributed database systems (mpp or geographically distributed), a lookup has to occur for each and every child record. this is infeasible under heavy load, where parallel system designs can't afford the time to exercise this technique.

so how does an architecture or design-time paradigm solve it? one possible best practice today: hashing. if there were a better technique for uniquely identifying records that scaled with volume and handled mpp solutions, then great; the standard would change, and that technique would be chosen and applied.

today, this is the best known technique. sure, there are "issues", but there are issues with every technique we apply to systems, just as there are issues with sequence numbers. unfortunately, the bottlenecking issues with sequences have no "mitigation strategy" other than to consolidate all data on a single server (which, with country-based regulations, isn't always possible) and scale that single server (which introduces a law of diminishing returns on investment).

there is a facet of hashing called a perfect hash (not my invention by any means); more reading on this subject can be done here: there are problems with this approach too. one is that not all vendors even have this function coded or available, which means your edw / it team must code the function and test it.

another is its length of output, and another is the time to compute the hash value (which varies depending on the perfect hash algorithm selected and the hardware it runs on). perfect hashes are not a feasible approach (in my opinion) to solving unique key identification, much less identifying unstructured data sets.

the data vault standard for hubs

has been, and always will be: a unique list of business keys, hence not requiring surrogates, hashes, or anything else in the hub.

there are myriad benefits to this approach (which has been in place since 1995, long before i ever published standards to the public). one is natural joins; another is an advanced concept called floating satellites; and more. however, the hardware underneath must support these joins efficiently at scale, as well as support efficient at-scale loading processes.

i hope this finally puts this entire issue to bed.  again, data vault 2.0 standards are taught in full in my cdvp2 classes by my authorized instructors.  to pull any of this knowledge out of context is to misunderstand the facts in this list.

thank you for your valuable time,

dan linstedt



11 Responses to “#datavault 2.0, Hashes, one more time”

  1. Clifford Heath 2017/11/13 at 12:03 am #

    There is no known technique to create a “perfect hash” function that can deal with arbitrary data. Given a data set, there are techniques for computing a perfect hash function, but they only work if you know all possible values ahead of time. There is also a perfect hash function that always works if you know the maximum size of an input value; simply pad all values up to that size :). It’s just not useful however; the output is larger than the input. We want a function which (a) never produces a collision (this is what “perfect” means here) (b) has few unused values in the output (a perfect hash function that has no gaps is called a “minimal perfect hash”) and (c) has a compute time that doesn’t depend on the number of values.
    (a) with (c) is possible only when you know all the input values. (b) is always possible; just make a numbered list of all input values and search it – but then you fail at objective (c).

  2. Dan Linstedt 2017/11/13 at 1:27 am #

    Hi Clifford,

You are correct. I've been following perfect hashing for years. Apparently, researchers have recently abandoned the idea of ever finding an open-ended perfect hash. I know they have been working on these algorithms, but it appears there is no such luck.

    Now, that said, I appreciate your comment. What I will say is this: The pure definition of a Hub has been and always will be: BUSINESS KEY driven. A Unique List Of Business Keys.

    That statement is beginning to ring true when DV is implemented on a Hadoop Platform, SAP Hana, or Teradata platform. Why? Because these platforms offer “hashing of the primary index” or hashing of the business key for bucket based partitioning. So from that perspective, the need or desire for both surrogates OR hashes is diminishing. Which leaves us with the question: what about traditional relational technology?

    Traditional relational databases still need some form of “surrogate key” – whether it be sequence, or hash based… But do they? What about simply concatenating the business key composite fields together, and storing that? There are several problems with this approach, I will detail those in yet another blog entry.

    Thanks for your feedback,
    Dan Linstedt

  3. Stephen Schleicher 2017/11/13 at 6:22 pm #

    We currently convert the result from hashbytes to a bigint. Works great for integer based joins and partition distribution with no collisions so far. Seems like this would be a good solution based on the deprecation of older SQL hashing algorithms in SQL 2016 too. Is converting to bigint a bad idea because it increases the chance of collisions?

  4. Dan Linstedt 2017/11/14 at 12:30 am #

Hi Stephen, This *can* cause a significant problem… Be very, very careful! In fact, if the BigInt is "too small" for the result of the conversion, the database will simply truncate the remaining bits and NOT give you an error. I believe BigInt in SQLServer is too small to house the full result, but you can test this by converting the BigInt back to the hash value and making sure they match. I *do not* recommend this approach at all.
Yes, it is a bad idea, but not primarily because it increases the chance of collisions (which does happen as a result of the truncation), but rather because of the truncation of bits itself (loss of data); this is a serious issue.

    THE BEST Idea / resolution strategy is to revert to a single field (concatenation of business keys) – this is cross platform, deterministic, and works without any collisions ever. (go back to the ORIGINAL definition of a Hub Structure)

    Hope this helps,
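The truncation described above can be sketched in a few lines of python (the key value is a made-up example; SQL Server's hashbytes-to-bigint conversion differs in syntax but discards bits the same way): an md5 digest is 128 bits, while a bigint holds only 64, so half the hash is silently lost.

```python
import hashlib

digest = hashlib.md5(b"CUST-001").digest()   # 16 bytes = 128 bits
full_value = int.from_bytes(digest, "big")

# a 64-bit bigint keeps at most the low 64 bits; the rest is discarded
truncated = full_value % (1 << 64)

print(len(digest) * 8)         # 128 bits produced by md5
print(truncated.bit_length())  # at most 64 bits survive
# the original digest cannot be reconstructed from the truncated value
```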

  5. Stephen Schleicher 2017/11/14 at 9:35 am #

    Huge help…thanks. We will reverse the conversion and see what shakes out, but moving away from big int is certainly the best route based on the truncated bits. Really appreciate the guidance Dan!

  6. Stephen Schleicher 2017/11/30 at 1:49 pm #

    Hi Dan,

    We have a question regarding adding the business key to the hashdiff. Originally we only hashed the attributes that make up the descriptive data (not including the business key). In your book, you seem to imply that adding the business keys to the hash diff will perform better when doing the lookups against the satellite in the raw vault because you only need to left join on the LATEST hashdiff (not the business key) and then only insert a new record where the hash diff was not found. I see performance value in that, however, the example in the book where you load satellites doesn’t seem to reflect this if I’m reading it correctly. It still joins based on the business key and then has to compare the hash diffs in addition to a NULL check. If we want to take advantage of adding the business key in the hashdiff, wouldn’t we only need to join based on the LATEST hashdiffs in the satellite (not the business key). Any record not found as a result of the left outer join on the hash diff would be an insert into the satellite (no need to do both a NULL check AND a stage.hashdiff != sat.hashdiff).


  7. Dan Linstedt 2018/02/07 at 7:42 am #

    Hi Stephen, No conclusive evidence has been found in performance improvements by adding business keys to the Hash Differences. Therefore, over the past several years, I have pulled it from the specification. However, you are welcome to perform your own tests and share the results. That said: IF you perform tests, please test with at least 1 Billion records in the Satellite. Otherwise the performance numbers simply won’t show accurate answers due to pre-caching of the database engines. Sincerely, Dan

  8. Zeev 2018/05/02 at 4:54 pm #

    Not sure if that’s the place, but I am not really sure how real parallelism can be achieved in for example in Hadoop. In your book you say that “Locking on a row-level and executing (and committing) only micro-batches enable the use of full parallelized execution without further handling.”
    You can’t do that in Hadoop, you don’t have constraints, so we are still left with running one batch at a time.
    Am I missing something?

  9. Dan Linstedt 2018/05/03 at 9:43 am #

    Yes, actually – there is a great deal of information.

You CAN run parallel processes in Hadoop (as I'm sure you are already aware); I think you are referring to the fact that you can't turn on referential integrity in Hadoop. We do use parallel inserts from multiple streams in real time; there are no locking elements in Hadoop.

    I am curious to have you explain what your thoughts are around “not sure how real parallelism can be achieved in Hadoop” or in other databases…


  10. Zeev 2018/05/04 at 8:19 am #

    I will clarify, sorry.
Let’s say I am running the ETL/ELT process for multiple periods in parallel (separate batches).
    It may be that some business keys will appear in several batches.
    If all batches write in parallel to a table, you can’t eliminate duplications in the data.


  11. Dan Linstedt 2018/05/24 at 3:09 pm #

Actually, I am confused again. This also is not true: if you follow the standard design patterns that I’ve written into the book, Building a Scalable Data Warehouse with Data Vault 2.0, you can definitely ALWAYS eliminate duplicates. Ok, except possibly Hadoop, because in Hadoop there are no locks on the target table and no enforcement of primary keys or primary key violations. It is technically possible in Hadoop for a Satellite (and even a Hub or a Link, for that matter) to receive inserts of the same data at the same time. However, this should be rare, especially if you are writing to Hive managed "files/tables" and not plain HDFS table structures. Why? Because the Satellites always find the latest row, and should be doing a comparison to avoid loading dupes. Hubs and Links should always insert ONLY what does not exist.
If the standards are broken, then you do not have a standard Data Vault implementation, and I encourage you to go back and revisit the standards.

