frequently asked questions

for data vault modeling, methodology, architecture and implementation.  i will do my best to answer questions here as they come up.  feel free to post new questions, please read the proper category descriptions to keep them in the right place.

have a question?  submit one here.

Data Vault Implementation

i recently saw a post about sql server deprecating older hashing algorithms requiring the newer sha2 versions to be used which would increase the hash. i tried to reply to that but could not for some reason so i’ll just pitch the question here. we currently get around this by converting the value from hashbytes to a bigint. that has the plus of introducing integer based joins versus character based hash joins as well as providing good partition distribution, but we have always wondered if it increases the risk of collisions. we’ve tested this with all the algorithms and have yet to come across a collision…keeping our fingers crossed.. i actually wondered why this was not mentioned in the book as an alternative. is it because it could increase the chance of collisions or some other consideration dan?

Did you find this FAQ helpful
  • Dan Linstedt says:

    No. Most relational databases (except for Java Implementations) make BigInt too small to carry the full hash converted format.
    They will automatically truncate bits, WITHOUT triggering an error. Please NEVER do this – as it will cause far more problems than it is worth dealing with.
    Bigint: https://docs.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql
    Size is 8 bytes
    Result size of Hash is 16 bytes (DOUBLE!!!)

    You can & should test this result by converting the bigint BACK to a hash, and then comparing the value – as you will find, this will not work.

    Sorry, you are causing serious problems with your hashing by doing this.

  • Comment on this FAQ


    hashbytes and md5 deprecated in sqlserver 2016

    i found this information in sql server 2016 documentation.

    beginning with sql server 2016, all algorithms other than sha2_256, and sha2_512 are deprecated. older algorithms (not recommended) will continue working, but they will raise a deprecation event.

    would using this have a performance impact? would you recommend using either of the supported algorithms as a dv standard?

    Did you find this FAQ helpful
  • Dan Linstedt says:

    MD5 and SHA-1 are still around, in fact, SHA-1 has become the new acceptable standard. It is still fast enough, but it helps reduce the possible collisions. Even though vendors have deprecated these functions, they still work.

  • Archie says:

    Dan, do you have any thoughts now?

  • Dan Linstedt says:

    Answer: YES. This particular change to Hashes in SQLServer 2016 WILL impact performance in a negative fashion. Not just for loading but for querying.

    In reality, we truly WANT to leverage Business Keys. Sadly, SQLServer does not “hash bucket” the business keys for partitioning under the covers. Teradata, SAP Hana, Kudu, and Hive are all capable of hashing by Business Key. So, I will look deeper at this function in an attempt to find a better solution.

  • Comment on this FAQ

    Load More