to all: i must apologize for the public display of conflict that has occurred on twitter. it was unprofessional of me. that said, i cannot stand by and let someone take my statements out of context. let's get down to brass tacks (the facts), shall we?
note: all of these statements have been and are taught in the cdvp2 (certified data vault 2.0 practitioner) class since the beginning of the course when hashes were introduced. you can find out more about the class and where to take it here: http://datavaultalliance.com
if you want to know more about hashes, please read the following links (this is not my personal math; the mathematics of hashes and collisions comes from mathematicians):
- http://preshing.com/20110504/hash-collision-probabilities/ a great guide on the number of unique values needed in sha-1 (160 bits) to hit a 50% collision probability: 1.42 x 10^24
- http://big.info/2013/04/md5-hash-collision-probability-using.html (also covers the birthday problem for md5 – according to his calculations, you need 2.2×10^19 unique input values to reach a 50% chance of collision; don't argue with me, argue with the math)
- https://ad-pdf.s3.amazonaws.com/papers/wp.md5_collisions.en_us.pdf more about md5 and collisions (remember, we are not proposing md5 as a secure algorithm, just as a surrogate key replacement)
- https://en.wikipedia.org/wiki/birthday_problem (read about the birthday problem which reduces the number of unique values needed to reach a collision)
- https://www.theregister.co.uk/2017/02/23/google_first_sha1_collision/ google created a sha-1 collision with 6,610 years of processor time + 110 years of gpu time (for one collision).
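the birthday-problem numbers in the links above can be sanity-checked with a few lines of code. a minimal python sketch using the standard approximation p ≈ 1 − e^(−n²/2m) for n random hashes drawn from a space of m values (the function name is mine, not from any of the linked articles):

```python
import math

def collision_probability(n_values, hash_bits):
    """approximate the chance of at least one collision among n_values
    random hashes drawn from a 2**hash_bits space (birthday problem)."""
    space = 2.0 ** hash_bits
    # valid while n_values is far smaller than the hash space
    return 1.0 - math.exp(-(n_values ** 2) / (2.0 * space))

# md5 (128 bits) at ~2.2 x 10^19 inputs: roughly a coin flip
print(collision_probability(2.2e19, 128))   # ~0.51
# one billion business keys against md5: vanishingly small
print(collision_probability(1e9, 128))
# sha-1 (160 bits) at ~1.42 x 10^24 inputs: ~0.5, matching preshing
print(collision_probability(1.42e24, 160))
```

the point: at the key volumes a single hub actually holds, the collision probability is astronomically small – but it is never zero, which is why fact 7 below still matters.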
fact 1: core definition of a hub is: a unique list of business keys (nothing more, nothing less), no sequences, no hashes, no load dates, no record sources. (since 1995 before general public publications)
fact 2: sequences were added to dv1.0 in 2001 when i first published to allow effective joins on all platforms except teradata.
fact 3: teradata uses hashes to bucket data on nodes, amps, modules, disks, segments, sectors, etc. read more here: https://www.tutorialspoint.com/teradata/teradata_hashing_algorithm.htm
fact 4: teradata joins on the internal hashes created from the column(s) selected to be a pi (primary index), teradata does not join on literal values, even if sequences are in place. read more here: http://www.info.teradata.com/htmlpubs/db_ttu_14_00/index.html#page/sql_reference/b035_1142_111a/ch03.125.092.html
fact 5: sap hana now does the same thing, hashing distribution and bucket columns, and joining internally on the hashes.
fact 6: hashes and hash collisions are mathematically defined. i did not create any of the hash algorithms, nor did i create the chances or probabilities of collisions. read more here: https://en.wikipedia.org/wiki/hash_function
fact 7: yes, keys can collide if using a hash. if a hash is chosen, then a collision mitigation strategy must be applied so that no data is lost. yes, the mitigation strategy may include a reverse hash of the key columns. that said: hash collision strategies must be consistent in their resolution approach – otherwise the data will not match on the next load and queries could pull back unwanted row sets.
fact 8: sequences bottleneck load processes at certain levels of volume, requiring re-engineering and re-design of your entire solution in order to scale. volume levels vary depending on system size, hardware size, and tuning abilities of the dba(s) and system administrator(s).
fact 9: mpp systems like sap hana, teradata, and kudu require hashes to distribute data across the nodes effectively.
fact 10: the hash algorithms that these engines apply (teradata, kudu, sap hana) are not available on other platforms; they are not heterogeneous – therefore, if selected or applied, they are usable only internally on that platform.
fact 11: sequences were replaced with hash keys in dv2 as a way to execute joins in large scale and global distributed systems, as well as cross-platform heterogeneous joins. please note: hashes are not a replacement for encryption.
fact 12: you can *always* join on natural or business key columns; this will always work in any database. it may not perform, but it will work.
fact 13: natural / business key joins will be faster than joins on hash keys if the business key field is shorter (smaller) than the hash and the join is not a multi-column predicate.
fact 14: joins on sequences will always be faster if (again) the byte length that holds the sequence is smaller than the byte length holding the hash key. (see fact 8 above as to why sequences are deprecated in large scale solutions)
fact 15: in case it was missed, hashes and sequences are not part of the core architecture and never were – they are true surrogate values (see fact 1).
fact 16: load dates and record sources are system-driven fields that exist solely for the mechanical purpose of tracing data when something goes wrong, because data lineage at the row level is important.
fact 17: hashes or sequences are system-driven fields that hold no business value and no meaning to the business. once generated in the data vault, they should never leave the data vault or be shown to the business users. hashes are there to overcome scalability and distributed data issues (when platforms don't support "joins on natural / business keys through hashing underneath").
sequences were there in data vault 1.0 to support sql select joins (see fact 11). neither was ever part of the core architecture, as joins on natural keys always work (though they may not always perform). (see fact 1)
fact 18: hash storage in oracle and sqlserver can be converted to a fixed binary, resulting in faster joins. performance gains will vary depending on tuning and hardware of the platform.
fact 19: md5 has been the suggested algorithm because it is well-rounded, well-tested, and available on most platforms. that said, md5 is being deprecated by database vendors due to its lack of security (not because of its use as a hash key). therefore, it is suggested that sha-1 or sha-2 be applied instead.
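computing a hash key from business key columns is a one-liner on most platforms. a python sketch of the idea (the normalization rules shown – trim, uppercase, "|" delimiter – are illustrative choices of mine; whatever rules are picked must match across every loading process, per fact 7):

```python
import hashlib

def dv2_hash_key(*key_columns, algo="sha1"):
    """build a hash key from one or more business key columns.
    sha-1 or sha-2 ("sha256") is preferred over md5 per fact 19."""
    normalized = "|".join(str(c).strip().upper() for c in key_columns)
    return hashlib.new(algo, normalized.encode("utf-8")).hexdigest()

print(dv2_hash_key("cust-001 "))             # normalizes same as "CUST-001"
print(dv2_hash_key("ORD-42", "2017-02-23"))  # multi-column business key
print(dv2_hash_key("ORD-42", algo="sha256")) # sha-2 variant, longer output
```

note the delimiter matters: without it, the column pair ("a", "b") and the single column "ab" would hash identically.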
fact 20: each hub is its own unique set of business keys. therefore, each hub must be treated as its own collision probability if hash keys are chosen. it is not mathematically correct to consider the entire set of all business keys across the business as a single set (because they are split into separate hubs in the first place).
fact 21: the data vault standard must work on unstructured and multi-structured data. elements such as images, video files, audio files, and documents usually do not contain a "business key". documents may be the exception here, and may contain many business keys.
the question then becomes: can an independent, unchanging business key be created and assigned so that deltas can be detected downstream? hashes work here, where surrogates do not. hashes can be calculated over specific filtered content (for example: from an image of a face, points of interest can be measured and a hash can be created that equates to a business key).
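a minimal sketch of that idea for raw unstructured content: hash the bytes (or a filtered subset of them) to get a stable, independent key, so a change in content shows up downstream as a changed key. the function name and stand-in bytes are mine, for illustration only:

```python
import hashlib

def content_key(payload: bytes, algo: str = "sha1") -> str:
    """derive a stable, independent key from unstructured content
    (image, audio, or document bytes); if the bytes change, the key
    changes, so deltas can be detected without a business key."""
    return hashlib.new(algo, payload).hexdigest()

image_v1 = b"\x89PNG...frame-1"   # stand-in for real file bytes
image_v2 = b"\x89PNG...frame-2"

assert content_key(image_v1) == content_key(image_v1)  # stable across loads
assert content_key(image_v1) != content_key(image_v2)  # delta detected
```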
the dv2 standard is designed to handle multiple “big data” problems, including unstructured and semi-structured information, and be able to tie them back directly to the relational and structured systems so that ad-hoc query and analysis may take place.
conclusion / commentary:
data vault 2.0 standards accommodate: identification of unstructured and multi-structured data, "big data huge scale" solutions, petabyte-scale systems, and globally distributed data computation / identification so that distributed joins can work.
yes, hash algorithms may produce collisions, and yes, that forces the architect to design a hash collision mitigation strategy.
in the end – creating a unique list of business keys is easy (except when it comes to unstructured data sets). it can be done in parallel, it can be distributed, it can be joined on. that's not the problem. the problem is: how to join on natural / business key fields when they are "large" (50+ characters, or stored across 2 or more fields).
what if there are no business keys or no keys at all? without a hash, the data cannot be identified.
it is a technical platform issue and has zero to do with the business application of the data set. due to this fact, sequences have long been assigned to compensate, when in fact the platform should be creating better join optimizers and better access to data under the covers. two (and i’m sure others) platforms already do this: sap hana, and teradata by hashing the natural keys / business keys under the covers for you.
relatively large byte structures for natural and business keys force string comparisons, which are slower than numeric comparisons – hence the "default choice" of applying surrogates. the issue is: setting the surrogates in place for "child records", so that the joins and dependencies work.
this causes a mathematical issue during load – to be more precise, a dependency that becomes a bottleneck as scale grows. of course, the "bottleneck" won't appear until the limits of processing power are overwhelmed by the amount of data being sought on a row-by-row basis (lookup of the parent key).
the other restriction with sequences is that they lock the process design down to a single instance that generates the "parent key". which means, in distributed database systems (mpp or geographic distribution), a lookup has to occur for each and every child record. this is infeasible under heavy load, where parallel system designs can't afford the time to exercise this technique.
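the contrast can be made concrete: with a sequence, every child-record load must look up (or wait on) the single generator that issued the parent key; with a hash, each loader derives the parent key locally from the business key itself. a minimal python sketch of the hash side:

```python
import hashlib

def hash_key(business_key):
    # deterministic: any node, any loader, any day -> the same value
    return hashlib.sha1(business_key.strip().upper().encode("utf-8")).hexdigest()

# a hub loader on node a and a link/satellite loader on node b run
# independently, in parallel, yet compute the identical parent key --
# no lookup, no central sequence generator, no cross-node dependency.
parent_key_on_node_a = hash_key("CUST-001")
child_fk_on_node_b = hash_key("CUST-001")
assert parent_key_on_node_a == child_fk_on_node_b
```

this independence is exactly what removes the row-by-row parent-key lookup described above.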
so how does an architecture or design-time paradigm solve it? one possible best practice today: hashing. if there were a better technique for uniquely identifying records that scaled with volume and handled mpp solutions, then great – the standard would change, and that technique would be chosen and applied.
today, this is the best known technique. sure there are "issues", but there are issues with every technique we apply to systems, just as there are issues with sequence numbers. unfortunately, the issues with sequences bottlenecking have no "mitigation strategy" other than to consolidate all data on a single server (which, with country-based regulations, isn't always possible) and scale that single server (which introduces a law of diminishing returns on investment).
there is a facet of hashing called a perfect hash (not my invention by any means). more reading on this subject can be done here: https://en.wikipedia.org/wiki/perfect_hash_function there are problems with this approach too. one is that not all vendors even have this function coded or available, which means your edw / it team must code the function and test it.
another is the length of its output, and another is the time it takes to compute the hash value (both of which vary depending on the perfect hash algorithm selected and the hardware it runs on). perfect hashes are not a feasible approach (in my opinion) to solving unique key identification issues, much less identifying unstructured data sets.
the data vault standard for hubs
has been, always will be: a unique list of business keys – hence not requiring surrogates, hashes, or anything else in the hub.
there are myriad benefits to this approach (which has been in place since 1995, long before i ever published standards to the public). one of them is natural joins; another is an advanced concept called floating satellites; and more… however, the hardware underneath must support these joins efficiently at scale, as well as support efficient at-scale loading processes.
i hope this finally puts this entire issue to bed. again, data vault 2.0 standards are taught in full in my cdvp2 classes by my authorized instructors. to pull any of this knowledge out of context is to misunderstand the facts in this list.
thank you for your valuable time,