There’s been some discussion, nay, arguments of late – around whether or not to replace surrogates in DV2 models with Hashes, or to simply use natural keys… ok – perhaps Natural keys is too strong a word, maybe business keys is a softer side (as frequently found, source system models these days usually contain surrogates acting as business keys, and natural keys are no where to be found). This blog entry (albeit short) will explain the pros and cons of each. I welcome you to add your thoughts to this post (on LinkedIn “Data Vault Discussions” group).
Let’s take a look at how we define each.
Natural keys: data elements or attributes (may be composite, may not be composite) which when viewed by the human intellect appear to contain some form of descriptive metadata, enabling deciphering of the true meaning or representation without “adding” additional context (ie: without looking up additional information to figure out what it means).
Business Keys: Any “unique identifier” that is presented to the business user, for the purposes of uniquely finding, locating, and understanding data sets. It *may* be a sequence number, or it *may* be a natural key. Frequently these days, Business keys (as bad as it sounds) are generally sequences provided by, generated by, and maintained by a single source system application.
One part of the Data Vault 2.0 Standard requires changing from Surrogates in the Data Vault Model, over to Hash Keys as the primary key for Hubs, and Links. The question is: why not simply use the Business Keys OR the natural keys? Why go through the trouble of hashing the Natural or Business Keys to begin with?
Let’s examine this a little deeper:
#1) What is driving the need to switch OFF sequences to begin with?
- Sequences cause bottlenecks in Big Data Solutions.
- Sequences require dependencies in loading cycles, slowing down real-time feeds, regardless of whether or not referential integrity is turned on in the database physical layers.
The bottlenecks are to a Hybrid system, one that uses Hadoop (or NoSQL) for staging and deep analytics. These bottlenecks are because of the dependency chain mentioned in item #2 above. The load to “hadoop” or NoSQL requires a lookup on the Hub or Link sequence in the relational system before it can “insert” or attach it’s data. Sure, you can copy the file in to Hadoop, but the “sequence” has to be attached before it can be joined back to the relational system. This defeats the purpose of having a Hadoop or NoSQL platform to begin with. The simple fact is, it has to “LOOKUP” one or more sequences from the relational database, what a pain!
In a fully relational system, larger loads require “longer” cycles for loading. Why? because the Hubs must be inserted BEFORE the links can use the sequences, and BOTH must be inserted BEFORE the Satellites can be inserted. Why? because the Satellites depend on the Sequences generated by the parent tables. This happens EVEN WITH referential integrity shut off at the database level.
Neither of these situations are sustainable, nor even advisable in big data or high velocity systems, forcing the data architects to ultimately “re-engineer” the system for growth or arrival speed. Oh, there’s one more issue: MPP data distribution. If data is distributed by “ranges” of sequences, then there is a chance of forcing a Hot Spot to occur in an MPP environment. Nothing that anyone wants, and again defeats the purpose of having MPP in the first place.
Hashing is currently in use (behind the scenes) by many database engines in the MPP world (both relational and non-relational MPP systems) to distribute data along “buckets”, and preferably, reach an average equal distribution in order to avoid the hot-spot problem. That, my friends, still leaves us with the bottleneck and dependency issues to deal with…
Enter Data Vault 2.0
Data Vault 2.0 Modeling (the modeling component ISN’T the only thing that has changed for Data Vault 2.0), provides a standard that states: replace all sequence numbers with the results of a Hash Function. Then, goes on to suggest several hash functions to choose from: MD5, MD6, SHA1, SHA2, etc… And the standards document / 5 page white paper I’ve made available (to my students only on LearnDataVault.com) describes the best practice of how to implement it properly, and cross-platform.
The result of the suggest hash: MD5 is a 128 bit quad-word. In reality, it’s two separate Big Int’s (if you will). Unfortunately in order to handle this kind of data-type natively, the Operating system and / or the Database engine would need to declare and provide a numeric capable of handling 128 bits in length.
Due to the fact that this is simply not feasible today, we change the result of the hash FROM the binary representation in to a CHAR(32) (128 bits converted to ASCII HEX STRING). Again, all of this is covered for the students on LearnDataVault.com – those who buy a class, or buy the Super Charge book PDF from the site.
Back to the point and the comparison….
Hashing instead of sequencing means that we can load in complete 100% parallel operations to ALL HUBS, ALL LINKS, ALL SATELLITES, and enrich all Hadoop based or NoSQL documents IN PARALLEL at the same time (without dependencies). Providing of course that the proper hash values have been “attached” to the staging data. It also then allows us to join across multiple heterogeneous platforms (From Teradata to Hadoop to IBM DB2 BCU to SQLServer, to CouchBase to MySQL and more). (there is more to this discussion, about collisions, and so on – that are beyond the scope of this article at this time).
Back to the original question…
So why not simply use “business keys” or “natural keys”? Why Hash at all?
- Not all database engines (relational or non-relational) have the capability or capacity to use natural or business keys for data distribution
- Not all database engines (relational or non-relational) have the capability to execute EFFICIENT joins on natural or business keys
- And the BIGGEST reason of all: in a Data Vault Model we leverage a many to many relationship table called a Link. It is made up of multiple keys (from different Hubs). To join ALL these keys together, would mean replicating the business keys to the Link – resulting in (most cases) a variable length multi-part, multi-datatype key set, which would ultimately perform slower than a concisely measured, precise length field. For Satellites, it means replicating the business keys to each of the Satellites as well.
Another statistic: 80% of the “business keys” are variable length character strings (most in Unicode these days, making them twice as long).
The reality of it is?
The business keys that are “longer” than 32 characters in length on average, will perform slower over the life of large data sets than those which are 32 byte hex represenations. Of course, the 32 byte ASCII Hash hex strings will perform “slightly” slower than the sequences (which are smaller yet), but more than make up the query performance difference by resolving the other issues mentioned above.
Hashes, like it or not, serve a purpose. Natural keys and business keys, as good as they are cause additional join issues (like case sensitivity, code-set designation, and so on). Believe me, solving a heterogeneous join (from SQLServer for instance to Hadoop) and dealing with two different code-pages, can cause other problems that JDBC won’t solve.
When to use Natural or Business Key joins instead of Hashes?
When you are embedded on a single system (like Teradata, or GreenPlum for instance) for your entire data warehouse solution, and don’t have “cross-technology” or heterogeneous technology joins to accomplish.
When to use Hashes instead of Natural or Business Keys?
When you need to join heterogeneous environments, or resolve Unicode and case sensitivity issues, or the length of the keys exceeds 32 characters (non-unicode) in length.
In the end, it’s your call if you want to “replicate” business keys/natural keys to Link Structures and do away with Hashes altogether. It would certainly solve the problem, but if you go this route, please do NOT return to using sequences, oh – and be aware of the issues mentioned above.
Hope you’ve enjoyed this entry, and I’m looking forward to your feedback.