i’m always on the lookout for ways to improve the understandability and readability of the data vault, especially when it comes to solving business user problems or issues. as you know, the data vault model relies on the best possible business keys being chosen for hubs. knowing your business data, understanding the business processes, and linking the two to the data warehouse, all happens through business keys. the more consistent the business keys are (across lines of business and through the business processes), the better the analytics will be, and the better the data vault model will be. in this entry i will be diving in a little bit to the technical aspects of unique identification. i will discuss md5, hashing in general, uniqueness, and sequence numbers.
get familiar with hashing…
let me just start by saying: if you are unfamiliar with hashing, or don’t know what an md5, sha-1, or crc is, you should probably read up on it. there are literally hundreds of sources for this information, but wikipedia has a great set of definitions and explanations. although for ease of simplicity, i will offer this over simplified definition (just for frame of reference and context).
md5, and sha-1, and a few other algorithms are considered to be cryptographic hashes. that is to say: they encrypt data, and then provide a mostly unique finger-print or id for the data set that it ran through it’s mathematical algorithm. the rfc’s for these algorithms generally state that these functions have a probability of producing something called a collision. the collision is when you have two different data values go through the mathematical algorithm, and produce the same result.
crc is of a slightly different class, in that it’s not cryptographically safe. it is in a class called a simple hash, and is known as a cyclical redundancy checksum.
the business of business keys…
business keys are supposed to have meaning to the business user. in a perfect world, they really should be decipherable by individuals who live and work in that industry. for instance, vin (vehicle identification number) numbers are really well known to those who work in the auto industry. some individuals are so adept at reading these numbers they can tell you what some of the sequences mean – define the number for you. for all the things the auto industry may not get right, they have one thing right: a consistent business key for their main product. not only that, but it’s standardized around the world; it’s used in all manufacturers plants for “most” motorized vehicles, including motorcycles, cars, and busses.
this key doesn’t change when it goes from the sales system to the contracts system, it doesn’t change when it’s transferred from the contracts system to the manufacturing system, or any other system in the business. it stays consistent once assigned. this provides for 100% traceability for the life of the vehicle from inception to even “parts selling” in a junk yard somewhere.
we should be so lucky as to have business keys for all our business data, that we would be able to finger print it across all the systems. as it is today, these keys change all the time, but that’s another blog entry for later this week. let’s get back to hashing and business keys, and md5.
why can’t i use md5 as a primary key in the data vault/data warehouse?
i get this question all the time. let me first say this: md5 is not infallable, it is not garaunteed to produce uniqueness – it is garaunteed to fingerprint and provide an encrypted set length of data for a given string of bits. in other words, produce the same hash or number/result for the same input every time. if you are looking for uniqueness in a primary key, you may get lucky – that is until two things happen that increase the chances of collisions.
1) if the bit string you put in to the md5 is shorter than 128 bits (the length of the result of the hash function), you increase the chances of collision by multiple factors. in fact, a more simple way to look at it is: the shorter the input, the greater the likelyhood of producing the same number for two different input values.
2) the other possibility is: scale/size/scope of your edw. the more rows you put in to your data warehouse, the greater the chance of collision, again: see #1 for the initial principle.
for instance: suppose you have a 15 character string as a business key, you’ve already increased your chances of duplicates with rule #1 above. then, add to that: 110 million rows, you’ve now increased your chances of duplicate keys for different values by rule #2. the chances of md5(“ab”) = md5(“ba”) can be fairly high with large numbers of rows being processed, these both are very short strings pushed in to the algorithm.
if you have a finite set of rows that you know will never grow beyond 15 million or so, you might get away with no collisions and a great/speedy way of changing alpha-numeric business keys into fast indexable primary keys. or if you have a very large business key (say 500 characters or longer) maybe an image, or a blob, then the massive length decreases the chances of duplicates being produced by md5 (more unique bits to hash). under these circumstances making md5 results a business key might just work.
right then. i’m off to set md5 as my business keys…
woaah, slow down, wait a minute, you’re moving too fast.
remember this: the main purpose of a business key is to never change (legally) throughout the life of the data that it represents.
tying the business key to the business process, to the data flow in the business process, and also to the physical model will nearly always lead to success in any business intelligence or operational system. it allows full traceability back to the source, and even across the sources. otherwise horizontal pictures of the business cannot be created (but i ramble, and this too is for another blog entry coming up).
do not simply “decide” to set md5 as your business key and run off and implement it. consider the implications first!
what is the issue we are trying to solve here?
the issue is really more of a two part question, one technical, one business.
technical: how do i “detect/lookup/identify” data quickly without all the fuss of “caching lookup tables” or joining to the data in the database?
business: how do i provide a value-added key that has meaning across the business, yet can be used in a technical solution?
ok – there’s one more hidden question for both parts: how can i keep the key from changing once it’s created? this is the biggest question of them all. if you’re using an md5 against a “name” of some sort, then then imagine what happens when the “name” changes… your algorithm for “identifying” data in the edw, creates a new hash key, and can’t trace it back or tie it back to the old data that is still marked as current in your edw!! again, you should only create md5 keys in limited data sets, and large enough keys that won’t contain collisions.
now, if you’re lucky enough to capture the name change and your system can be fed an old-name and the new-name, you can use a hierarchical link (many-to-many hierarchy) table to relate the old information. we call this a same-as-link in data vault speak. now, i’m going to make a very strong statement and take a stand here – because we don’t live in a perfect world, and keeping business keys that “never change” is usually a far-fetched reality.
cdc is absolutely vital to the future of data warehousing and bi. cdc will make a set of changing business keys become a seemless operation for horizontal integration of data in the next-gen data warehouse. cdc must be implemented on all future operational systems.
so is there a time when i can use md5 as a business key in my data vault / data warehouse?
strangely enough, the answer (as it turns out) is yes. i would suggest the following situations are valid and applicable to potentially using md5 as a replacement for surrogates in the data vault: 1) operational data vault, 2) business data vault – where you may have control over the raw same-as links inside the raw data vault, and you are moving the data in to a master data system.
wait a minute, did you just say that a business data vault is a master data system? yes, i did, and if it’s built properly, it can be the foundations for your master data initiative (and a pretty good one at that). but, more on business data vaults in yet another future blog post.
these are the two situations. operational data vaults mean that your application sits on top of the data vault warehouse directly, and therefore gives you complete control over maintiaining/establishing business keys that should never change. this means md5 will work consistently and repeatably for you over time. operational data vaults also give you the option of building cdc directly in to the application (thus allowing business keys to change over time – but be wary of this option, it leads to other problems in the future).
business data vaults rely on cleansed/cleaned and integrated data. they rely on master-business keys being chosen for the hubs. that being the case, md5 should generate mostly unique surrogate keys across the board. the other piece, (if you’re lucky enough to have it or create it) is: cdc or audit-trail feeds are already stored inside the raw data vault and same-as links. this makes it easy to choose a consistent master key for the business data vault.
just remember the final two tennants: length of the business keys make a difference, longer is better. number of business keys also makes a difference – smaller number of keys is better (less chances of collision).
if you account for both of these, or develop a unique mitigation strategy for the second (so it can scale), then you’re home free.
by the way, some research is being done in global unique hash (no, it’s not guid – guid is defined differently, don’t mistake the two). if they ever get to “perfect hash algorithm” that works, and make it public knowledge, i’ll be jumping on that band-wagon to test it out.
what are your experiences with business keys, hashes, and collisions? what do you do in your data warehouse to make it all work? i would love to hear from you.
ps: don’t forget to check out one-on-one coaching! there’s more in-depth study and hands-on examples of these solutions inside!