In search of Better Business Keys…

i’m always on the lookout for ways to improve the understandability and readability of the data vault, especially when it comes to solving business user problems or issues.  as you know, the data vault model relies on the best possible business keys being chosen for hubs.  knowing your business data, understanding the business processes, and linking the two to the data warehouse all happen through business keys.  the more consistent the business keys are (across lines of business and through the business processes), the better the analytics will be, and the better the data vault model will be.  in this entry i will dive a little into the technical aspects of unique identification.  i will discuss md5, hashing in general, uniqueness, and sequence numbers.

get familiar with hashing…

let me just start by saying: if you are unfamiliar with hashing, or don’t know what md5, sha-1, or crc is, you should probably read up on it.  there are literally hundreds of sources for this information, but wikipedia has a great set of definitions and explanations.  for simplicity, though, i will offer this oversimplified definition (just for frame of reference and context).

md5, sha-1, and a few other algorithms are considered to be cryptographic hashes.  that is to say: they take a set of data and produce a mostly unique fingerprint or id for whatever is run through their mathematical algorithm.  the rfcs for these algorithms generally state that these functions have a probability of producing something called a collision.  a collision is when two different data values go through the mathematical algorithm and produce the same result.
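to make that concrete, here’s a quick python sketch (my own illustration – the vin-style key is made up) showing how a cryptographic hash fingerprints its input:

```python
import hashlib

# the same input always yields the same fixed-length digest ("fingerprint")
key = "1HGCM82633A004352"  # a made-up vin-style business key
md5_digest = hashlib.md5(key.encode("utf-8")).hexdigest()
sha1_digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
print(md5_digest)   # 32 hex characters = 128 bits
print(sha1_digest)  # 40 hex characters = 160 bits

# changing even one character produces a completely different digest
key2 = "1HGCM82633A004353"
print(hashlib.md5(key2.encode("utf-8")).hexdigest())
```

note the digest length is fixed no matter how long (or short) the input is – that fixed width is exactly where the collision risk comes from.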

crc is of a slightly different class, in that it’s not cryptographically safe.  it is in a class called a simple hash, and is known as a cyclic redundancy check (checksum).
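for comparison, here’s a sketch of a crc (my own example, using python’s zlib):

```python
import zlib

# crc32 produces only a 32-bit checksum -- good for detecting accidental
# corruption, but a far smaller space than md5's 128 bits, so collisions
# between different inputs are much more likely
checksum = zlib.crc32("1HGCM82633A004352".encode("utf-8"))
print(format(checksum, "08x"))  # 8 hex characters = 32 bits
```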

the business of business keys…

business keys are supposed to have meaning to the business user.  in a perfect world, they really should be decipherable by individuals who live and work in that industry.  for instance, vins (vehicle identification numbers) are really well known to those who work in the auto industry.  some individuals are so adept at reading these numbers that they can tell you what the sequences mean – decode the number for you.  for all the things the auto industry may not get right, they have one thing right: a consistent business key for their main product.  not only that, but it’s standardized around the world; it’s used in all manufacturers’ plants for “most” motorized vehicles, including motorcycles, cars, and buses.

this key doesn’t change when it goes from the sales system to the contracts system, it doesn’t change when it’s transferred from the contracts system to the manufacturing system, or any other system in the business.  it stays consistent once assigned.  this provides 100% traceability for the life of the vehicle, from inception all the way to “parts selling” in a junkyard somewhere.

we should be so lucky as to have business keys like this for all our business data, so that we could fingerprint it across all the systems.  as it is today, these keys change all the time, but that’s another blog entry for later this week.  let’s get back to hashing, business keys, and md5.

why can’t i use md5 as a primary key in the data vault/data warehouse?

i get this question all the time.  let me first say this: md5 is not infallible.  it is not guaranteed to produce uniqueness – it is guaranteed to fingerprint a given string of bits and produce a fixed-length result; in other words, to produce the same hash or number/result for the same input every time.  if you are looking for uniqueness in a primary key, you may get lucky – that is, until two things happen that increase the chances of collisions.

1) if the bit string you put into md5 is shorter than 128 bits (the length of the result of the hash function), you increase the chances of collision by multiple factors.  in fact, a simpler way to look at it is: the shorter the input, the greater the likelihood of producing the same number for two different input values.

2) the other possibility is the scale/size/scope of your edw.  the more rows you put into your data warehouse, the greater the chance of collision; again, see #1 for the initial principle.

for instance: suppose you have a 15 character string as a business key; you’ve already increased your chances of duplicates with rule #1 above.  then, add to that 110 million rows, and you’ve now increased your chances of duplicate keys for different values by rule #2.  the chances of md5(“ab”) = md5(“ba”) can be fairly high with large numbers of rows being processed – both of these are very short strings pushed into the algorithm.

if you have a finite set of rows that you know will never grow beyond 15 million or so, you might get away with no collisions and a great/speedy way of changing alpha-numeric business keys into fast, indexable primary keys.  or if you have a very large business key (say 500 characters or longer), maybe an image or a blob, then the massive length decreases the chances of duplicates being produced by md5 (more unique bits to hash).  under these circumstances, making md5 results a business key might just work.
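if you want to put rough numbers on the row-count risk, the standard birthday-problem approximation (my addition here, not a formula from any data vault standard) estimates the odds of at least one collision when n rows are hashed into a b-bit space:

```python
import math

def collision_probability(n_rows: int, hash_bits: int) -> float:
    """birthday-problem approximation: p ~ 1 - exp(-n^2 / (2 * 2^bits))"""
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is astronomically small
    return -math.expm1(-(n_rows ** 2) / (2.0 * space))

# compare hash widths for a 110-million-row warehouse
for bits in (32, 64, 128):
    p = collision_probability(110_000_000, bits)
    print(f"{bits:3d}-bit hash: p(collision) = {p:.3e}")
```

the takeaway: a 32-bit checksum is all but guaranteed to collide at that scale, while wider hashes push the odds down dramatically.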

right then.  i’m off to set md5 as my business keys…

whoa, slow down, wait a minute, you’re moving too fast.

remember this: the main purpose of a business key is to never change (legally) throughout the life of the data that it represents.

tying the business key to the business process, to the data flow in the business process, and also to the physical model will nearly always lead to success in any business intelligence or operational system.  it allows full traceability back to the source, and even across the sources.  otherwise horizontal pictures of the business cannot be created (but i ramble, and this too is for another blog entry coming up).

do not simply “decide” to set md5 as your business key and run off and implement it.  consider the implications first!

what is the issue we are trying to solve here?

the issue is really more of a two part question, one technical, one business.

technical: how do i “detect/lookup/identify” data quickly without all the fuss of “caching lookup tables” or joining to the data in the database?

business: how do i provide a value-added key that has meaning across the business, yet can be used in a technical solution?

ok – there’s one more hidden question for both parts: how can i keep the key from changing once it’s created?  this is the biggest question of them all.  if you’re using md5 against a “name” of some sort, then imagine what happens when the “name” changes…  your algorithm for “identifying” data in the edw creates a new hash key, and can’t trace it back or tie it back to the old data that is still marked as current in your edw!  again, you should only create md5 keys over limited data sets, and over keys large enough not to produce collisions.
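here’s a tiny sketch of that failure mode (the key-derivation rule and the names are hypothetical, purely for illustration):

```python
import hashlib

def hub_key_from_name(name: str) -> str:
    # hypothetical rule: derive the hub key by hashing a mutable "name"
    return hashlib.md5(name.strip().upper().encode("utf-8")).hexdigest()

key_before = hub_key_from_name("Jane Smith")
key_after = hub_key_from_name("Jane Jones")  # same person, after a name change

# the warehouse now sees a brand-new key with no tie back to the old record
print(key_before)
print(key_after)
```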

now, if you’re lucky enough to capture the name change and your system can be fed an old-name and a new-name, you can use a hierarchical link (many-to-many hierarchy) table to relate the old information.  we call this a same-as link in data vault speak.  now, i’m going to make a very strong statement and take a stand here – because we don’t live in a perfect world, and business keys that “never change” are usually a far-fetched reality.
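in code terms, a same-as link boils down to a mapping from every known business key to its chosen master key (the keys below are made up for illustration):

```python
# same-as link: duplicate/retired business keys resolved to one master key
same_as_link = {
    "987654321": "123456789",  # retired key -> surviving master key
}

def resolve_master(business_key: str) -> str:
    # a key with no same-as entry is already its own master
    return same_as_link.get(business_key, business_key)

print(resolve_master("987654321"))  # resolves to the master key
print(resolve_master("555000111"))  # unknown key: already its own master
```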

cdc (change data capture) is absolutely vital to the future of data warehousing and bi.  cdc will make a set of changing business keys a seamless operation for horizontal integration of data in the next-gen data warehouse.  cdc must be implemented on all future operational systems.

so is there a time when i can use md5 as a business key in my data vault / data warehouse?

strangely enough, the answer (as it turns out) is yes.  i would suggest the following situations are valid and applicable to potentially using md5 as a replacement for surrogates in the data vault: 1) an operational data vault, 2) a business data vault – where you may have control over the raw same-as links inside the raw data vault, and you are moving the data into a master data system.

wait a minute, did you just say that a business data vault is a master data system?  yes, i did, and if it’s built properly, it can be the foundations for your master data initiative (and a pretty good one at that).  but, more on business data vaults in yet another future blog post.

these are the two situations.  an operational data vault means that your application sits on top of the data vault warehouse directly, and therefore gives you complete control over maintaining/establishing business keys that should never change.  this means md5 will work consistently and repeatably for you over time.  operational data vaults also give you the option of building cdc directly into the application (thus allowing business keys to change over time – but be wary of this option; it leads to other problems in the future).

business data vaults rely on cleansed/cleaned and integrated data.  they rely on master business keys being chosen for the hubs.  that being the case, md5 should generate mostly unique surrogate keys across the board.  the other piece (if you’re lucky enough to have it or create it) is: cdc or audit-trail feeds are already stored inside the raw data vault and same-as links.  this makes it easy to choose a consistent master key for the business data vault.

just remember the final two tenets: the length of the business keys makes a difference – longer is better.  the number of business keys also makes a difference – a smaller number of keys is better (fewer chances of collision).

if you account for both of these, or develop a unique mitigation strategy for the second (so it can scale), then you’re home free.

by the way, some research is being done in global unique hashes (no, it’s not guid – guid is defined differently, don’t mistake the two).  if they ever get to a “perfect hash algorithm” that works, and make it public knowledge, i’ll be jumping on that band-wagon to test it out.

what are your experiences with business keys, hashes, and collisions?  what do you do in your data warehouse to make it all work?  i would love to hear from you.

dan linstedt
ps: don’t forget to check out one-on-one coaching!  there’s more in-depth study and hands-on examples of these solutions inside!


10 Responses to “In search of Better Business Keys…”

  1. Razvan 2010/10/07 at 8:29 am #

    Hi Dan,
    I am trying to build a Data Vault, and I have a table in my operational DB called ‘Person’, that is holding information about people; I am having trouble identifying a business key for the HUB Person, since I have no column that uniquely identifies a person besides the PersonID -> the primary key.
    Any advice?
    Thank you,

  2. sverzel 2011/05/03 at 3:37 am #


    Name, country and/or social security number should suffice. If you find you do not have this information, the source appears to be fundamentally lacking in detail. Hence, you cannot distinguish several records on any other information than a technically generated key — that key is very weak and cannot allow you to describe which person each record references.

    You might want to look into getting your hands on more information from the source that allows you to describe the Person.

    Stefan Verzel

  3. Ann Poindexter 2014/06/19 at 12:42 pm #

    We have 3rd party software that has 2 modules that we are putting into the Data Vault. We are trying to figure out our Hub approach on customer.

    One system (XYZ Accounting) uses a customer key of 3 fields (ex: SystemCd, ModuleCd, CustNbr)
    …and the other module(XYZ Transactions) uses a customer key of 1 field (CustomerNbr)
    ….How should that be modeled in a Hub?

    We only have one company, so we only will have ONE System Cd.
    We own 2 Modules, but only one of them requires that info in their key.
    Both represent customers.
    One table has rows of data that represents rows of the other (a subset).

  4. Mike 2014/10/23 at 10:43 am #

    Hi Dan,

    I have a situation where my natural key coming from our source system is a combination of a “customer number” field and a “customer number transferred to” field; however, the latter field is actually used as a recursive join to the customer number. A customer can also be transferred on more than one occasion. How do I manage this using a master key approach, even at the Business Data Vault layer, seeing as an MD5 generated from the 2 fields is not sufficient here?

    Thanks for your help!

  5. Dan Linstedt 2014/10/24 at 7:17 am #

    Hi Mike,

    Do you have a data set example? What is causing the transfer field to be filled in? I need to understand a little more about the situation before I can help with an answer.

    Dan Linstedt

  6. Mike 2014/10/24 at 2:43 pm #

    Hi Dan,

    Thanks for your reply. The transfer field is filled at the operational level. This situation pertains to a Member having a card, the card number being the customer number mentioned here above. A member can then receive another card (plastic) with a 2nd number from another enrollment source. The member accumulates points on both these cards.

    At some point, the member decides to call the help desk and merge his points onto one card and cancels the other one (hence the transfer)

    So essentially we have the following in the transaction system:

    Before Transfer:
    Customer Number | Transfer Customer Number | Name
    123456789       | NULL                     | Dan Lindsted
    987654321       | NULL                     | Dan Lindsted

    After Transfer:
    Customer Number | Transfer Customer Number | Name
    123456789       | 987654321                | Dan Lindsted
    987654321       | NULL                     | Dan Lindsted

    We want to build a master key for Dan Lindsted. How can I manage that in a hub, or do I need to manage it prior to getting to the hub?

    Hope I’m clear

    Thanks again,

  7. Dan Linstedt 2014/10/24 at 3:08 pm #

    Hi Mike,

    Thanks for clarifying. In this case, I would model it this way:

    Hub_Customer (Hub_Cust_Sequence, Hub_Cust_Load_Date, Hub_Cust_Rec_Source, Cust_Number)
    Sat_Customer (Hub_Cust_Sequence, Sat_Load_Date, Sat_Load_End_Date, Sat_Rec_Source, Customer_Name)

    SALink_Transfer (SALink_Sequence, SALink_Load_date, SALink_Rec_Source, Master_Cust_Sequence, Transfer_Cust_Sequence)
    Sat_Effectivity_Transfer (SALink_Sequence, Sat_Load_Date, Sat_Load_End_Date, Sat_Rec_Source, Date_Customer_Transferred, Points_Transferred)

    This would accommodate a hierarchical link structure to take care of the “1 level” hierarchy, of transferring points from one card to another, both belonging to the same customer. This would also accommodate customer name changes (if that happens).

    The “MASTER” key is indicated in the Same-As Link. This concept is covered in detail in my BootCamp training, and in my book: Super Charge Your Data Warehouse.

    Hope this helps,
    Dan L

  8. Mike 2014/10/24 at 3:34 pm #

    Hi Dan,

    I’m still waiting for your book to come in but I’ll check it out in the electronic version.

    Thanks a bunch for your time and help. I’ll try out this solution and keep you posted!

    Best Regards,

    P.S. I’ll be looking into your Boot Camp training!

  9. Mike 2014/10/28 at 9:02 am #

    Hi Dan,

    Just tested the solution and it works beautifully!

    Thanks again for your help, it is much appreciated!


  10. Dan Linstedt 2014/11/06 at 8:31 am #

    Glad to hear you got it all working. Yes, the new book will be out some time next year (depends on Morgan Kaufmann publishers).

