have you found yourself in a situation where a natural world or business key is not available? we all have. what do you do in this case? how do you construct a hub? is there a way in which to still build the data vault? all of these are valid questions, and over the next few months i’ll try to answer them. let’s see though if we can’t offer a few suggestions in this posting.
as we teach in the certification class (http://datavaultalliance.com) you will find that if you don’t have a business key to construct a proper hub, that you need to rely on several other techniques. these techniques are listed in a progessive order below:
- look for multiple fields that construct as much of a unique key as possible – then try to determine if these fields are actually descriptive in nature. if they are character based fields, they have a higher chance of being “fat-fingered” or entered wrong in the application than if they are numeric or date / time fields.
- still can’t find a key? look for a source system numeric identifier – yes, unfortunately a source system sequence number. but try to ensure that the application programmers have “exposed” this un-natural key to the business, if they have (either on a report or on an application screen) then their is a good chance that business users are actually using this field as a business key. if these source systems have not exposed these numbers to the business (they shouldn’t) , then there is a risk that these sequences may be “re-assigned” during full restore of the source database. this would wreak havoc on your data warehouse (data vault or not) unless you have cdc (change data capture) on the source system.
- still can’t find even a numeric key? then you are out of luck… my question to you then & to the business is: how do you identify unique rows? what’s the business process for selecting a single row out of multiples for editing purposes? identify the business process, and try to establish a mechanical method for choosing unique rows. by mechanical i mean: something the machine recognizes, that doesn’t need human intervention/human decision making.
ok – so this is a brief list of things you can try. the risk of using numerical sequences from the source is potentially huge. especially when cross-integrating source data into the same hub.. however, this is somewhat mitigated (same key means different things) by separating satellites by source system feed.
at the end of the day, there are more ideas around business keys, and surrogate keys that are taught in the class. ideas i don’t have time to re-iterate here now, but maybe in the future.
this can cause huge confusion when you go to get data out of the data vault. all i can say here is: document this process well. ensure you’ve documented the gap between “what the business believes is a unique business key” and “what the source systems tell you is the real data”
hope this helps, have thoughts? wish to contribute a solution? love to hear from you.