first off, happy new year to everyone. this post dives in to two of the most sticky problem areas for data vault modelers, and suggests some basic workable solutions for these. i encourage you to post your solutions as comments to the bottom of this entry. the two most troublesome areas to model are: addresses and people.
both of these topics share the same basic problem: lack of “real-world” business keys. before you jump on me and tell me you have business keys for these, let me explain…
issues with addresses…
addresses consist mostly of non-unique data. the issue isn’t that the real-world address itself is non-unique, the issue is how the address is represented in computer systems (and entry applications) that make it non-unique. in general, a physical place usually has a single address, including mail-stops in large corporate buildings. however, getting a clean copyof all those addresses from all those systems in to a master data set is the tough part.
yes, if you are going to assign unique business keys to addresses you mustbegin working with master data for them.
so what is the problem? unfortunately the problem is in the manner in which the addresses are entered in to the systems. in other words:
- rd and road may mean the same thing
- wy and w and way may mean “way”
- s, st, str, street all mean the same thing
- sometimes, wy and rd are used when street is meant (or correct)
you get the idea, the first problem with addresses is: entry of non-standardized data sets. basically the address entry screens are free-form, and allow each individual in the company to enter anything they please. thus, resulting in duplicate address entries that differ in standard form one way or another – in other words from a data perspective they are unique, but from a human perspective, they are the same address.
so, address standardization is a huge issue that must be corrected before any sort of business key can be applied.
the next issue with addresses is: spelling errors, and address corrections, and don’t forget corrections to latitude and longitude. so even if you standardize the addresses, they will still have differences across multiple source applications. furthermore, i’ve seen incorrect latitude and longitude ratings which complicate matters further.
so you say, well – we will just use an address cleansing / house holding solution. to that, i say great, wonderful – and if your business runs on addresses, even better. hopefully the solution you choose will provide you with two things:
- the “as-was” (before cleansing) address, the “as-is” address (post-cleansing). without these two address sets you do not have traceability / auditability back to the source
- the “assigned address key” – which should stay consistent every time the address cleansing software is run.
the services that i’m aware of that do this kind of thing for international addresses include melissadata. i believe the us postal service even outsources to them (if i’m not mistaken).
yes, after cleansing, after standardization – then and only then could you think to assign a proper business key.
issues with people…
unique identification of people in a general sense goes against the very grain of my beliefs. however, in business – as an employee, or a part of a company, it becomes important. otherwise, analytics go haywire. imagine if i owned a telephone company and couldn’t identify my customers uniquely? wow, that would be a mess.
as it turns out, it is a mess. it’s a huge mess. employees (and sometimes contractors) on the other hand are generally assigned numbers that are associated with all the things they do within the corporation, so they are easier to track – and of course, the business key becomes the identifier that is assigned.
now with customers, or people visiting a free medical clinic, or people in a hospital, prospects on a web site, the problem is exponentially more difficult. with data mining (and some basic location information), you can get a pretty decent result. however keep in mind that running the mining algorithm (if its a neural net instead of a heuristic / statistically driven space) may produce a different “key” for the same inputs based on learning new things.
so, neural networks are good for “possible unique identification” but bad for “absolute business key assignment”.
so putting together a “people hub” becomes a bad idea.
ok, these are the issues, where are the solutions?
really, this is the crux of the problem with both of these “sticky spots”. nearly every product, and every service that business engage in can or should or does have an identifiable business key.
solution #1 is the easiest of all: put addresses and “people” records as satellites across your data model. yes, you will have replication of data sets, yes you will increase the number of satellites, but this information will be tied to a real business key that is auditable and has a trail. and you can then mine the data, and use same-as links, and standardize/correct the data to produce results for your business vault.
this will give you an “as-was (raw auditable)” and “as-is (business)” image of both elements. it will keep you from getting hung up on trying to assign business keys for “people” identifiers when you don’t have one coming in from the source applications.
solution #2: is far more difficult. for addresses: use latitude, longitude, and altitude (high rise buildings, then office number or mail stop). make sure your latitude and longitude and altitude coordinants are correct! otherwise, you will have a night mare on your hands if these change, but the address stays the same.
for people: there is no good solution. but if you must, then – use full name, address, gender, and if you have it: social security number (or something equivalent). just remember, if any of these things change, you will be creating a new business key – and you will be forced to mine the data post-loading to try to “paste it together again” in a same-as link structure.
hope this helps a little, would love your thoughts on the matter.