Sticky Areas, Sweet Solutions in Data Vault Modeling

First off, happy new year to everyone. This post dives into two of the stickiest problem areas for Data Vault modelers and suggests some basic, workable solutions for each. I encourage you to post your own solutions as comments at the bottom of this entry. The two most troublesome areas to model are addresses and people.

The problems…

Both of these topics share the same basic problem: lack of “real-world” business keys.  Before you jump on me and tell me you have business keys for these, let me explain…

Issues with addresses…

Addresses consist mostly of non-unique data. The issue isn't that the real-world address itself is non-unique; the issue is how the address is represented in computer systems (and entry applications), which makes it non-unique. In general, a physical place usually has a single address, including mail stops in large corporate buildings. However, getting a clean copy of all those addresses from all those systems into a master data set is the tough part.

Yes, if you are going to assign unique business keys to addresses, you must begin working with master data for them.

So what is the problem? Unfortunately, the problem lies in the manner in which addresses are entered into the systems. In other words:

  • RD and ROAD may mean the same thing
  • WY and W and WAY may mean “way”
  • S, ST, STR, STREET all mean the same thing
  • Sometimes, WY and RD are used when STREET is meant (or correct)

You get the idea. The first problem with addresses is the entry of non-standardized data sets. Basically, the address entry screens are free-form and allow each individual in the company to enter anything they please. The result is duplicate address entries that differ in standard form one way or another; in other words, from a data perspective they are unique, but from a human perspective they are the same address.

So, address standardization is a huge issue that must be corrected before any sort of business key can be applied.
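
To make this concrete, here is a minimal Python sketch of suffix standardization. The synonym map and function name are purely illustrative; real cleansing tools use full USPS-style dictionaries plus context (for example, to tell "W" meaning West from "W" meaning Way):

```python
import re

# A naive suffix-standardization sketch. The synonym map below is purely
# illustrative; production cleansing uses full USPS-style dictionaries and
# contextual rules, not a trailing-token lookup.
SUFFIX_SYNONYMS = {
    "RD": "ROAD",
    "WY": "WAY", "W": "WAY",
    "S": "STREET", "ST": "STREET", "STR": "STREET",
}

def standardize_address(raw: str) -> str:
    """Upper-case, strip punctuation, and normalize the trailing suffix token."""
    tokens = re.sub(r"[^\w\s]", "", raw.upper()).split()
    if tokens and tokens[-1] in SUFFIX_SYNONYMS:
        tokens[-1] = SUFFIX_SYNONYMS[tokens[-1]]
    return " ".join(tokens)

# Three "unique" source rows collapse to one standardized form:
for raw in ("123 Main St.", "123 MAIN STREET", "123 main str"):
    print(standardize_address(raw))  # 123 MAIN STREET, three times
```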

The next issue with addresses is spelling errors and address corrections, and don't forget corrections to latitude and longitude. So even if you standardize the addresses, they will still have differences across multiple source applications. Furthermore, I've seen incorrect latitude and longitude readings, which complicate matters further.

So you say: well, we will just use an address cleansing / householding solution. To that I say great, wonderful, and if your business runs on addresses, even better. Hopefully the solution you choose will provide you with two things:

  1. The "as-was" (before cleansing) address and the "as-is" (post-cleansing) address. Without these two address sets you do not have traceability / auditability back to the source.
  2. The "assigned address key", which should stay consistent every time the address cleansing software is run (see the sketch after this list).
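
As a rough sketch of the record shape you want back from such a tool (the field names here are hypothetical, not any vendor's actual API):

```python
from dataclasses import dataclass

# A sketch of the two address images plus the assigned key (hypothetical
# field names, not a vendor API).
@dataclass(frozen=True)
class CleansedAddress:
    as_was: str                # raw source value, untouched (auditability)
    as_is: str                 # standardized, corrected value
    assigned_address_key: str  # must be identical on every cleansing run
```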

The services that I'm aware of that do this kind of thing for international addresses include MelissaData; if I'm not mistaken, even the US Postal Service outsources to them.

Yes, after cleansing, after standardization – then and only then could you think to assign a proper business key.

Issues with People…

Unique identification of people in a general sense goes against the very grain of my beliefs. However, in business, as an employee or a part of a company, it becomes important; otherwise, analytics go haywire. Imagine if I owned a telephone company and couldn't identify my customers uniquely? Wow, that would be a mess.

As it turns out, it IS a mess. It's a huge mess. Employees (and sometimes contractors), on the other hand, are generally assigned numbers that are associated with all the things they do within the corporation, so they are easier to track; and of course, the business key becomes the identifier that is assigned.

Now with customers, people visiting a free medical clinic, people in a hospital, or prospects on a web site, the problem is exponentially more difficult. With data mining (and some basic location information), you can get a pretty decent result. However, keep in mind that running the mining algorithm (if it's a neural net instead of a heuristic / statistically driven approach) may produce a different "key" for the same inputs as it learns new things.

So, neural networks are good for “possible unique identification” but bad for “absolute business key assignment”.
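
For contrast, here is a minimal sketch of a heuristic (deterministic) match score using only Python's standard library; unlike a continuously learning net, the same pair of inputs always yields the same score:

```python
from difflib import SequenceMatcher

# A deterministic (heuristic) match score: the same two inputs always
# produce the same score, run after run, which is what a repeatable
# key-assignment process needs.
def match_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

# Always prints the same value, no matter when or how often it runs.
print(match_score("Jon Smith, 123 Main St", "John Smith, 123 Main Street"))
```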

So putting together a “people HUB” becomes a bad idea.

Ok, these are the issues, where are the solutions?

Really, this is the crux of the problem with both of these "sticky spots": nearly every product and every service that businesses engage in can, should, or does have an identifiable business key. Addresses and people, by contrast, usually do not.

Solution #1 is the easiest of all: put address and "people" records in Satellites across your data model. Yes, you will have replication of data sets; yes, you will increase the number of Satellites; BUT this information will be tied to a real business key that is auditable and has a trail. AND you can then mine the data, use same-as links, and standardize/correct the data to produce results for your Business Vault.

This will give you an “as-was (raw auditable)” and “as-is (business)” image of both elements.  It will keep you from getting hung up on trying to assign business keys for “people” identifiers when you don’t have one coming in from the source applications.
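
A minimal sketch of what a Solution #1 satellite row might look like (all names are illustrative, not prescribed structures), with person attributes riding on an already-identified business key instead of a "people hub":

```python
from dataclasses import dataclass
from datetime import datetime

# A sketch of a person-descriptor satellite hung off an existing, real
# business key (an order number here, purely as an example). The person
# data is context on that key, not an identity of its own.
@dataclass(frozen=True)
class SatOrderPersonDetails:
    order_number: str    # the auditable business key from the source (hub key)
    load_dts: datetime   # standard satellite load timestamp
    record_source: str   # which source system delivered the row
    full_name: str       # descriptive person attributes follow...
    gender: str
    birth_date: str      # ...kept as attributes, not as a business key
```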

Solution #2 is far more difficult. For addresses: use latitude, longitude, and altitude (for high-rise buildings), then office number or mail stop. MAKE SURE your latitude, longitude, and altitude coordinates are correct! Otherwise, you will have a nightmare on your hands IF these change but the address stays the same.
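
A minimal sketch of such a coordinate-based key (the delimiter and rounding precision are my own assumptions; the precision you pick defines how close two points must be to count as the same place):

```python
# A coordinate-based composite address key (delimiter and rounding are
# assumptions). Six decimal places of latitude/longitude is roughly 0.1 m,
# so two readings of the same door should collapse to the same key, but
# only if the coordinates themselves are correct.
def address_key(lat: float, lon: float, alt_m: float, mail_stop: str = "") -> str:
    return f"{lat:.6f}|{lon:.6f}|{alt_m:.1f}|{mail_stop.strip().upper()}"

print(address_key(40.748817, -73.985428, 381.0, "fl86-12"))
# -> 40.748817|-73.985428|381.0|FL86-12
```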

For people: there is NO good solution. But if you MUST, then use full name, address, gender, and, if you have it, Social Security number (or something equivalent). Just remember: if any of these things change, you will be creating a new business key, and you will be forced to mine the data post-loading to try to "paste it together again" in a same-as link structure.
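
If you do go down that road, a hashed composite key is one common way to build it. This sketch assumes exactly those four attributes and simple normalization; note how changing any one of them silently produces a brand-new key, which is exactly the fragility described above:

```python
import hashlib

# The "if you MUST" composite person key as a hash (an illustrative
# sketch). Any change to any component (a married name, a move, a
# corrected SSN) mints a NEW key, forcing same-as links to repair
# history after loading.
def person_key(full_name: str, address: str, gender: str, ssn: str = "") -> str:
    normalized = "|".join(p.strip().upper() for p in (full_name, address, gender, ssn))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

before = person_key("Jane Doe", "123 MAIN STREET", "F", "123-45-6789")
after = person_key("Jane Smith", "123 MAIN STREET", "F", "123-45-6789")
print(before == after)  # False: a name change created a "new" person
```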

Hope this helps a little. I would love your thoughts on the matter.

Dan Linstedt


12 Responses to “Sticky Areas, Sweet Solutions in Data Vault Modeling”

  1. roland.bouman 2013/01/07 at 1:09 pm #

    Dan,

    interesting points! I’m still a bit confused though. You wrote:

    “So putting together a “people HUB” becomes a bad idea.”

    and

    “Solution #1 is the easiest of all: Put addresses and “people” records as Satellites across your data model.”

    and

    “you can then […] use same-as links,”

    Sticking to your example of “customers, or people visiting a free medical clinic, or people in a hospital, prospects on a web site”, are you suggesting each of these would get a separate hub, and each a satellite to hold name, gender and birthdate data? And then, to add “same-as” links between these hubs to assert identity?

    If yes, then what key should be used for those hubs? It seems to me the fundamental problem – choosing the key – is irrespective of the number of hubs. In addition, I think it’s safe to say the matching process to discover the same-as links is likely to result in some mismatches. Would it be ok to delete those link records? Or should same-as links also never be deleted?

    Also, with regard to addresses – it seems to me that even if the data was collected correctly in the source system, one can never guarantee immutability of addresses. For instance, what was once Berlin was divided into East and West Berlin, and then, in due time, re-joined again. Still, it seems likely that for, say, a package delivery service, the address really is whatever the business uses to identify a package destination. Wouldn't this suggest that sometimes, it is possible to have a mutable business key? Or would it in your opinion make more sense to say that sometimes, a business key is not the appropriate means for identification, and that a surrogate key serves the EDW better?

    kind regards,

    Roland

  2. dlinstedt 2013/01/07 at 1:28 pm #

    Hi Roland,

    What I am suggesting is:

    a) Hubs for people is a difficult proposition, unless you have a master data source, or a source application that already attaches a unique business key, or you have a people "cleansing" process that puts the information together in a source system / system of record.
    b) Otherwise, hubs for people should be avoided. Instead, opt for putting your people information in as Satellite data.

    Then, use your cleansing/merging software to produce your master data going from your raw DV to your business DV.

    I am not suggesting that each of these get their own separate hub and satellite. I am suggesting that you attach the people to a different, already identified business key. Let the “people” be attribute descriptors rather than based on their own business key.

    I am also suggesting that putting together a business key for people is very difficult if all the sources don’t share a consistent key. Not to mention the simple act of identifying people is just plain difficult at best.

    Yes, it is safe to say that the process that matches same-as may produce invalid results. In fact, what you need to record in those same-as links is the confidence and strength coefficients of the resulting match. That way, you can delete the ones that are truly not good. This same-as link is a part of the logical Business Vault, therefore, it is to be managed as such (post business rule results).
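
    A minimal sketch of such a same-as link row, with the coefficient fields included (the names are illustrative, not a prescribed structure):

    ```python
    from dataclasses import dataclass
    from datetime import datetime

    # A sketch of a same-as link row carrying the match coefficients
    # described above (field names are illustrative). Because this lives
    # in the Business Vault (post business rules), rows whose confidence
    # falls below a chosen threshold may be pruned.
    @dataclass(frozen=True)
    class LinkSameAsPerson:
        master_key: str      # the record chosen as the survivor
        duplicate_key: str   # the record asserted to be the same person
        confidence: float    # e.g. 0.0-1.0 from the matching engine
        strength: float      # how much evidence backed the match
        load_dts: datetime   # when the match was recorded
        record_source: str   # which matching process produced the link
    ```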

    Yes, it is always possible to have mutable business keys – this is the nature of business keys. To say otherwise would be nuts. Mutable business keys are a fact of life, hence the definition of hubs: business keys have a low propensity to change… (part of the standards doc). Anyhow, the idea is that addresses generally are prone to far more errors and far more duplication than most other “business keys”, so constructing a business key for address is just plain hard to begin with.

    Now, if your business (or your customers’ business) makes money off of addresses (like postal services), then they better have a source system application that cleans them up BEFORE they reach the data warehouse. In other words, a master-data system for addresses *should* be priority #1 for that particular business, and assigning business keys *at that point, in that master source* should be part of their business processes.

    No, the discussion does not make the point that surrogates are better for addresses; in fact, the opposite is what I am suggesting. I NEVER try to construct "surrogate made-up keys" for addresses in Data Vaults. It just is not a good idea, and it will come back to bite you later in the implementation / life cycle of the EDW. What I am suggesting is, as just stated: either find a way to get a business key FROM the source, or leave it alone and post-process it on the way to the Business Data Vault.

    Hope this helps,
    Dan Linstedt

  3. roland.bouman 2013/01/07 at 1:53 pm #

    Dan, thanks for the reply – I really appreciate the insights.

    Roland

  4. dlinstedt 2013/01/07 at 2:41 pm #

    You are welcome. I appreciate the questions 🙂

    Dan Linstedt

  5. Raphael Klebanov 2013/01/08 at 12:42 pm #

    Thank you Dan and Roland for an interesting discussion. Your second piece, Dan, clarifies a couple of notes I had.
    Anyway, my initial approach is to make the best effort to create a "master" list of people or addresses that includes distinct business keys that I gather from various sources. This is done in a pre-DV staging area. Then, I'd use this list to create a "master hub" with multiple satellites in the vault, ideally making use of atomic-level info such as SSN or 3D coordinates. But since that is almost never available, I'd use the best available attributes to identify the unique keys. Of course, there is always a risk of adding another source with "unconventional" business key structures later; the decision has to be made accordingly. Having such a "master" hub, I found it easier to use it (or views) along with its satellites to build a presentation layer down the road. IMO, the ease of building/rebuilding a presentation layer is worth some work spent at the foundation layer (DV).
    Thanks,
    –Raphael

  6. dlinstedt 2013/01/08 at 12:49 pm #

    Remember: ANY time you change data UPSTREAM of the EDW / DV, it had better become a system of record, otherwise auditability is compromised. Master Data or Master Systems should be a source to the EDW anyhow.

    Just remember, to always keep the PRE/before modified and POST/after modified information, so that audits can see and trace how this data got transformed.

    Be warned, applying business rules on the way in to the raw data vault is neither advocated, nor condoned.

    Cheers,
    Dan Linstedt

  7. DM Unseen 2013/01/09 at 4:50 am #

    I will note the following,

    In the Netherlands we now have the BAG (basic registration of addresses and buildings), which is a government-sponsored basic registration service that contains ALL addresses and buildings and assigns them a public 'immutable' technical id. It is a publicly accessible data service, so there is currently no excuse for NOT having a Dutch address HUB 😉

    For persons, this will always be AT BEST a hub with a context of, say, country residents (you can use the country's SSN for that).

  8. dlinstedt 2013/01/09 at 6:23 am #

    Hi Martijn,

    Thank-you for the insights, but please remember the following points:

    a) When I build Data Vaults, I try to set standards for international data storage. IF your Data Vault is country specific, then the approach you speak of may work. If, however, your Data Vault houses international data, then address keys from a specific country may constitute their own Hubs or, as I have mentioned, the only international approach is to put all address information into Satellites, and key them on the way into Business Vault tables.
    b) With persons, again, the same thing applies – some countries don’t have unique identifiers or even SSN.

    My point is: when building Data Vaults or Enterprise Data Warehouses it is extremely important to think globally (unless you know for a fact that the data set will never house global data). Too many people forget this point, and when that happens, it leads to re-engineering when the company DOES go global.

    Cheers,
    Dan Linstedt

  9. DM Unseen 2013/01/09 at 10:32 am #

    Hi Dan,

    Global vs Local Data Vaults,

    This is indeed something to be aware of, although a lot of DV’s over here are mainly local and cannot/will not go global.

    The problem is that even in global Data Vaults you might want to use/require country specific knowledge to gain DQ in your DV. This means that both the global and local perspective need to be available/maintained in your DV. This is important for most Dutch DV’s I know of. It is typically European to always honor all local data rulings (and that is something basic generalization destroys), which IMO is different than US thinking about data (models).

    Some of these issues can be resolved in a Data vault using specific entity generalization and specialization rules, but it is something that requires attention, knowledge and modeling skills.

  10. dlinstedt 2013/01/09 at 1:20 pm #

    Excellent, we are in agreement on these points, thanks for the clarifications.

  11. Jonathan Shaltz 2017/02/13 at 10:33 am #

    Forgive me if I’m restating what may be obvious, but regarding solution #1, does that mean that a sales database (for example) would have no Customers hub, but rather the Orders hub would have a Customers satellite, with fields such as CustomerID, FirstName, LastName, StreetAddress, etc.? I’d be concerned about the size of this table, as each order would have an entry in this table; if typical customers place many orders, as in an industrial setting, your SatCustomers table could be 1000x larger than it needs to be.

    If my understanding is correct, how would you record customers who do not yet have any orders? Would you allow NULL in the satellite’s HubKey field, or use a dummy Hub record representing “no orders yet,” perhaps?

  12. Dan Linstedt 2017/02/27 at 2:00 pm #

    Not quite the right understanding, and not quite the right nature of the post.

    The real problem? It is a business problem.
    Data quality and synchronization of addresses should be addressed in the SOURCE systems, especially if addresses are that important to your business. They should "never" be addressed in the warehouse. Address cleansing is a BAND-AID for a business problem that needs to be fixed upstream.

    That said, if you have multiple systems delivering multiple addresses, then they need to be represented in the warehouse. How do you do this if the addresses themselves HAVE NO REAL BUSINESS KEY??? The answer lies in the blog post.

    That said, the same question goes for identifying people and corporations: HOW do you identify (uniquely and consistently over time) a SINGLE person or corporation? Especially when their name changes? That also is part of the intent of the blog post.

    The blog post is not about size, or scalability, and if you are struggling with that aspect of the data or the model, then you are focusing on the wrong problem. The focus really needs to be turned back to the organization to solve the business problem.

    The good news is that my friend Nols Ebersohn will be presenting this information at http://wwdvc.com 2017, especially when he talks about Data As An Asset combined with Data Quality.

    Furthermore, 1000x is quite a multiplier. If your business really has this problem, they are hemorrhaging money on the source system, and I would have to ask the CFO why they haven’t addressed it yet in the source systems?

    Technically speaking: make your Raw DV insert-only and turn on compression. That should solve this problem, with something like a 70% reduction in the "duplication" of data, as it were.
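
    For the insert-only part, the usual trick is a hash-diff comparison on the descriptive attributes, so an unchanged row is never re-inserted. A minimal sketch (my own illustration, not a prescribed implementation):

    ```python
    import hashlib
    from typing import Optional, Tuple

    # Hash-diff change detection for an insert-only satellite (sketch).
    # Only rows whose descriptive attributes actually changed get
    # inserted, so repeated identical customer rows collapse to one entry.
    def hash_diff(*attributes: str) -> str:
        normalized = "|".join(a.strip().upper() for a in attributes)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def should_insert(new_attributes: Tuple[str, ...],
                      latest_hash_diff: Optional[str]) -> bool:
        # Compare against the newest hash-diff already stored for this key;
        # None means no row exists yet, so the new row always inserts.
        return hash_diff(*new_attributes) != latest_hash_diff

    # Example: a second, identical delivery of the same customer row
    # produces the same hash-diff and is skipped.
    row = ("JOHN SMITH", "123 MAIN STREET", "M")
    print(should_insert(row, None))               # True  -> insert
    print(should_insert(row, hash_diff(*row)))    # False -> skip
    ```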

    Hope this helps,
    Dan Linstedt
