Q&A About Hubs

i receive many questions via e-mail.  i will begin answering them in the coaching area for my students.   i’ve decided to post this q&a from a student of an upcoming certification class here.   as i said yesterday, i am not currently involved in certification.

question:

my question is about the definition of the hubs.   i read your article on tdan.com “the datavault series” where you say:

“since the hubs are a list of business keys it is important to keep them together with any surrogate keys (if surrogates are available). upon examination of the model we find the following business key/surrogate key groupings (the examination included unique indexes and a data query)”

in the given example there is a category-table with businesskey categoryname and surrogate-source-key categoryid. you then create the category_hub:

[code]

create table hub_category (
categoryid int not null,
categoryname nvarchar(15) not null,
load_date datetime not null,
record_source nvarchar(12) not null,
primary key (categoryid)
)

create unique index hub_category_i1 on
hub_category (categoryname)

[/code]

question 1: is the categoryid in the hub the same id as from the source-system?

i assume it is a separate datavault-surrogate that has nothing to do with the source-surrogate.

to load the hubs: select a distinct list of business keys with their surrogate keys, where the keys do not already exist in the hub.

[code]

create view v_ins_hub_categories as
select distinct a.categoryid,
getdate() load_date,
‘northwind’ record_source
from northwind..[categories] a with (nolock)
where not exists
(select * from hub_categories with (nolock))

[/code]

question 2: why do you only select the source-surrogate-key to determine the new business keys in the source system?

should’t it be:

[code]
select distinct
a.categoryid,
a.categoryname,
getdate() load_date,
‘northwind’ record_source
from northwind..[categories] a with (nolock)
where not exists
(select * from hub_categories with (nolock))
[/code]

or maybe even:

[code]

select distinct
a.categoryname,
getdate() load_date,
‘northwind’ record_source
from northwind..[categories] a with (nolock)
where not exists
(select * from hub_categories with (nolock))

[/code]

i am a little confused here, so i really hope you can help.

answer to question 1:

in this specific example, categoryid is the same as the source system id.  however, in most cases, it should not be.  in typical data vault models, you should always select the business key (which should be category name) rather than the surrogate.  however, northwind has problems that go beyond the standard data model.  the following problems are boiling to the top, and end up becoming specific implementation rules:

1) there are no, and never were any, business users to ask/answer questions about business keys.  the model was built by sqlserver engineers, not business users.

2) there’s no integrity in the data set from version to version of release.  meaning they can and sometimes do, use the same sequence number to represent different category names.  the inverse is also true, they can and sometimes do change the assignment of sequences for the same category name.

3) when building a data vault for the northwind example, i was only consolidation a single source data model.  typically when you build a data vault, you will use many more data models (2, 3, or even 5 or 10) to integrate.  that said, using categoryid as a hub business key is a bad idea because it doesn’t span the business.

4) the article was written in 2001.  at that time i was in a hurry, and didn’t spend a lot of time fixing the code before publishing (which i should have done).

the point is this: the hub should not look this way at all.  the hub’s sequence key should be completely separate and independent of the source model, it should never use the source data models’ surrogate key.  the second point is: the hubs’ business key should be chosen independently of the source systems surrogate key (if at all possible, it should use the natural world key).   the bottom line is: the source model (northwind) is severely broken, and because of the nature of missing requirements, not enough data to profile, and bad business keys, the dv i put together also was broken.

answer to question 2:

i’ve already provided some insight as to why i used categoryid and not categoryname from the source model above.  if possible, i would have originally chosen categoryname as the business key, however from database release to database release, the name column is not necessarily constant, nor is it always populated, nor is it unique.  these factors make it difficult (if not impossible) to use categoryname by itself as a business key.  not to mention the other reason: there is no consistent manner to choosing business keys across the model.  not all tables have a natural business key.  so in that case, and in the name of consistency, i chose to use the surrogate id across the entire model of northwind.  don’t forget, i was integrating only 1 source system data model so it was a no-brainer.

when you integrate more than 1 source system data model, you must take the time to consider all of these questions, and go to get accurate answers from the business.  at the end of the day – the data vault that you produce will or won’t be quality based on the amount of up-front work you do to properly determine the business keys.

hope this helps,
dan linstedt
ps: all the answers for data vault modeling (only modeling) are in my book: super charge your data warehouse, available on amazon.com

Tags: , , , ,

Trackbacks/Pingbacks

  1. Tweets that mention Q&A About Hubs -- Topsy.com - 2010/11/10

    […] This post was mentioned on Twitter by BI real, Daniel Linstedt. Daniel Linstedt said: Q&A About Hubs: A quick Q&A about the choices I made in the Northwind Data Model… http://goo.gl/fb/JIuD3 […]

Leave a Reply

*