XREF Tables with History

have you ever wondered what should be / could be and what shouldn’t be an xref (cross-reference/lookup) table?  have you ever sat and pondered the nature of these tables, and how they should interact with the data vault?  have you considered storing history in the cross-references but weren’t sure why or how?  well, i’ll try to answer some of these questions in this post.

what could/should be an xref, and what shouldn’t?

well, this one is mostly simple.  a cross-reference table in the data vault really is a table that stands alone, but more than that – it’s business key is used to describe transactions and other business keys.  in other words an easy hit would be something like: “type code” or “code / description”.  codes / type-codes are basically role playing descriptors for other information.  they describe the status, state, or type of information that is related to another business key.  this classification of information really garners itself well to becomming (and should be) a cross-reference table.  the key by itself has no context within business.  the key is independent.  for instance a type code of “ab” is meaningless in it’s own context.  let’s assume it’s a blood type – meaning “ab+”  well, ok – it still is descriptive in nature.  ab+ type code is absolutely meaningless unless we know the “marked bag of blood” that it is associated to.

in data modeling this is commonly referred to as: “dependent child” – however it’s not a dependent child from this perspective.  why?  because it’s a dependent descriptor…  ab+ really describes the type of blood that it is associated with.   so being descriptive data, it can change over time (especially if someone makes a mistake, and tomorrow needs to change the type code of the blood because of lab analysis results).  keep in mind that “each bag of blood is generally marked with a unique business key or bar-code”.

now, these kinds of data really belong in cross-reference tables.  why? because in order to understand the full description, we should look it up.  we shouldn’t be storing the full description (replicating it) along side the type code in every satellite historical record.  it’s simply non-sensical. it affects performance, multiplies the space needed to store satellite records, etc..  (of course i’m referring to the full description).

but the bottom line is: the type code with it’s description play a role in describing another business key.  again, “ab+” is meaningless unless it is associated with some other context.

what is the nature of these tables, and how should they interact with the data vault?
have you considered storing history in the cross-references but weren’t sure why or how?

the nature of these tables takes two forms: pure or current xref (much like a standared 3nf oltp lookup table – with no history), or a data vault style: hub + satellite.  in the case of the blood product that i was discussing above, i would suggest the following:

xhub_blood_type (blood_type_code pk & bk) ->
xsat_blood_type_desc ([blood_type_cd,blood_type_cd_load_date]pk,
  blood_type_cd_load_end_date, blood_type_code_desc)

** the “x” prefix stands for cross-reference in this case, it makes it clear to the data modelers and people reading the schema **

this allows you to store/capture the history of changes to the blood_type_code.  in this manner, you can tie the correct description in history to the correct data in the satellite.  yes, the satellite in this case maintains the blood_type_code – in what appears to be a foreign key…  wow – i thought there were no foreign keys in satellites… right!  there shouldn’t be!  this one is there only logically – never implemented physically because blood_type_code by itself has no meaning to the business without additional context. 

it’s confusing, i know – but this is the nature of the business.  however – in a table with column compression, or a database that is column based, you might choose to experiment with embedding the full description in the satellite along side the type code.  why?  this is where it really belongs.  also, because column compression would supress the millions of “duplicate row entries” you would get, and help performance – and the column based database performs column compression for you automatically.

hmmm, so we are really truly not breaking the rule of no foreign keys in satellites?

correct – because of the real-application, and where the description is supposed to be/live we would not be breaking that rule at all.

what is the benefit of having xhub, and xsat then if they should be combined/absorbed?

well, when the data is pulled from the vault to the data marts, sometimes the dimesnions and facts want only the latest description – they don’t need and don’t care about the historical description, so rather than “updating thousands or millions of satellite rows” or “running some funky query to determine what the latest blood type is”, it is an easy task to query xhub/xsat combination to get this information.  other data marts want/need the historical description – so again, xhub/xsat makes it easy to get.

have you used cross-references in your vault?  how did they turn out?  what issues did you run into?  let me hear from you….

dan linstedt

Tags: , , , , ,

3 Responses to “XREF Tables with History”

  1. Cesar Vinas 2017/02/22 at 3:30 pm #

    Hi Dan. I have a Person Satellite with a Gender attribute. From source systems the values for this attribute can be: F, M, FEMALE, or MALE. Which of the two following approaches is the correct one:

    1. Store data in Gender as it comes from sources and in the Business Vault or Data Marts standardize the values to FEMALE and MALE only
    2. Create a cross-reference table to map out F to FEMALE and M to MALE, while loading the Person Satellite, transform F to FEMALE and M to MALE using the cross-reference table.

    I’m using Amazon Redshift that does support column compression.

    Cesar Vinas

  2. Dan Linstedt 2017/02/27 at 1:54 pm #

    Hi Ceasar,

    Always store the data as it arrives in order to ensure auditability. Translate the data on the way out to the business vault. This keeps the audit trails in place, and allows you to change the rules when the business changes.

    Hope this helps,

  3. Manuel 2018/12/27 at 6:36 am #

    Hi Dan. I have a “bank account” entity and a “customer” entity, so that a customer can be associated to an account with different roles at once:
    bank account: id_bank_acc, (PK), descriptive1 field, descriptive2 field…
    customer; id_customer (PK), descriptive1 field, descriptive2 field…
    relationship: id_rel (PK), rel_description
    rel_accnt_customer PK: id_bank_accnt (FK to bank account), id_customer(FK to customer), id_rel (FK to relationship), descriptive field 1, …

    In DV, what should be the best solution:
    1) a bank account hub, a customer hub, a relationship reference table, a link between them (PK: id_account,id_customer), and a link satellite with PK: id_account,id_customer,id_rel,timestamp

    2) a bank account hub, a customer hub, a relationship reference table, a link between them and adding id_rel as part of the link PK as a “dependent child”

    3) a bank account hub, a customer hub, a relationship hub, and a link between these hubs.


Leave a Reply