have you ever wondered what should be / could be and what shouldn’t be an xref (cross-reference/lookup) table? have you ever sat and pondered the nature of these tables, and how they should interact with the data vault? have you considered storing history in the cross-references but weren’t sure why or how? well, i’ll try to answer some of these questions in this post.
what could/should be an xref, and what shouldn’t?
well, this one is mostly simple. a cross-reference table in the data vault really is a table that stands alone, but more than that – it’s business key is used to describe transactions and other business keys. in other words an easy hit would be something like: “type code” or “code / description”. codes / type-codes are basically role playing descriptors for other information. they describe the status, state, or type of information that is related to another business key. this classification of information really garners itself well to becomming (and should be) a cross-reference table. the key by itself has no context within business. the key is independent. for instance a type code of “ab” is meaningless in it’s own context. let’s assume it’s a blood type – meaning “ab+” well, ok – it still is descriptive in nature. ab+ type code is absolutely meaningless unless we know the “marked bag of blood” that it is associated to.
in data modeling this is commonly referred to as: “dependent child” – however it’s not a dependent child from this perspective. why? because it’s a dependent descriptor… ab+ really describes the type of blood that it is associated with. so being descriptive data, it can change over time (especially if someone makes a mistake, and tomorrow needs to change the type code of the blood because of lab analysis results). keep in mind that “each bag of blood is generally marked with a unique business key or bar-code”.
now, these kinds of data really belong in cross-reference tables. why? because in order to understand the full description, we should look it up. we shouldn’t be storing the full description (replicating it) along side the type code in every satellite historical record. it’s simply non-sensical. it affects performance, multiplies the space needed to store satellite records, etc.. (of course i’m referring to the full description).
but the bottom line is: the type code with it’s description play a role in describing another business key. again, “ab+” is meaningless unless it is associated with some other context.
what is the nature of these tables, and how should they interact with the data vault?
have you considered storing history in the cross-references but weren’t sure why or how?
the nature of these tables takes two forms: pure or current xref (much like a standared 3nf oltp lookup table – with no history), or a data vault style: hub + satellite. in the case of the blood product that i was discussing above, i would suggest the following:
xhub_blood_type (blood_type_code pk & bk) ->
** the “x” prefix stands for cross-reference in this case, it makes it clear to the data modelers and people reading the schema **
this allows you to store/capture the history of changes to the blood_type_code. in this manner, you can tie the correct description in history to the correct data in the satellite. yes, the satellite in this case maintains the blood_type_code – in what appears to be a foreign key… wow – i thought there were no foreign keys in satellites… right! there shouldn’t be! this one is there only logically – never implemented physically because blood_type_code by itself has no meaning to the business without additional context.
it’s confusing, i know – but this is the nature of the business. however – in a table with column compression, or a database that is column based, you might choose to experiment with embedding the full description in the satellite along side the type code. why? this is where it really belongs. also, because column compression would supress the millions of “duplicate row entries” you would get, and help performance – and the column based database performs column compression for you automatically.
hmmm, so we are really truly not breaking the rule of no foreign keys in satellites?
correct – because of the real-application, and where the description is supposed to be/live we would not be breaking that rule at all.
what is the benefit of having xhub, and xsat then if they should be combined/absorbed?
well, when the data is pulled from the vault to the data marts, sometimes the dimesnions and facts want only the latest description – they don’t need and don’t care about the historical description, so rather than “updating thousands or millions of satellite rows” or “running some funky query to determine what the latest blood type is”, it is an easy task to query xhub/xsat combination to get this information. other data marts want/need the historical description – so again, xhub/xsat makes it easy to get.
have you used cross-references in your vault? how did they turn out? what issues did you run into? let me hear from you….