i’ve written about this danger many times; in fact, you can find some great postings and information on this subject in the “data vault discussions” forum on linkedin.com, but that’s beside the point. there are no exceptions to this rule (as i see it): foreign keys simply should never exist within a satellite. to do so severely damages the potential flexibility of the data vault model. let’s chat about some of the mathematics of maintenance behind this idea.
when we look at foreign keys, it is easy to think: “gee, it describes what the parent key seems to be at the moment” while at the same time it represents a link, a join, a relationship to another key or other pertinent data. it’s easy to fall down the slippery slope of simply deciding to add foreign keys to the satellite structures. there are all kinds of arguments as to “why” this can or should be done – but no matter what the argument – the fact remains: foreign keys in 3nf tables represent a relationship to other data.
in a data warehouse these foreign keyed relationships change over time, and maintaining that history is essential for auditability and traceability. however, this reasoning alone doesn’t negate the “desire” to put fks into the satellite.
so what exactly is this driving force that keeps me from saying “ok”?
let’s discuss! by the way, please remember there are 10 years of r&d behind these rules and standards, and i hope that i’ve tested the architecture in many different situations well enough to understand the most common outcomes of architecture changes. there is always a chance (a pretty good chance) that i missed something along the way, so if you think of something, go ahead and comment as a reply – and i’ll let you know my experience. now, on to the question at hand….
when we think about data models we often conveniently forget about complexity ratings. that is: complexity ratings of the data model itself in relation to etl/elt loading processes, bi query processes, and of course maintenance of the data model.
moreover, a data warehouse doesn’t have a simple complexity rating!
why? because a data warehouse stores data over time. that means any structural change impacts all future data (yet to arrive) and all past data already stored in the edw. it is because of this “past” data set that we need to consider the complexity rating in conjunction with a multiplier effect. in other words, the complexity rating of a data model without historical data might be a “3” on a scale of 1 to 10, whereas the complexity rating for absorbing changes into an edw (because of history) would be 3^2 or 3^3. all of a sudden the complexity rating is off the charts, i.e. a 9 or a 27 on a scale of 1 to 10….
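the arithmetic above can be sketched in a few lines of python. this is only an illustration of the multiplier idea, not a formal metric: the function name `historical_complexity` and its `exponent` parameter are assumptions made for this example, with the exponent standing in for how many times over the stored history has to be accounted for.

```python
def historical_complexity(base_rating: int, exponent: int = 1) -> int:
    """Raise a 1-to-10 model complexity rating to a power to reflect stored history.

    exponent=1 means no history (the plain model rating); 2 or 3 mirror the
    3^2 and 3^3 figures used in the text. Hypothetical helper, not a standard metric.
    """
    if not 1 <= base_rating <= 10:
        raise ValueError("base_rating is expected on a scale of 1 to 10")
    if exponent < 1:
        raise ValueError("exponent must be at least 1")
    return base_rating ** exponent


# Show how the same base rating of 3 grows once history is factored in.
for exponent in (1, 2, 3):
    print(exponent, historical_complexity(3, exponent))
```

with a base rating of 3 the results are 3, 9 and 27, which matches the figures quoted above.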
this is because the number of items impacted by the change doubles or triples. then there is the “data”: what to do about the data that is already stored? some people fall down this slope and say: “that’s easy, just add the new foreign key to the satellite and make it optional”. well, optional foreign keys (even in oltp tables without history) increase the complexity rating on a factorial basis.
to understand this, we look at the complexity ratings used to measure level-of-effort in maintaining code. in a programmatic sense, what happens to the complexity rating of a procedure/function when a “decision” or condition is introduced? it makes the code inside the condition optional, based on some data-driven element. the complexity rating increases. there are tons and tons of formulas to describe these effects in costs per defect, costs per option/decision, performance per decision or “branch”, etc….
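one widely used rating of this kind is mccabe’s cyclomatic complexity, which for a single structured function works out to the number of decision points plus one. the sketch below approximates it with python’s standard `ast` module by counting `if`, `for`, `while` and conditional-expression nodes. it ignores boolean operators and exception handlers, so treat it as a rough illustration only. the two `describe` snippets and their column names are hypothetical.

```python
import ast

# Node types treated as "decisions" in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp)


def approx_cyclomatic(source: str) -> int:
    """Approximate cyclomatic complexity as (number of decision nodes) + 1."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))
    return decisions + 1


# Hypothetical BI helper reading a satellite row with no optional foreign key.
WITHOUT_OPTIONAL_FK = '''
def describe(row):
    return row["customer_name"]
'''

# The same helper once an optional foreign key has to be checked.
WITH_OPTIONAL_FK = '''
def describe(row):
    if row["salesperson_fk"] is None:
        return row["customer_name"]
    return row["customer_name"] + " / " + str(row["salesperson_fk"])
'''

print(approx_cyclomatic(WITHOUT_OPTIONAL_FK))
print(approx_cyclomatic(WITH_OPTIONAL_FK))
```

the single optional column adds one decision, so the rating of this tiny function goes from 1 to 2.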
the satellites are the same… don’t introduce optional decision making to a satellite!!! why? because the foreign key structure in the satellite represented by the data of yesterday will not equal the foreign key structure in the satellite represented by the data of today. this forces bi code to become far more complex than it needs to be to account for timeline breaks.
but it gets worse. all of a sudden the metadata (meaning/definition) and the “grain” of the data in the satellite are called into question. for instance, suppose you had a satellite off customer today, and it had an optional foreign key to salesperson. what does the customer data mean? well – if it has a foreign key filled in, then it means a salesperson sold that customer? ok, but what happens if another foreign key is added to the satellite: salesterritory? now, what does the “old data” mean if it doesn’t have a salesterritory? does it mean we never got the data? does it mean we got the data on the feed but ignored it? does it mean that no-one entered the salesterritory in today’s data?
what exactly does the fk represent? does the fk mean the customer is in a particular sales territory or does it mean it’s the sales territory of the salesperson, but only when the salesperson fk is filled in?
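to make the ambiguity concrete, here is a small python sketch that enumerates the states a single satellite row can be in once it carries optional foreign keys. the column names are hypothetical, and the 2^n count it produces is simply the number of filled/empty combinations, offered as an illustration rather than a formal complexity formula.

```python
from itertools import product


def fk_states(optional_fks):
    """List every filled/empty combination for the given optional FK columns."""
    return [
        dict(zip(optional_fks, filled))
        for filled in product((False, True), repeat=len(optional_fks))
    ]


# Yesterday the satellite had one optional FK; today it has two.
yesterday = fk_states(["salesperson_fk"])
today = fk_states(["salesperson_fk", "salesterritory_fk"])

print(len(yesterday), len(today))
for state in today:
    print(state)  # each combination needs its own business definition
```

every one of those combinations is a question somebody has to answer before the row can be trusted in a report.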
as you can see, the complexity of deciphering all of these questions begins to raise the maintenance cost of the data vault model… something we all want to avoid. if we put the foreign keys in links, then “yesterday’s link a” has customer and salesperson, and a new link (if the data truly means “a salesperson sold this customer in this territory”) would contain customer, salesperson, and salesterritory (from this date forward).
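a minimal sketch of that separation, using python’s built-in `sqlite3` module with an in-memory database. all table and column names (`hub_*`, `sat_customer`, `link_*`, the `_hk`/`_bk` suffixes, `load_date`) are assumptions chosen for this example, not prescribed names. the point is only that the satellite carries descriptive attributes under its own parent key, while each relationship lives in its own link with every key mandatory.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hub_customer       (customer_hk TEXT PRIMARY KEY, customer_bk TEXT NOT NULL);
CREATE TABLE hub_salesperson    (salesperson_hk TEXT PRIMARY KEY, salesperson_bk TEXT NOT NULL);
CREATE TABLE hub_salesterritory (salesterritory_hk TEXT PRIMARY KEY, salesterritory_bk TEXT NOT NULL);

-- satellite: descriptive attributes only; its sole key reference is its parent hub
CREATE TABLE sat_customer (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    customer_name TEXT,
    PRIMARY KEY (customer_hk, load_date)
);

-- "yesterday's link a": customer and salesperson
CREATE TABLE link_customer_salesperson (
    link_hk        TEXT PRIMARY KEY,
    customer_hk    TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    salesperson_hk TEXT NOT NULL REFERENCES hub_salesperson (salesperson_hk),
    load_date      TEXT NOT NULL
);

-- new link from the change date forward: customer, salesperson and salesterritory
CREATE TABLE link_customer_salesperson_territory (
    link_hk           TEXT PRIMARY KEY,
    customer_hk       TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    salesperson_hk    TEXT NOT NULL REFERENCES hub_salesperson (salesperson_hk),
    salesterritory_hk TEXT NOT NULL REFERENCES hub_salesterritory (salesterritory_hk),
    load_date         TEXT NOT NULL
);
"""


def build_vault() -> sqlite3.Connection:
    """Create the illustrative tables in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def column_names(conn: sqlite3.Connection, table: str) -> list:
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = build_vault()
for table in ("sat_customer", "link_customer_salesperson",
              "link_customer_salesperson_territory"):
    print(table, column_names(conn, table))
```

the old link is left untouched when the new one arrives, so yesterday’s rows keep yesterday’s meaning and nothing in the satellite has to become optional.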
there is a distinct difference in segregating the relationships out, compared to “storing them with the satellites”. what you are effectively doing (when you put an fk in a satellite) is “overloading” the definition of the satellite (to use a coding term).
when you overload, you multiply the technical and business definitions, along with applying a factor to the complexity of understanding the information…. this is a bad, bad practice.
do not put foreign keys in your satellites!!! if you do, you do not have a data vault! furthermore, you will not inherit the benefits of the data vault going forward – and it will eventually cause you to re-design the whole edw from the ground up because maintenance costs will spiral out of control.
there are many other technical reasons, but this – this is the major business reason.
hope this helps clear the air,