From Star Schema to Data Vault

another quick lesson.  these short lessons are not necessarily the whole story, but should be enough to get you started on actually building or modeling your data vault.  as with all quick lessons, always check your work against the source systems and feeds you receive.  look for all the possible missing pieces, watch the scope of the project or task, and try not to make it over-complicated.

to get from a star schema data model (as a data warehouse) to a data vault you might take the following steps.  there are several reasons you might go this direction, ranging from scalability, to auditability, to flexibility.  most folks who want better flexibility find the data vault model easier to work with.

  1. locate the one or more fields in the dimensions that constitute the business keys.
  2. if the fields are “independent” of each other, make them in to separate hubs.  if they are all co-dependent, then they become a single hub.  remember: separate the keys for hierarchical representations.  if the dimesnions don’t carry business keys, you’ll have to go back to the source systems before the transformations were made and find the proper business keys.
  3. take the fact tables foreign keys (from the dimensions) and build link structures.  however: just because all the foreign keys are sitting in a fact table doesn’t necessarily mean they equate 1 for 1 with the links…  if the foreign keys in the fact table were combined for convenience, then they need to be “unwound” or “deconstructed”.  go back to the source systems, look at the relationships and pull the “sets of keys” apart.  relationships in links do not represent the “aggregate view” like star schemas do, they represent the source system relationships.
  4. go to the dimensional data, separate it by “type” or business key dependency.  some dimensional/descriptive data will be placed in a satellite off one hub, while other dimensional data (specifically hierarchical) will be in a different satellite off another hub.  if you build multiple hubs / satellites from a single dimension then you should also build new link tables to represent the hierarchical relationship between the two business keys.
  5. go to the fact data that is not computed (could also be degenerate dimension data), and move it to a satellite hanging off the appropriate link tables.  remember, if the keys were broken into multiple links, then the rest of the descriptive data must be broken into respective satellites hanging off the links.  also remember, we do not put aggregated or computed data attributes in the data vault (at least for now).  a good guide is to review the source model, and look at the incoming transactions.  if they are comprised of multiple keys – then those key sets can stay together in a single link in the data vault.

remember to always check your resulting data vault model against the source systems and against the requirements.  this is just a simple guideline for getting your effort kick started.  also remember to think about “unwinding” all the transformation that is going in to the data in the star schema, because we want raw data to be loaded to the data vault model – so the model will most likely change.

have hints/tips/suggestions?  add a reply or comment.

hope this helps,
dan linstedt

Tags: , , ,

No comments yet.

Leave a Reply