I recently received an email from a future Data Vault consultant with some questions he'd like answered, so I will do my best to answer them here and now. If you have your own questions, please let me know what I can do to answer them. I feel some questions are better blogged for free, while others (due to the time needed to answer them correctly) may be best left for subscriptions (yes, subscriptions for Data Vault support are coming soon).
**Please note:** please don't mistake my enthusiasm for sarcasm here. I can be a bit blunt sometimes; I don't mean any offense by this posting.
1. Question 1: Old articles
There are 5 TDAN.com articles in total. They were written in 2001/2002, many years ago, and they reference a company I worked for back then, which is now defunct (bought out, really) and has been changed into another company. Apparently the articles on TDAN.com are still driving interest after all these years; it's good to see. For those struggling to find Data Vault resources, there is also the LinkedIn.com "Data Vault Discussions" group, with over 800 subscribers to the free discussion group (although the postings are a bit slow on the uptake); still, there's a lot of good information there. And, as you have now discovered, you can read about the Data Vault here!
Yes, this blog is enabled with a mobile skin, so you can use an iPod, iPhone, Android, or other device to read it. You can also subscribe via RSS.
2. Question 2: I really buy into the DV architecture approach, though it seems excessive to break every business/surrogate key out into a separate link table. Otherwise simple tables are split into two: one for content and one for FK relationships. I can see this making sense for high-volatility data, but not for financial transactions (for example) that are generally static.
This individual has not yet been through the certification training. In the training class we explain a notion called a transactional link (for example: a financial transaction) that is static in nature. The transactional link acts just like a fact table: because the data cannot (and legally should not) ever be updated, it can reside directly within the link structure. The caveat? No satellites can hang off this type of link structure without breaking it apart again.
Also keep in mind that for MPP reasons, separation is one of the keys to scalability (read my previous blog entry on the subject).
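To make the idea concrete, here is a minimal sketch of what a transactional link might look like, using SQLite purely for illustration; the table and column names (hub_account, tlnk_fin_transaction, and so on) are my own assumptions, not from any standard. The transaction's descriptive attributes live directly in the link row, fact-table style, and rows are only ever inserted:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_account (
    account_sqn   INTEGER PRIMARY KEY,      -- surrogate sequence number
    account_bk    TEXT    NOT NULL UNIQUE,  -- business key lives in the hub
    load_dts      TEXT    NOT NULL,
    record_source TEXT    NOT NULL
);
-- Transactional link: descriptive data sits IN the link, because a
-- financial transaction is never updated once it lands.
CREATE TABLE tlnk_fin_transaction (
    txn_sqn        INTEGER PRIMARY KEY,
    account_sqn    INTEGER NOT NULL REFERENCES hub_account(account_sqn),
    txn_id_bk      TEXT    NOT NULL,
    amount         NUMERIC NOT NULL,        -- descriptive attribute in the link
    txn_dts        TEXT    NOT NULL,
    load_dts       TEXT    NOT NULL,
    record_source  TEXT    NOT NULL,
    UNIQUE (account_sqn, txn_id_bk)
);
""")
# Insert-only: rows are appended, never updated.
con.execute("INSERT INTO hub_account VALUES (1, 'ACCT-42', '2010-01-01', 'GL')")
con.execute("""INSERT INTO tlnk_fin_transaction
               VALUES (1, 1, 'TXN-001', 99.95, '2010-01-02', '2010-01-02', 'GL')""")
print(con.execute("SELECT amount FROM tlnk_fin_transaction").fetchone()[0])
```

Note the trade-off stated above still applies: the moment a satellite needs to hang off this structure, the descriptive columns have to be broken back out of the link.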
3. Question 3: Question #2 is amplified when we have a relationship that is essentially immutable. For example, a policy has one-to-many policy terms. There is no way that a policy term will ever apply to a different policy; this is a matter of composition rather than assignment. Does it make sense in these cases to include policy ID (SK) as part of the policy term satellite rather than creating a new link table?
No, no, no! Relationships are never immutable over time. Every time I hear that a "relationship is immutable" it makes my skin crawl, because what that translates to is: "the business never changes its mind about how data interacts." It also translates to: "the business never retires old systems" or "the business never brings in new data to integrate into the warehouse." By the way, no relationship is ever immutable. Why? Because to make that claim would be to say: "my data is always perfectly in line with the relationship; we don't have data quality problems." I've yet to see a system where data quality is _not_ an issue. Anyone else like to chime in on this?
Now, policy ID in a satellite? No, we don't move business keys to satellites; this is a huge break in the flexibility of the Data Vault. I've got some posts, and a few blog entries, that describe what happens when the architecture is "changed."
4. Question 4: Are there cases where a link table could have content? Consider the AdventureWorks example of lnk_orderdetail, which has orderid and productid as the primary key. I would like to keep unitprice, discount, and quantity in that link table rather than creating a link satellite. What is the argument against this? And how would one relate lnk_orderdetail to lnk_sat_orderdetail without creating an orderdetailid surrogate key? Would you suggest joining on productid/orderid?
Again, transactional links, but these are merely stop-gap solutions. In very large data warehouses with high degrees of parallelism (in a shared-nothing environment), dividing and conquering (separating descriptive data from the associative keys) is the only way to make it scale. There are hundreds more arguments against this, and it's a discussion I'm currently writing into the Data Vault modeling book. It will also appear in the membership support section of this site.
All link tables carry surrogate IDs, or as I like to call them, SQNs (sequence numbers). Why sequence numbers instead of IDs? Because the term ID means too many different things to too many different people. We must standardize on terminology if we are to optimize the EDW building process and succeed.
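For the order-detail question above, the standard split can be sketched as follows (illustrative names, SQLite for convenience): the link carries only its own sequence number plus the key columns, the descriptive attributes move to a link satellite keyed by that sequence number, and the satellite joins back on the SQN rather than on productid/orderid:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- The link carries only keys plus its own sequence number (SQN);
-- unitprice/discount/quantity move to a satellite keyed by that SQN.
CREATE TABLE lnk_orderdetail (
    orderdetail_sqn INTEGER PRIMARY KEY,   -- the link's SQN
    order_sqn       INTEGER NOT NULL,
    product_sqn     INTEGER NOT NULL,
    load_dts        TEXT NOT NULL,
    record_source   TEXT NOT NULL,
    UNIQUE (order_sqn, product_sqn)
);
CREATE TABLE lsat_orderdetail (
    orderdetail_sqn INTEGER NOT NULL REFERENCES lnk_orderdetail(orderdetail_sqn),
    load_dts        TEXT NOT NULL,
    unitprice       NUMERIC,
    discount        NUMERIC,
    quantity        INTEGER,
    record_source   TEXT NOT NULL,
    PRIMARY KEY (orderdetail_sqn, load_dts)  -- history keyed by SQN + load date
);
""")
con.execute("INSERT INTO lnk_orderdetail VALUES (1, 10, 20, '2010-01-01', 'OLTP')")
con.execute("INSERT INTO lsat_orderdetail VALUES (1, '2010-01-01', 9.99, 0.0, 3, 'OLTP')")
# The satellite joins back on the link's SQN -- no natural-key join needed.
row = con.execute("""
    SELECT l.order_sqn, l.product_sqn, s.quantity
    FROM lnk_orderdetail l
    JOIN lsat_orderdetail s ON s.orderdetail_sqn = l.orderdetail_sqn
""").fetchone()
print(row)  # -> (10, 20, 3)
```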
5. Question 5: You mentioned that the DV is intended for insert-only, essentially making all data Type 2. This makes querying in any normal RDBMS completely impractical; multiple SELECT MAX subqueries would kill the best of servers. A few thoughts on this:
a. Many DWs don't need this level of historical detail; I'm considering updates in cases where historical data cannot be justified.
b. Another approach is to have separate tables that hold "closed" dates for surrogate keys on the historical table. This allows us to continue with insert-only while still articulating end dates on historical records. The current-state query is basically SELECT * FROM historical WHERE surrogate_key NOT IN (SELECT surrogate_key FROM historical_closed). MPP databases such as Netezza perform magnificently with this approach.
c. Any other thoughts on this would be appreciated.
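The reader's approach in (b) can be sketched like this, with SQLite standing in for an MPP engine and the table names following the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE historical (
    surrogate_key INTEGER PRIMARY KEY,
    payload       TEXT NOT NULL
);
CREATE TABLE historical_closed (
    surrogate_key INTEGER PRIMARY KEY,
    closed_dts    TEXT NOT NULL
);
""")
con.executemany("INSERT INTO historical VALUES (?, ?)",
                [(1, 'v1'), (2, 'v2'), (3, 'v3')])
# "Closing" row 1 is itself an insert -- the history table is never updated.
con.execute("INSERT INTO historical_closed VALUES (1, '2010-01-05')")
current = con.execute("""
    SELECT payload FROM historical
    WHERE surrogate_key NOT IN (SELECT surrogate_key FROM historical_closed)
    ORDER BY surrogate_key
""").fetchall()
print([p for (p,) in current])  # -> ['v2', 'v3']
```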
The satellites are Type 2 (insert only); the links and hubs are insert-only of new keys and new relationships, so they don't qualify under a Type 2 dimensional definition. Now, the article on satellites is 7 years old and getting older every day. When I wrote it I can't remember if I discussed satellites having start and end dates, but I do know that in the certification class this is something we cover, specifically to remove multiple SELECT MAX subqueries. Believe me, it does the trick. In the certification course we also talk about why an update can be run against a system-managed field such as start and end dates, and why this does not break auditability.
5a. Any time a user tells me "we don't need this level of historical detail," I ask: "When was your business last audited? Will it be audited in the future?" Does the business user realize that by "not needing this level of historical detail," they run the risk of failing an audit? Which is, by the way, really, really bad, and can kill data warehousing in that company for a very long time. Just ask some of the companies who only recently decided to try it one last time and went the Data Vault route (they are very happy today, as I understand it).
5b. Closed dates for surrogate keys: OK, but only in the satellites, and I call them "load end dates." By the way, since you brought it up: we have our first successful test case of a Data Vault on Netezza today, and it screams!
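A sketch of the load-end-date approach on a satellite (illustrative names, SQLite for convenience): the current-state query becomes a simple filter on the end date instead of a correlated SELECT MAX subquery, and the only update ever issued touches the system-managed load_end_dts column, consistent with the auditability point above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (
    customer_sqn INTEGER NOT NULL,
    load_dts     TEXT NOT NULL,
    load_end_dts TEXT,              -- NULL = current row
    name         TEXT NOT NULL,
    PRIMARY KEY (customer_sqn, load_dts)
);
""")
# A new version arrives: insert the new row, then end-date the old one.
con.execute("INSERT INTO sat_customer VALUES (7, '2010-01-01', NULL, 'ACME INC')")
con.execute("INSERT INTO sat_customer VALUES (7, '2010-02-01', NULL, 'ACME LLC')")
con.execute("""UPDATE sat_customer SET load_end_dts = '2010-02-01'
               WHERE customer_sqn = 7 AND load_dts = '2010-01-01'""")
# Current-state query: no SELECT MAX(load_dts) subquery required.
name = con.execute("""SELECT name FROM sat_customer
                      WHERE customer_sqn = 7
                        AND load_end_dts IS NULL""").fetchone()[0]
print(name)  # -> ACME LLC
```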
5c. Other thoughts? Hmmm, anyone else care to chime in? Always love suggestions 🙂
And with that, I think we made it through all the questions without too much trouble. I really hope this helps some of you folks out there struggling to find answers. Hang in there; a "Technical Data Vault Modeling" book is coming, as I've indicated in a previous post.