you decide. i recently read a new “tip” on the kimball site regarding “new types” of slowly changing dimensions. i received an email from my good friend kent graziano about the tip, which also reminded me that bill inmon wrote about this 20+ years ago in his original writings, and that i added it to data vault modeling 15+ years ago when i published the standard.
if, in fact, the kimball camp is aligning star schemas with data vault modeling, i guess that’s a good thing, as copying is the sincerest form of flattery. on the other hand, i’d like to hear your opinions – what do you think kimball’s post really says? does it align with data vault modeling or not?
of course, there are still fundamental questions and issues to deal with, like: how does kimball truly define a “data warehouse”? and where exactly should the business rules live? i know my views on the subject, but i’d like to find out more from you. please add your comments and thoughts to the end of this post.
disclaimer: please remember that 1) i am biased, and 2) i am specifically referring to the use of star schema modeling as an edw in this context – i am not referring to the use of star schema modeling for production and release to the business users. i do believe in star schemas for data release to business users, just not for use as a back-end edw.
where’s the post?
overlap? what overlap? are these truly new innovations?
with type 0, the dimension attribute value never changes, so facts are always grouped by this original value
in the original specification of the data vault methodology, i clearly define the use of raw data sets that are never changed by the business. one of the impacts of following this method is that the data cannot be conformed, altered, or changed in any way. however, nowhere in his definition of a “type zero” does he even remotely suggest that the business rules should be moved downstream. i would tend to say that if you truly want a “type zero” kind of thing, you should really be considering data vault modeling in the first place. at the very least, you will now be required to move the business rules downstream; otherwise the descriptive data that is changed, munged, altered, and conformed will shift over time – making a type zero near impossible to achieve.
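to make the dependency concrete, here is a minimal sketch of why upstream business rules undermine a type 0 attribute. the rule and names here are hypothetical, purely for illustration:

```python
# a hypothetical upstream "business rule" that conforms incoming data
def conform_name(raw_name: str) -> str:
    return raw_name.strip().title()

source_row = {"customer_bk": "CUST-1001", "name": "  ACME corp  "}

# if conformance runs on the way in, the "original value" stored in the
# type 0 dimension is already an altered value; the true source value is gone
type0_value = conform_name(source_row["name"])  # -> "Acme Corp"

# if the rule itself changes later (say, stripping "corp" suffixes),
# reloading the same source row produces a *different* "original" value,
# and the never-changes guarantee of type 0 silently fails
```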
the type 4 technique is used when a group of dimension attributes are split off into a separate mini-dimension. this approach is useful when dimension attribute values are relatively volatile.
did i miss something here? this is the standard method of data vault modeling, and it has been there from the beginning in 2000/2001 when i released the data vault modeling standards. splitting satellites by type of data and rate of change has always been a good practice. now he finally suggests it for dimensional modeling. ok, no problem – it is a good advancement. but yet again, i have to ask the questions below (a data vault sketch follows them for contrast):
- if you are changing data on the way into the dimensional warehouse, how would you then track changes back to a mini-dimension?
- and even harder to answer: where is the business key in this “mini-dimension,” and what represents it? in other words, if you don’t have a business key identified, then it is near impossible to go back, “end-date” the old record, and activate the new one. in fact, in his example, he doesn’t even show the “temporal” aspects of data change.
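for contrast, here is a minimal data vault sketch (illustrative table and column names, not lifted from the published standard) showing both things the mini-dimension lacks: an explicit business key on the hub, and end-dating on satellites split by rate of change:

```python
# hub: carries only the business key and load metadata
hub_customer = {
    "customer_hk": "a1b2...",    # surrogate/hash of the business key
    "customer_bk": "CUST-1001",  # the business key itself
    "load_dts": "2013-02-01",
}

# slow-moving descriptive data lives in its own satellite...
sat_customer_profile = {
    "customer_hk": "a1b2...",
    "load_dts": "2013-02-01",
    "load_end_dts": None,  # end-dated when a newer row arrives
    "name": "Acme Corp",
    "segment": "enterprise",
}

# ...while rapidly changing attributes get a separate satellite, so churn
# here never rewrites the slow-moving rows above
sat_customer_scores = {
    "customer_hk": "a1b2...",
    "load_dts": "2013-02-15",
    "load_end_dts": None,
    "credit_score": 712,
    "churn_risk": 0.18,
}

# with the business key on the hub, end-dating the old satellite row and
# activating the new one is a straightforward keyed operation
```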
as if it weren’t hard enough to track changes, he continues on to introduce types 5, 6, and 7.
the type 5 technique builds on the type 4 mini-dimension by embedding a “current profile” mini-dimension key in the base dimension that’s overwritten as a type 1 attribute. this approach, called type 5 because 4 + 1 equals 5…
what? what kind of serious logic is this? you must be joking – using simple addition to justify changes to an architectural design?
adding 4 + 1 = 5 is mundane. what about the serious definition? the architectural considerations? the risk of snowflaking, which has been shown to be bad to begin with (bad for performance, bad for design, bad for architecture, etc.)? now, to apply a band-aid to an ailing “modeling technique for data warehousing,” he simply says: do a little simple addition, and voila – you have a new dimensional type.
the etl team must update/overwrite the type 1 mini-dimension reference whenever the current mini-dimension changes over time. if the outrigger approach does not deliver satisfactory query performance, then the mini-dimension attributes could be physically embedded (and updated) in the base dimension.
really? you’ve got to be joking…
he just contradicted himself on the purpose of a “type 5” mini-dimension – it cannot exist according to the statements he’s made. let me explain. first he said: use a type 4 for the “faster changing information” (the mini-dimension), and use a type 1 for the main dimension. then he said: the outrigger is a type 1 mini-dimension and should be overwritten (meaning no history). well folks, which is it? he then goes on to suggest denormalizing (re-combining the type 4 and type 1) back together again to re-form a type 3 dimension if performance doesn’t work… wow. in other words, there is no new innovation here at all, and it would be near impossible to construct the correct “mini-dimension” given the multiple definitions.
breaking it down (a sketch of this wiring follows the list):
- type 4 mini-dimension is updatable, and is broken out because of rapidly changing attributes
- type 4 mini-dimension should really be a type 1 mini-dimension with a type 4 main dimension – making this a type 5
- type 5 is really a type 1 mini-dimension with a type 4 main dimension – that is de-normalized back together at the first sign of performance problems (ending up right back where you started with a huge dimension and copying attributes that change frequently).
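for reference, here is roughly the wiring the tip describes, in a minimal sketch (hypothetical names and values, not kimball’s own example):

```python
# type 4 mini-dimension: one row per observed combination of volatile attributes
mini_dim_demographics = [
    {"demographics_key": 1, "income_band": "low",  "score_band": "C"},
    {"demographics_key": 2, "income_band": "high", "score_band": "A"},
]

# the base dimension embeds a "current profile" reference that is
# overwritten in place (type 1) whenever the current profile changes --
# the old value is simply lost, which is the history problem flagged above
dim_customer = {
    "customer_key": 55,
    "name": "Acme Corp",
    "current_demographics_key": 1,  # overwritten to 2 on change
}
```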
the data vault model, in contrast, is based on mathematical study of data, the rates of change, and the classifications of data. this study also includes the horizontal and vertical partitioning of data for performance and parallelism reasons – hence its close alignment with mpp systems (think big data and nosql). the data vault modeling paradigm is not arbitrary and does not simply “append” styles just to meet the needs of true enterprise data warehousing; it is built on a solid foundation of tried and true methods and best practices.
on the one hand, it’s good that dr. kimball is trying to innovate – for that i applaud his efforts. on the other hand, to show such disregard for the nature of the data itself and merely suggest “new types” of design patterns for an already troubled architecture is quite disconcerting. i fear that these “advancements” will only feed the fires of edw troubles more quickly.
type 6 builds on the type 2 technique by also embedding current attributes in the dimension so that fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value. the type 6 moniker was suggested by an hp engineer in 2000 because it’s a type 2 row with a type 3 column that’s overwritten as a type 1; both 2 + 3 + 1 and 2 x 3 x 1 equal 6. with this approach, the current attributes are updated on all prior type 2 rows associated with a particular durable key, as illustrated by the following sample rows:
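(the tip’s actual sample rows aren’t reproduced here; the sketch below shows the shape they take, with hypothetical values.)

```python
# illustrative type 6 rows: each type 2 row keeps its historical attribute,
# while the "current" column is overwritten on every row sharing the durable key
dim_product = [
    {"product_key": 1, "durable_key": "PROD-9",
     "dept_historical": "toys",  "dept_current": "games",
     "row_effective": "2012-01-01", "row_end": "2012-12-31"},
    {"product_key": 2, "durable_key": "PROD-9",
     "dept_historical": "games", "dept_current": "games",
     "row_effective": "2013-01-01", "row_end": "9999-12-31"},
]
```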
wait a minute… he just told you to put measurements or facts in a dimension. if this is true, then what’s the point of assigning monikers like dimension and fact? if this is true, then he’s just conflated the standards and the definitions (not to mention destroying the patterns) for putting specific types of attributes under specific labels.
i don’t know if you caught it, but he states in this paragraph that the attributes are updated on all prior type 2 rows… now, if i return to the base definition of an enterprise data warehouse, inmon defines the data set as non-volatile. that should mean there are no updates to existing, user-based data sets! yet he just suggested that an update (which changes the values of data that arrived from the source systems) be executed against user-based data. at that point, if you follow this rule, you do not have an enterprise warehouse. you have a data mart, where the information is subject to change.
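in sketch form, the operation the tip requires looks like this (hypothetical names, applied to the illustrative rows above):

```python
# every prior type 2 row for the durable key is rewritten in place
# whenever the current value changes
def overwrite_current(dim_rows, durable_key, new_value):
    for row in dim_rows:
        if row["durable_key"] == durable_key:
            row["dept_current"] = new_value  # destructive update of loaded history

# e.g. overwrite_current(dim_product, "PROD-9", "outdoor")
# under inmon's non-volatility requirement an edw never mutates previously
# loaded rows, so after this the structure behaves like a mart, not a warehouse
```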
he is updating the “current strategy” column, which is ok – he’s maintaining auditability in the historical column. but then again, from a big data perspective, how good will the performance be over time? the speed of the updates will continue to degrade as the data set grows.
for this case, proper sense would dictate that you build a new, higher-level “type 3 dimension” and fact table combination on top of the low-level detail. this higher level of grain would house the current assignments and overrides, eliminating the need for updates entirely – along with the need for “current row indicators.”
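as a sketch of that alternative (assumed structure, hypothetical names): keep the detail rows insert-only, and publish the current assignments at a higher grain:

```python
# low-level detail: insert-only, never updated
dim_product_history = [
    {"product_key": 1, "durable_key": "PROD-9", "dept": "toys",  "load_dts": "2012-01-01"},
    {"product_key": 2, "durable_key": "PROD-9", "dept": "games", "load_dts": "2013-01-01"},
]

# higher-grain "current assignments and overrides" dimension, appended
# (or rebuilt) on each load -- no in-place updates, no current-row indicators
dim_product_current = [
    {"durable_key": "PROD-9", "dept_current": "games", "as_of": "2013-01-01"},
]
```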
finally, kimball introduces type 7:
with type 7, the fact table contains dual foreign keys for a given dimension: a surrogate key linked to the dimension table where type 2 attributes are tracked, plus the dimension’s durable supernatural key linked to the current row in the type 2 dimension to present current attribute values.
type 7 delivers the same functionality as type 6, but it’s accomplished via dual keys instead of physically overwriting the current attributes with type 6. like the other hybrid approaches, the current dimension attributes should be distinctively labeled to minimize confusion.
ok, interesting. so, you can query by “active label” versus “previously assigned label” or dimension row. this one almost makes sense.
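to make the dual-key idea concrete, here is a minimal sketch (hypothetical names):

```python
# a fact row carries both keys, so a query picks its perspective by
# choosing which key to join on
fact_sales = {
    "product_key": 1,                 # surrogate key -> attributes in effect at the sale (type 2)
    "product_durable_key": "PROD-9",  # durable key -> resolves to the current row's attributes
    "amount": 250.00,
}

# join on product_key         -> the "previously assigned label" (point-in-time view)
# join on product_durable_key -> the "active label" (current view)
```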
conclusions and summary
in my humble opinion, dr. kimball is combining (unsuccessfully) two major functions that shouldn’t be combined:
- storage, history, non-volatility (data warehouses)
- presentation, preparation and release (data marts)
he’s trying too hard to get the star schema data modeling architecture to “bend” to meet the needs of the enterprise data warehouse. he has introduced massive complexity into his systems, and the etl routines needed to keep all this straight will end up as spaghetti code within the second or third project iteration.
a life lesson… i was always taught to take a big problem in life and break it into bite-sized chunks to solve it, not to force-fit solutions or build band-aids as short-term stopgap measures. this is why i hold fast to the belief that the data warehouse is physical and is not the same as a data mart layer. that view lets me separate the “jobs” and responsibilities of presentation from the jobs and responsibilities of historical storage and integration by business key.
again, i applaud dr. kimball for attempting to innovate. in this case, however, i believe these innovations will prove to be ill-fated prescriptions for those who follow the advice and continue mixing data warehouses with data marts in the same architectural layers.
please let me know what you think of all this. do you think it will work? do you see benefits to his approach? do you see the overlap with the data vault modeling techniques?