#datavault #agile #methodology #edw and business value

there seems to be an interest (again) in having a short and simple “business value” definition of the data vault model & methodology.  that said, i will do my best to list it as succinctly as possible.  however, i would love to hear your thoughts on what you think is business value, and whether or not it’s a differentiator for the dv.

more specifically, the differentiators that make the data vault model and methodology a unique value proposition.

issue #1:  people confuse and combine the methodology with the model.  please don’t make that mistake.  the dv model is just that: an architectural design pattern for encapsulating and organizing data sets.  a data model!  the methodology is just that: a method for implementation, combining repeatability, scalability, flexibility, performance, ease of use, ease of replication, consistency, and it implementation processes.

ok, that said:  there are two categories of value-add propositions:

category #1: the data vault model

the model has the following business values (as i see it) – feel free to add more, or ask me to change this list.

  • flexibility – ease of addition of new table structures without disturbing historical data stores, and without the need for re-engineering other parts of the model.
  • scalability – as it relates to traditional relational database management system (rdbms) data stores.  the horizontal and vertical partitioning that naturally takes place with the hubs, links, and satellites ensures that you can scale the model to petabyte levels in an mpp environment without changes to the standard model structures.  the dv model is based on set logic and the frequency of data changes, allowing you to split and merge data sets without losing history and without impacting any other parts of the edw model.
  • consistency / repeatability – hubs are hubs are hubs, links are links are links, and satellites are satellites are satellites.  it doesn’t matter how big the model gets, or how many tables there are.  the design of the model never changes; it remains a consistent, repeatable, pattern-based approach for your entire edw lifecycle.
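to make the pattern concrete, here is a minimal sketch in python/sqlite (the table and column names are my illustration, not a prescribed data vault standard) – note that adding a new satellite later touches nothing that already exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# every hub, link, and satellite follows the same structural pattern
con.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- surrogate/hash key
    customer_bk   TEXT NOT NULL,      -- business key
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_bk    TEXT NOT NULL,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_order (
    order_hk      TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer,
    product_hk    TEXT NOT NULL REFERENCES hub_product,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer,
    load_dts      TEXT NOT NULL,
    name          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
""")

# adding a new descriptive satellite later disturbs nothing that exists
con.execute("""
CREATE TABLE sat_customer_address (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer,
    load_dts      TEXT NOT NULL,
    address       TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
)
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

every table above is one of the three shapes – that is the whole point: the pattern never changes, only the subject area does.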

now: because certain folks insist on comparisons to other data modeling methods (but are unwilling, it seems, to ask for them), i will try to elaborate here:

3rd normal form as an edw: exhibits these problems

  • flexibility – 3nf as an edw (with time-series or temporal adaptations) causes versioning problems in the data sets.  in a transactional sense, this basically means “transactional consistency might be compromised,” because parents embed their keys directly into child records.
  • scalability – 3nf as an edw has trouble scaling to any size.  this includes the number of tables, the overall performance of the model, and the amount of data it can store: a parent-child relationship is “finitely limited,” as described by discrete mathematics.  in other words, the parent-child enforcement of relationships eventually limits the 3nf-as-an-edw data modeling paradigm.
  • consistency / repeatability – ok, in 3nf as an edw, everything is supposed to follow 3rd normal form rules.  but because of the temporal aspects of the data, it cannot.  it follows 2nd and sometimes 1st normal form rules.  in order to overcome these limitations, people adapt or change the data model – resulting in a serious loss of fidelity.  in other words, they introduce “exceptions” to the model itself, to the architecture.  as the model grows, it becomes less & less documented, and less & less understood.  this is neither consistent nor repeatable in design patterns.
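a toy sketch of the parent-key embedding problem (the data and key names here are invented, purely illustrative): one temporal change to a parent forces you to revisit every child row that embedded its key:

```python
# in 3nf-as-an-edw, child rows embed the parent's key directly.
# a temporal change that versions the parent's key means every
# embedded reference in every child table must be found and re-pointed.

# illustrative data -- one customer, many orders embedding its key
customer_versions = [
    {"customer_key": "C1", "valid_from": "2011-01-01", "name": "acme"},
]
orders = [{"order_id": i, "customer_key": "C1"} for i in range(1000)]

# a temporal change introduces a new, versioned parent key ...
customer_versions.append(
    {"customer_key": "C1-v2", "valid_from": "2012-06-01",
     "name": "acme corp"})

# ... so every child row still carrying the old key is affected
affected = [o for o in orders if o["customer_key"] == "C1"]
print(len(affected))  # 1000 child rows to touch for one parent change
```

the update cost cascades with data volume – which is exactly the “finitely limited” scaling problem above.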

dimensional model form as an edw: exhibits these problems

  • flexibility – dimensional model form as an edw has trouble being flexible, particularly if it is not modeled at the granular level of the source data set.  why?  because when it’s modeled at a super-set level (or an aggregate level) it generally introduces summary data, along with hierarchies of data.  the flexibility fails if a change is made to the underlying data that controls the hierarchy – resulting in forced re-engineering of the data model (generally this occurs at the dimensional levels).  if the dimensional model is at the pure granular level, then it is less likely to suffer problems – until the hierarchy of the data set changes!  you see, the dimensional model becomes “brittle over time” because it is a data-driven design rather than a pattern-driven design.  it represents sets of tuples and relationships combined into single dimensions.  therein lies the application (or restriction) of finite mathematics: as complexity rises (and changes drive dimension complexity), the cost of maintaining that complexity eventually over-runs the budget, or the ability to keep the model alive – forcing a re-design or re-engineering to occur.
  • scalability – ok, a lot has been done at the infrastructure level over the past 10 years to support dimensional models.  it’s where “star join optimizers” came from, and so on.  so yes, today – with the right hardware, and with the right budget – you can scale a star schema effectively.  now, scalability of the data set is a different issue.  when dimensions become too wide and they blow the rows-per-block ratio out of the water, i/os will rise beyond the benefits gained from denormalization in the dimension (see my past post on joins, normalization, and denormalization).  the dimension can no longer perform effectively.  so what does the customer do?  buy bigger hardware, or turn to an appliance like netezza for “big table handling.”  the data vault model will scale further in raw data form than a dimensional model before you ever need to look at moving your edw to another platform just to get it to return results.  however: if you put both models on an rdbms with compression on, the dimensional model will “most likely” outperform the data vault until a certain size is reached – where the i/os for table scans and joins outweigh the cost of distributed computing power and parallel compute abilities.  this can be seen in today’s “trend” toward the ultimate 6th normal form k=v store: hadoop, and its ability to handle terabytes of information, where document stores simply haven’t gotten there yet.  the same applies to the data vault model when comparing it to a denormalized dimensional model.  it’s all in the mathematics here, folks.
  • consistency / repeatability – well, once again: dimensions are dimensions, facts are facts.  except when?  except when dimensions contain hierarchies, or dimensions are snow-flaked hierarchies, or dimensions are degenerate dimensions.  and facts?  well, when facts are non-fact fact tables!  what?  oh yes, factless facts.  or when factless facts are helper tables, or when factless facts are dimensional link tables…  ok, do i make myself clear?  consistency of this kind of model, with all kinds of exceptions to the modeling rules, does not bode well for a scalable enterprise model.  at first glance, yes – facts & dimensions are simple and easy to understand.  but get into the real world, and what do you get?  mass confusion as the model grows to scale the enterprise (i’m talking about the number of tables here, not the size of data).  there’s one more problem (mentioned above under flexibility).  dimension-to-dimension designs will (generally) be different.  why?  because the business rules are upstream of the dimension, causing the aggregates to change when the business rules change – which then causes cascading change impacts to the “child” tables linked to that dimension.  and if you are not changing your facts when you are changing the grain of your dimensions, then your facts will not add up properly for the business – and you will end up tearing the whole thing down and re-engineering a new model to meet the business needs.
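the grain/hierarchy problem in that last bullet can be shown in a few lines of python (the product and category names are invented): the same facts roll up to different totals once the dimension’s hierarchy changes, so previously reported aggregates no longer tie out unless history is restated:

```python
# a fact table aggregated against one version of a product hierarchy
fact_sales = [
    {"product": "widget-a", "amount": 100},
    {"product": "widget-b", "amount": 200},
]

# dimension v1: both products roll up to the same category
dim_v1 = {"widget-a": "widgets", "widget-b": "widgets"}

# dimension v2: the hierarchy changes upstream -- widget-b is re-classed
dim_v2 = {"widget-a": "widgets", "widget-b": "gadgets"}

def rollup(facts, dim):
    """sum fact amounts by the category each product rolls up to."""
    totals = {}
    for row in facts:
        cat = dim[row["product"]]
        totals[cat] = totals.get(cat, 0) + row["amount"]
    return totals

print(rollup(fact_sales, dim_v1))  # {'widgets': 300}
print(rollup(fact_sales, dim_v2))  # {'widgets': 100, 'gadgets': 200}
```

the “widgets” total the business signed off on last quarter (300) simply no longer exists under the new hierarchy – that is the cascading change impact described above.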

category #2: the data vault methodology

ok – so those are the fundamental data model comparisons – and by the way, this isn’t magic sauce folks, it’s basic mathematics (discrete math, linear math, finite math, set logic math, and the math of complexity).  please please please, do not confuse the model for the methodology.  they are two separate and distinct things!!!

the methodology is just that: a method for implementing your enterprise data warehouse in a timely (agile) fashion with the least amount of risk, pre-assigned patterns, and optimized it (business) processes to lead you to a consistent goal.

there are hundreds of methodologies in the world today for all kinds of things, and no one methodology is king.  you should adapt, apply, use, and learn from all methods and approaches that suit your particular needs.

ok, now for a bit of q&a:

q: is the dv methodology really that different from waterfall or spiral?
a: yes, in one way: it’s a hybrid, just like the model is a hybrid.

q: can i use the dv methodology without the dv model?
a: yes, but you increase the risk of failure of your project, because the methodology has been tuned to work with the data vault model.

q: can i adapt parts of the dv methodology into my own project management methods?
a: yes, of course!  the dv methodology is created in a modular fashion for just that reason.

q: what makes the dv methodology special when compared to waterfall or spiral?
a: the dv methodology is specifically tuned and targeted at the business practice of creating an enterprise data warehouse.  waterfall and spiral are project management practices which are included in the dv methodology.

q: well, wait a minute, why then is dv methodology tied to “agile”?
a: i’ll let my good friend kent graziano handle this one, but here’s my two cents: because of its consistent pattern-based design and project approach, it can be rapidly built, easily adapted, and even more: automatically generated (these days).  but in the end, what makes a project agile is truly nothing but the people who build it and how fast they can respond.  the dv methodology is a quick and easy ramp-up for all those who follow it.  it has dedicated and prescribed standards and rules that guide the entire process.

well, that’s all i have time for now.  if you have further questions, or perhaps even insights (i am sure i missed a bunch of things here), feel free to comment.

dan linstedt


One Response to “#datavault #agile #methodology #edw and business value”

  1. Kent Graziano 2012/06/07 at 8:23 am #

    Thanks for the shout out Dan!

    So DV and Agile – pretty much what Dan said about everything being pattern based and generate-able is what I found to be the factors that allow me to take an “agile” approach with a data vault project.

    Currently I am on a project doing 2-week sprints and we can easily build out a few (sometimes more than a few) hubs, links and sats (then facts and dimensions on top) in each sprint. Because of the structure of the model, we have had no issues hooking the data from one sprint to data uncovered in the next sprint. We generally work from a business process decomposition and state change diagrams – which map very well to data vault structures.

    Because of the repeatability of the model patterns, I have SQL templates for developing the load routines. With these I can whip out first-cut insert statements in minutes, then hand them off for “polishing” to the ETL programmer. In fact, based on those templates, she has now developed PL/SQL programs that generate the load procedures dynamically! With that it becomes very easy to refactor the loads each sprint if the model changes.
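(a rough python sketch of what this kind of template-driven generation might look like – the template, table names, and column names here are illustrative assumptions, not Kent’s actual templates:)

```python
# every hub load follows the same pattern, so the statement can be
# generated from one template plus a little metadata per hub
HUB_LOAD_TEMPLATE = """\
INSERT INTO {hub} ({hk}, {bk}, load_dts, record_source)
SELECT stg.{hk}, stg.{bk}, stg.load_dts, stg.record_source
FROM {staging} stg
WHERE NOT EXISTS (
    SELECT 1 FROM {hub} h WHERE h.{hk} = stg.{hk}
)"""

def generate_hub_load(hub, hk, bk, staging):
    """fill the hub-load template from per-hub metadata."""
    return HUB_LOAD_TEMPLATE.format(hub=hub, hk=hk, bk=bk, staging=staging)

sql = generate_hub_load("hub_customer", "customer_hk", "customer_bk",
                        "stg_customer")
print(sql)
```

because hubs, links, and sats each have exactly one load pattern, three such templates cover the whole vault – which is why the loads are easy to regenerate when the model changes mid-sprint.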

    As Dan stated, it is not that DV is an agile method itself but rather the nature of the architecture and the patterns lends itself to applying agile techniques (like sprints). I actually stated this and was doing it (with a data vault model) almost 10 years ago at Denver Public Schools.

    You can do agile data warehousing without data vault, but why would you want to?
