#datavault #hadoop #hive #nosql Modeling Breakdown

i’ve begun researching data vault models on hadoop solutions, including hadoopdb and hive.  recently i came across a number of articles which describe hive and hadoopdb in detail.  i had to take a minute to write this article, to explain my viewpoints on using the data vault model on a hadoop solution.  i also explain where the data vault model fits in the nosql or non-relational world, and why it’s still relevant.  furthermore, i touch on the changing nature of certification, and why it’s not so relevant any more.

what’s the beef with hadoop, hadoopdb and hive?

here is a scientific and mathematical article (one of many) which discusses the issues with hadoopdb when compared with hive. although it was written quite a while ago, it is still relevant, particularly because hadoopdb “replaces” the hdfs back-end storage of hadoop with its own relational libraries.


now, regarding dv on hadoop: i’ve said it before in my blogs, i’ll say it again:  the data vault *must* become a logical model, to the point where columnar db’s, kv stores, and triple stores don’t physically implement it.  in other words, hive becomes a great solution – because it turns the data model into a logical-only component.
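as a sketch of what “logical-only” can look like in practice (the table name, columns, and hdfs path below are my own hypothetical illustration, not from any particular implementation): in hive, a hub can be declared as an external table over files already sitting in hdfs, so the data vault structure exists purely as metadata over the raw storage.

```sql
-- hypothetical sketch: a customer hub declared as a hive external table.
-- the hub exists only as a logical definition (metadata); the physical
-- storage is just delimited files already landed in hdfs.
CREATE EXTERNAL TABLE hub_customer (
  customer_bk   STRING,     -- business key
  load_dts      TIMESTAMP,  -- load date/time stamp
  record_source STRING      -- originating system
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/customer';
```

dropping the table drops only the metadata, not the files – which is exactly the sense in which the model stops being a physical artifact.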

personally, i am choosing hive as my implementation paradigm for testing in my data vault labs.

at this point i would highly recommend hive over hadoopdb, but that is a personal choice.

how to put dv on a non-relational dbms

if you *really* want a physical dv on a system like this, then you *must* follow these rules:

  1. de-normalize – but only at the physical model level!
    • the hub must be replicated into each satellite,
    • link parent data must be replicated into each satellite.

it really is this simple.
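to make rule 1 concrete (a hypothetical sketch – the table and column names are my own illustration, not a prescribed design): the satellite physically carries the hub’s business key, so the physical table is de-normalized even though the logical model still shows a separate hub.

```sql
-- hypothetical hive sketch: the hub's business key (customer_bk) is
-- replicated into the satellite, de-normalizing at the physical level
-- only -- the logical model still has a distinct hub and satellite.
CREATE TABLE sat_customer_details (
  customer_bk   STRING,     -- replicated hub business key
  load_dts      TIMESTAMP,
  record_source STRING,
  first_name    STRING,
  last_name     STRING
);
```

the point of the replication is that reads against the satellite never need to join back to a hub table – which matters on engines with no pk/fk enforcement and expensive joins.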

this goes for db2 (as400), netezza, paraccel, vertica, greenplum, sybase iq, hadoop + hive, and any other kv store, triple store, graph store, or column-based store.

why should i use the dv model then?

good question.  there are still business reasons for using the data vault model, such as flexibility and adaptability – but you should be thinking: logical data model and not physical data model.  however, at this point, the real value is in the methodology rather than the data model!

the benefits of the methodology still stand:

  • accountability
  • auditability
  • understanding of the business
  • tracking / tying the business processes to the data model
  • ease of use
  • ease of build out
  • low complexity
  • raw data loading
  • massive parallelism (from the job design perspective)
  • massive scalability (from the job design perspective)
  • easy team scale-up (easy to add team resources)
  • performance and tuning simplicity

so you see, the value of the model shifts when the physical implementation is changed to a hadoop solution.  the value of the model now becomes one component of the methodology; and then – it’s the value of the methodology that makes a whole lot more sense.

so, that said, can i use any data modeling i want and still follow the data vault methodology?

yes, in fact, the future is in methodology, and automating the methodology to our best abilities.  this is the focus of the future, and by looking at the cards, the future is already here to some degree.

wait a minute, did you just say don’t use the data vault model?

nope, not what i said at all – go back and re-read the benefits section.  if you believe that the model holds no value now, then you’ve missed my point.

in a non-relational data storage system, the data vault model still holds value as a component in the methodology leading to better business understanding – along with the other benefits listed above.

what i truly am saying is: the data vault model is fast becoming a logical design choice rather than a physically implemented solution.

for far too long, data models have been driven by taking logical designs and implementing them physically, 1-to-1, in relational database engines.  well, now that we have non-relational database engines, that entire philosophy is changing; and the data model (any kind) can finally begin to shine as a value-added asset to the business from a logical standpoint.

what’s the end result?

if you are moving toward any of the non-relational database management systems for storing your data (hive, hadoop, nosql, columnar, etc…) the end result is this:

  1. certification in the data vault model begins to lose some of its luster or value to enterprises.  everything you really need to understand the data vault model, you can now get from the book on amazon or kindle select: super charge your edw
  2. understanding the data vault methodology, on the other hand, begins to really be important!  as does learning how to automate the build-out of your data integration solution.

more will come on the changing value of data vault model certification in future posts.

and from a non-relational (or hadoop perspective):

  • less and less emphasis will be placed (as it should) on the physical data model.
  • more and more emphasis will be placed (as it should) on the logical data model.
  • my prediction: more and more shifts will be made in business away from traditional rdbms engines, to non-relational engines (like hadoop+hive, and others)
  • there will be (in the future) a blended database management system that allows the best of both worlds in a seamless setup.
  • methodology & automation of your operational data warehouse / real-time data warehouse will move to the forefront.
  • your ability to understand your business as an it individual, will also move to the forefront.
  • your ability to connect business processes directly to the logical model of where the data is stored, will move to the forefront.

again, the value of simple data-model-based certification is beginning to wane (across the board, and that’s not just for data vault modeling certification).  the value of becoming certified in a repeatable and scalable methodology is really where the gold is; or at least, the value of understanding how to implement & generate a repeatable, scalable method for your data integration project becomes much more of a pressing need.

why? because the methodology is technology agnostic and carries with you through any underlying technology changes.

i’m looking forward to hearing your comments, please do add them below.

dan linstedt




2 Responses to “#datavault #hadoop #hive #nosql Modeling Breakdown”

  1. Kent Graziano 2012/06/04 at 3:16 pm #

    That last bit is pretty much what we are doing on my current project – modeling the data vault based on business process decomposition (and state changes). Currently implementing in Oracle but that may change before we are done.

    Based on the repeatable patterns our programmer is building a tool to generate PL/SQL load procedures to accept strings of information for real-time loads.

    Will keep you posted as we progress. Very exciting stuff!

  2. Peter 2018/11/21 at 5:08 am #

    Hello Dan,

    please could you elaborate a bit more in detail the topic of DV in HDP World? Specifically
    – No PK/Unique constraint in Hive – Insert all the values into the HUB? Or perform checks before inserting?
    – If storing duplicates, how to perform joins between multiple Hubs?
    – Are there any specific chapters/pages for the Hadoop world in the “Super Charge your EDW” book?

    Thank you very much!
