i recently had the pleasure of an e-mail exchange about nosql platforms (one in particular) and the use of data vault models. i need to start by saying that this post is published with the permission of the author of the email question. in this post i will dive a little deeper into the work that sanjay pande, michael olschimke, and i are doing together around big data, nosql, and data vault modeling. this is all still work in progress. i am more than happy to entertain corrections, additional thoughts, comments, and questions – just post them at the end of the blog.
graham pritchard, eim, steria (uk)
as a relative novice to dv models and methodology i’d be really interested to understand your high level views on how this methodology relates to the potential offered by nosql platforms such as marklogic to suck in disparate forms of data into a data lake, and then effectively model it retrospectively.
is there a case for both approaches? can they complement each other? if so, how?
my first response:
thank you for contacting me. these are warranted, and insightful questions. let me just start by saying: there are over 150 different vendors (at least) in the “nosql / big data space”, each one has their own view of how they solve certain problems. trying to build a “standard” that works on all vendors (at least today) is next to impossible.
that said, i took a look at marklogic (on-line, read some technical documents on their triplestore engine, etc…) – so i must first say that i am not an expert in their technology, nor have i had the chance to actually work with it in a hands-on project. so i am a bit lacking in experience, and thus – my comments will be further tempered by this situation.
in terms of the work sanjay and i are doing, we have worked hands-on with cloudera, and apache hive, apache hadoop, and are beginning to experiment with mapr, and hortonworks. my customers (today) that are using hadoop “platform” in their edw environments, are a) quite large customers, b) have “big data problems” – both structured and unstructured, c) are applying the “hadoop platform” mostly for two reasons: as a staging area in the bi space and for deep analytics – where queries can run long, and file scans are not a problem.
all that said: when you take a _hard look_ at “nosql” or a hadoop platform, from a modeling perspective a few things jump out:
1) 90% of these systems are really good at ingesting “anything you want to throw at it” without any modeling what-so-ever, hence: schema-on-read
2) in order to “make business value” out of any data, you need to classify, organize, and identify the information in the files (one way or another).
3) eventually, some sort of “data model” is needed in order to reach the business with result sets that make sense (i.e. turn data into information).
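to make the three points above concrete, here is a minimal python sketch of schema-on-read. the record layout, the field names (customer_id, amount) and the read_with_schema helper are assumptions made up for illustration and are not taken from any vendor's api: raw lines are stored untouched at ingest, and a schema is only applied when a query reads them.

```python
import json

# Ingest: raw lines are stored as-is, with no schema enforced up front.
raw_store = [
    '{"customer_id": "C-100", "amount": "19.99", "note": "first order"}',
    '{"customer_id": "C-200", "amount": "5.00"}',
    '{"customer_id": "C-100", "amount": "7.01", "channel": "web"}',
]

# The schema is declared by the reader, not by the store (hypothetical layout).
SCHEMA = {"customer_id": str, "amount": float}


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the declared fields, cast types."""
    for line in lines:
        doc = json.loads(line)
        yield {field: cast(doc[field]) for field, cast in schema.items()}


def total_by_customer(lines):
    """Turn data into information: total amount per customer."""
    totals = {}
    for row in read_with_schema(lines, SCHEMA):
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


totals = total_by_customer(raw_store)
```

the store never changes shape. only the reader decides which fields matter, which is why some model is still needed before the result means anything to the business.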
so, at the end of all of this: here’s what is happening in data vault 2.0 space / customer world:
1) customers want to leverage their existing relational dbms investments (not just throw them out, or dump & replace with nosql / newsql, or hadoop platform)
2) customers want to “release” the dependency on loading to the relational edw (ie: they don’t want to have to “look up” some sequence number just to allow an insert to a hadoop platform)
3) customers want to “tie” the relational data sets in the rdbms to the nosql (relational, or multi-structured, or unstructured) data living in the hadoop platform.
so – to that end, data vault 2.0 modeling changes the sequence key – replacing it with a computational hash key value. this answers all 3 of the needs above.
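as a rough sketch of what a computational hash key can look like, the python below derives a key from the business key alone. the choice of md5, the trim and upper-case normalization, the “|” delimiter and the function name hash_key are all assumptions for this example, not a statement of the official standard. the point is only that two loaders can compute the same key independently, with no sequence lookup.

```python
import hashlib


def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic key from one or more business key parts.

    Assumptions for this sketch: parts are trimmed and upper-cased,
    joined with a delimiter, and hashed with MD5.
    """
    normalized = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Two independent loaders (say, one for the RDBMS and one for Hadoop)
# compute the same key without looking anything up in the EDW.
key_from_rdbms_loader = hash_key("cust-0042")
key_from_hadoop_loader = hash_key("  CUST-0042 ")
```

because the key is a pure function of the business key, an insert on the hadoop side does not have to wait for the relational edw to hand out a number.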
now, the next topic: how does dv2.0 work / provide value to just the hadoop / nosql platforms by themselves?
depends on what answer you want.
1) separating hubs and links basically provides physical indexing capabilities without moving the entire “set” / or file across all the mpp nodes, making “joins” a bit easier on distributed systems
2) attaching hash keys to “json documents, video files, audio files, images, documents, or even structured/multi-structured data” allows joins from the relational world to the non-relational world, or even – again, better indexing (as described above in #1)
3) satellite “models/tables” in dv2 for data residing on hadoop platform are only justified as schema-on-read, or if you want an optimized access path through hiveql via internal tables for instance. so that means the satellite table structures are logical in nature (when on hadoop platform).
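point #2 in the list above can be sketched in a few lines of python. everything here is hypothetical: the in-memory dictionary stands in for a hub table in the rdbms, the json strings stand in for documents on the hadoop side, and the md5 based hash_key helper is an assumed implementation. the sketch only shows that once a document carries the hash key, the join needs nothing but that key.

```python
import hashlib
import json


def hash_key(business_key):
    """Hypothetical hash key: MD5 of the trimmed, upper-cased business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


# "Relational" side: hub rows keyed by hash key (stand-in for an RDBMS table).
hub_customer = {hash_key(bk): {"customer_bk": bk} for bk in ("C-100", "C-200")}

# "Non-relational" side: JSON documents tagged with the hash key at load time.
raw_documents = (
    '{"customer": "c-100", "body": "support call transcript"}',
    '{"customer": "C-300", "body": "unmatched document"}',
)
documents = []
for raw in raw_documents:
    doc = json.loads(raw)
    doc["hub_customer_hk"] = hash_key(doc["customer"])
    documents.append(doc)

# Join on the hash key only; no sequence lookup, no parsing of the body.
joined = [
    (hub_customer[doc["hub_customer_hk"]]["customer_bk"], doc["body"])
    for doc in documents
    if doc["hub_customer_hk"] in hub_customer
]
```

the same tagging idea would apply to a video or audio file, where the hash key would live in the file's metadata rather than inside the content.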
business value speaking:
1) hubs still provide business value as an integration point for a unique list of business keys
2) links still provide business value as an integration point – for a unique list of relationships / attributes
3) satellites – well, again no real business value except for housing data over time. they only provide “technical value” (see point #3 above). the business value is schema-on-read, determined at run-time of the query.
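a small python sketch of the first two points: a hub as a unique list of business keys and a link as a unique list of relationships. the order records, the table names and the md5 based hash_key helper are assumptions for illustration only.

```python
import hashlib


def hash_key(*parts):
    """Hypothetical hash key over one or more business keys (MD5, '|' delimiter)."""
    normalized = "|".join(part.strip().upper() for part in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Source records; customer/product pairs repeat across them.
orders = [
    {"customer": "C-100", "product": "P-1"},
    {"customer": "C-100", "product": "P-2"},
    {"customer": "c-100", "product": "P-1"},  # duplicate after normalization
    {"customer": "C-200", "product": "P-1"},
]

hub_customer, hub_product, link_customer_product = {}, {}, {}
for row in orders:
    customer = row["customer"].strip().upper()
    product = row["product"].strip().upper()
    # setdefault keeps the first occurrence, so each key is stored once.
    hub_customer.setdefault(hash_key(customer), customer)
    hub_product.setdefault(hash_key(product), product)
    link_customer_product.setdefault(
        hash_key(customer, product), (hash_key(customer), hash_key(product))
    )
```

four source rows collapse to two customers, two products and three distinct relationships, which is the integration value the hubs and links are meant to provide.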
unfortunately i don’t have hands-on experience with marklogic, so i am not qualified to state the pros and cons of applying these thoughts directly on their platform. nor am i qualified to discuss if their platform would or would not benefit from data vault 2.0 approaches. however, it is safe to say: that if they have any of the pain points mentioned above, that the data vault 2.0 modeling components would assist in building business and technical value.
i hope this helps, sorry if it’s not clear enough – we are currently working “in our labs, on our servers” on specific implementations in an attempt to find these answers.
lest we forget: dv2.0 is more than just the data vault model, it includes: architecture, methodology, modeling, and implementation. so it brings the entire package together for agility, scrum, unstructured / non-structured data, nosql, automation / generation, and more.
please feel free to ask more questions, should you think of them.
sanjay also replied:
as dan already stated, there’s so much variety out there right now just in solution choices. even proprietary big data vendors like lexisnexis have now open sourced their core technology (hpcc) to compete with the “elephant in the room” – which is arguably more mature and advanced than the current iteration of hadoop (mapr being the only exception due to their own approach to architecture).
that said, our current research is focused on the apache hadoop platform for which we offer an alternate thought process and an architecture (currently experimental).
there’s a lot of existing learning and deep knowledge-base in the bi world that’s getting thrown out with new technologies and i’m particularly sad to see this happen time and again causing unnecessary churn and re-learning of concepts in future.
marklogic is interesting, as is the disco project by nokia … and so many others.
the maximum momentum currently is gathered by hadoop and technologies surrounding it or built on top of it. as is customary with it hype cycles, everyone and their uncle jumps on the bandwagon and then we techs end up supporting architectures that eventually break.
vendors never help. the situation looks very similar to commercial linux vendor offerings when they happened, but linux wasn’t really solving a pressing problem whereas hadoop is in terms of storage. linux eventually displaced proprietary unix. the competition is also a lot more intense than 10-15 years ago in this particular space with the same organization developing competing solutions which further confuses the implementers (example: cassandra and hbase are both apache projects with minor differences. they’ve been forked as well and have other non-apache open source and proprietary competition).
the progression of technologies produced by the current open source big data crowd to me says, they know a lot about handling large amounts of data storage and processing technologically, but without understanding the business problems related to bi. which is why you see things like apache drill and impala working hard to best each other when another solution on hbase called phoenix already beats them fair and square, and spark/shark increase query efficiency of hiveql queries as well.
the separation of the storage and archiving activities and pushing business rules downstream of the data warehouse into the information marts is what has given the dv (and now dv 2.0) its share of success by essentially minimizing churn and pushing the “changeable” aspects of the solution as close to the consumer as possible.
the data vault (especially 2.0) is also a solution blueprint for business intelligence applications with various portions and how to build them done well and tested in the real-world.
now, while there are many non-bi applications for big data technologies, there are several advantages in leveraging learnings from the dv as and where one can especially with patterning and code generation for quick build outs and for reduction of maintenance churn.
our use cases are limited to business intelligence on our research of these platforms.
graham’s follow-on thoughts:
hi dan (and sanjay),
you have mirrored many of my thoughts in your response, which hopefully is a positive.
in terms of nosql schema on read modelling aligned with semantic triple store capabilities – and i might have this entirely the wrong way around – my instinctive view is that this from purely an insight perspective potentially removes many of the hard yards traditionally employed in developing the model (notwithstanding the methodologies employed) up front. this is where i think my initial question to you was coming from.
simply put, i am trying to work out in my own tiny mind, how to deliver best value and advice to my customers. i believe that dv has a real role to play in this, but i think that role may have changed since the realization that a nosql platform with added search/discovery functionality, mapreduce and semantic triple store capabilities can consolidate, identify, analyse and present information in a consistent, coherent and flexible manner.
in short, in the new world the potential end to end process (in simplified form) looks like:
- identify relevant internal and external data sources based upon business need
- load and index data ‘as is’ into nosql platform, irrespective of format
- build triple store indexes into same platform
- use multi-tiered storage and mapreduce analytics where appropriate – hdfs/hadoop
- group related data clusters into document forests for interrogation and subsequent eu access
- present data via ui of choice; search capabilities, bi front end, monitoring dashboard, esb
this obviously overlooks data and information governance, data quality loops, etc., but potentially provides a process that captures all data (auditable) and potential subsequent changes in the one platform, and makes all that data available in real-time without many of the overheads and challenges that often afflict bi/dw deployments, whilst providing business with a level of insight within highly responsive timeframes and the ability to draw insight from unstructured data sets alongside structured data elements. this, at its simplest level, is clearly very attractive!
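as a rough illustration of the “load as is, then build triple store indexes” steps in the list above, here is a toy python sketch. it is not marklogic's api or any vendor's implementation. the documents, the field names and the two index structures are assumptions chosen to show the idea of deriving (subject, predicate, object) triples after ingest and indexing them for lookup.

```python
from collections import defaultdict

# Documents loaded 'as is' (hypothetical content).
documents = {
    "doc1": {"customer": "C-100", "product": "P-1"},
    "doc2": {"customer": "C-200", "product": "P-1"},
}

# Derive (subject, predicate, object) triples after ingest.
triples = [
    (doc_id, predicate, obj)
    for doc_id, fields in documents.items()
    for predicate, obj in fields.items()
]

# Two simple indexes: by subject, and by (predicate, object).
by_subject = defaultdict(list)
by_predicate_object = defaultdict(set)
for subject, predicate, obj in triples:
    by_subject[subject].append((predicate, obj))
    by_predicate_object[(predicate, obj)].add(subject)

# Which documents mention product P-1?
docs_for_p1 = sorted(by_predicate_object[("product", "P-1")])
```

the lookup is cheap, but note that nothing in this structure says what a customer or a product is to the business, which is the gap the framework question in this thread is about.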
all that said, i personally struggle with not having a data framework (kimball, dv) to work within from a business information consumption perspective, but my presumption is that this will be post-ingest (would you agree?) into a data lake from which a form of data framework (dv) can be established for the multi-faceted presentation layers.
not sure if any of this makes sense, but i’m really interested to hear how you see the juxtaposition between a) where we’ve been, b) where we are, and c) where our industry is going.
sanjay’s response to the above:
thanks for the slides. it gives me some perspective.
as i said earlier, it’s interesting. hpcc is even more interesting (to me personally) because they’ve had some of these capabilities for the last 10 years.
i have certain issues with:
- data lake – dumping without organization is in my opinion, the worst idea i’ve ever seen in the industry, even though i can see the “sales” appeal of it. it only generates more downstream work. in my personal opinion, a dv can even help organize it better.
- schema on read – again, generates more downstream work, despite its advantages.
- semantic triple stores aka graph databases not only suffer from scalability issues (which appear to be solved with this vendor), but also from a non-linear approach to a solution when one is needed (the simplest examples being aggregation of sets). they’re fantastic for link analyses or exploratory work, where it would be pretty difficult to implement in relational databases (with exceptions), but they’re still not ideal for a majority of bi use cases as we see them today.
- sparql is a standard but even graph vendors don’t like it so much. neo4j for example supports it, but prefers cypher (their own homegrown query language).
- it’s great to see acid compliance in a nosql database and it for sure will help adoption, however i see the c-consistency in the bi space pretty loosely defined where “eventual consistency” such as in cap is quite acceptable in most use cases.
all that said, i’d be surprised (rather pleasantly) if the downstream builds which are eu facing will be easy to build and perhaps even automate. how do you automate when you don’t have patterns and everything goes?
the data vault is only one potential solution and it may not even be right for the architecture you need. for most bi use cases, it’s the dw component (leans more towards an inmon style multi-tier but with capabilities to logically project marts to the business users with the flexibility of persisting if required).
in the relational world, it’s been proven to be well worth doing.
in the hybrid world, it serves as a glue where the nosql objects are tied to the relational world via hashed business keys.
in a purely nosql world, the jury is still out. it’s one possible solution and it’s definitely a good way to organize a data lake and make it more like a data warehouse while leveraging an existing knowledge base of bi history which enables downstream information mart builds as well. there’s of course more to it because the technical components are really a small part of the project. besides modeling, it includes build cycles, automation, project management and all other components.
also in terms of the juxtaposition question.
a) where we’ve been hasn’t changed for us because despite a history of successes and many solved problems we still hit the wall many times with people who just don’t want to see beyond kimball and still go and implement solutions which are doomed to eventually fail.
b) where we are is a very interesting place in history in technology. i like it because finally, we’re getting out of the grip of relational thinking. something that should have happened in the 1980s when lisp machines were available. it also becomes a very confusing environment because it’s filled with innovations with more to pick and choose than anything else.
c) if history tells us anything, marketing is what will eventually win over technical merit and the technically better solutions will finally get implemented into mainstream after a considerable amount of time with everyone thinking it’s the best thing ever to happen in technology. i’ve seen it and can pinpoint so many examples.
the e-mail trail continues, but for now – hopefully this is enough to give you a glimpse into what sanjay, michael, and i are working on when it comes to nosql, bigdata, hadoop, and datavault.
again, if you have any additional thoughts to add to this conversation, comments or observations, please add them to this blog entry.