What is Data Vault 1.0?
The current iteration of the Data Vault was created over 20 years ago and published over 10 years ago. Systems and technologies evolve. So do architectures and methodologies.
Up until now, the Data Vault model and methodology have worked very well with most systems, especially relational databases.
There have always been issues surrounding columnar databases, key-value stores, triple stores, cloud databases and unstructured data. I’ve even written about the issues with them.
New technologies like Hadoop are becoming more and more popular, and I’ve had a lot of pressure from both students and clients to implement a Data Vault on big data systems and/or within an environment of such systems. It’s also a reality because these new systems are emerging as both sources and targets, and the existing standards for the Data Vault simply do not fit.
Although the Data Vault scaled to multi-petabyte capabilities as far back as 8 years ago, today’s systems can handle volumes many times that size and scale linearly. However, they’re not the same as relational databases. Using the same thinking that we apply to relational databases on these systems just won’t work.
There are also substantial performance improvements in the new model, thanks to tweaks to the Data Vault modeling standard.
“Everything out there today about the Data Vault model and methodology that
has been referred to or is being referred to as the Data Vault is
now deemed as Data Vault 1.0.”
It has stayed unmodified for over a decade, which is not a bad run at all. Any usage of the term “Data Vault” with reference to the Data Vault model and methodology and the Data Vault architecture for building data warehouses can automatically be labelled Data Vault version 1.0.
Solving problems only to introduce new ones has never been the goal of the Data Vault.
Therefore, the evolution of systems and tools has led to the new and improved …
Data Vault 2.0
Data Vault 2.0 subsumes and supersedes Data Vault 1.0. This is an automatic extension because 2.0 has enough new material and changes to the standard that it warrants versioning, both to protect any investment already made in the Data Vault by companies or individuals and to allow for the natural evolution of the standards. (This is not very different from the evolution of star schemas to the Kimball bus architecture, or from Bill Inmon redefining CIF with DW2.0 to address the needs of today.)
Data Vault 2.0 has a bigger focus on implementation than Data Vault 1.0.
Data Vault 2.0 addresses many issues, including performance concerns of the data models, big data, real-time and unstructured data components.
Data Vault 2.0 addresses issues with big data systems like Hadoop and HPCC compared to relational database systems like SQL Server, Oracle, MySQL, PostgreSQL, DB2 and others.
Data Vault 2.0 addresses MPP databases like Teradata, Greenplum etc.
Data Vault 2.0 addresses issues surrounding other NoSQL databases like MongoDB, Riak, CouchDB, Cassandra, HBase etc.
These are realities and need to be addressed. Customers are already asking for it. There are customers who have come forward and are willing to be beta testers for the implementation of a Data Vault directly on Hadoop using Hive (and there’s a lot of exciting news on that front).
While Data Vault 2.0 is still in its nascent stages, we have already run performance tests on the new modeling techniques using relational databases, with substantial improvements. Hadoop is in the lab stage.
Data Vault 2.0 is a new specification, and the certification will be introduced in 2013. So far, the only place you can get this certification is directly through me or through folks who have signed an agreement authorizing them.
The licensing for DV 2.0 is still being worked out. It will probably be along similar lines to 1.0, with the modeling being “open source” while I retain copyright to it. We’re debating this, and it may just end up being trademarked like DW 2.0, or it may follow some flexible open source license.