i wrote a post a long time ago (2010) about the pitfalls and issues of data vault modeling. in this entry i will dispel some of the pitfalls and issues i originally discussed, as technology and platforms have come a long, long way. data vault 2.0 has also been published since then, and it solves many of these issues.
let's take this bit by bit…
in my original article i first stated:
“thus, the data in the data vault is not for end-user access (direct access), it is for power-user access and data mining or discovery operations.”
this is not entirely true. the data vault model comes in two forms: the raw data vault and the business data vault. in the business layer we construct point-in-time (pit) and bridge tables, along with business-rule-driven information sitting in hubs, links, and satellites, constructed for business access. the business data vault is not a full copy of the raw vault model; it is a sparse creation.
the other piece here is that we utilize views (virtual dimensions, virtual facts, virtual flat-wide tables) directly on top of the pit and bridge tables, joining directly to the raw data vault model. these views (virtual marts) are directly accessible by business users. with this level of control, we can be flexible and dynamic, and tune queries for high speed.
that said: i have a customer with 360 billion to 500 billion records in their raw data vault tables, using row-level security joins in the virtual dimensions and virtual facts, and getting sub-second query response times….
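to make the virtual-mart idea concrete, here is a minimal sketch in python with sqlite3. it is purely illustrative – the table and column names (hub_customer, sat_customer, sec_user_region) are hypothetical, not from any specific implementation. it shows a virtual dimension: a view joining a hub and a satellite, with a row-level security join restricting what each user may see:

```python
import sqlite3

# illustrative sketch only: all table/column names are hypothetical
con = sqlite3.connect(":memory:")
cur = con.cursor()

# raw vault: the hub holds the business key, the satellite holds descriptive attributes
cur.execute("CREATE TABLE hub_customer (hub_customer_hk TEXT PRIMARY KEY, "
            "customer_bk TEXT, load_dts TEXT, record_src TEXT)")
cur.execute("CREATE TABLE sat_customer (hub_customer_hk TEXT, load_dts TEXT, "
            "name TEXT, region TEXT)")
# row-level security table: which user may see which region
cur.execute("CREATE TABLE sec_user_region (user_name TEXT, region TEXT)")

cur.executemany("INSERT INTO hub_customer VALUES (?,?,?,?)",
                [("hk1", "C-100", "2017-01-01", "crm"),
                 ("hk2", "C-200", "2017-01-01", "crm")])
cur.executemany("INSERT INTO sat_customer VALUES (?,?,?,?)",
                [("hk1", "2017-01-01", "Alice", "EMEA"),
                 ("hk2", "2017-01-01", "Bob", "APAC")])
cur.execute("INSERT INTO sec_user_region VALUES ('analyst1', 'EMEA')")

# virtual dimension: a view joining hub + satellite, filtered by the security table,
# so no data is replicated downstream
cur.execute("""
CREATE VIEW dim_customer AS
SELECT h.customer_bk, s.name, s.region, sec.user_name
FROM hub_customer h
JOIN sat_customer s   ON s.hub_customer_hk = h.hub_customer_hk
JOIN sec_user_region sec ON sec.region = s.region
""")

rows = cur.execute("SELECT customer_bk, name FROM dim_customer "
                   "WHERE user_name = 'analyst1'").fetchall()
print(rows)  # analyst1 sees only the EMEA customer
```

the point is that the "mart" is just a view: change a business rule, and only the view definition changes, not the stored data.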
“turn our data into information.”
this has always been the mantra of all properly built bi solutions; data vault is no exception.
“how can real-time be realized if the data has to stop in different places?”
real-time? no problem
real-time is accommodated easily by the data vault modeling and loading patterns; there really isn't any issue here at all. we buffer the output in accordance with the service level agreements (slas) by utilizing the point-in-time and bridge tables. these snapshot structures perform many tasks, including allowing high-performance querying against the raw data vault directly, without moving or replicating data downstream to physical marts!
in my previous article i opened up a list of business issues; i will now address each issue as it stands in today's world. my answer appears directly below each issue.
- old issue: data in the data vault is non-user accessible.
- answer: not true. data in the data vault is user accessible through virtual dimensions, virtual facts, and virtual flat-wide tables that all utilize point-in-time and bridge structures underneath. this way, we do not need to replicate the data to yet another target. when the business rules change, the team can respond with incredible speed and agility – by keeping those views virtual.
- old issue: data in the dv is not “cleansed or quality checked”
- answer: true. data in the data vault is not cleansed or quality checked. if, however, you want “quality checked data” and to turn it into information, that is done through the business rules on the way to the mart layer. this can happen inside the business data vault, in the virtual mart layers, or in the pit and bridge table loads.
- old issue: benefits of the dv are indirect, but very real.
- answer: not all benefits are indirect. 90% of the benefits are direct, especially with data vault 2.0 (not necessarily with data vault 1.0 or the data vault model alone). in data vault 2.0, the direct benefits include incredible agility, big data adaptation, inclusion of nosql / hadoop platforms, roi calculations on gap analysis in the business layers, query performance, and so on. there are directly attributable and calculable benefits that result from a properly built data vault 2.0!! what does direct mean? lower costs to build, to maintain, and to enhance. faster delivery times, easier automation, easier scaling, easier parallelization, and the ability to calculate monetary value on each individual element.
- old issue: more up-front work is required for long-term payoff.
- answer: not entirely true. it all depends on scope. properly scoped sprints can result in one-day build-and-release cycles; again, this is dictated by data vault 2.0 and is not available in data vault 1.0. up-front work can be limited to a single set of output requirements and a decisive list of business keys. those business keys can be resolved in a 20-to-40-minute kickoff meeting with the right individuals involved. long-term payoff is seen right out of the gate, especially when implementing “day-long sprints”.
- old issue: business users believe (in the beginning) that they don’t need an “extra copy of the data”
- answer: true!! in the beginning (due to the slower hardware of 2010, when the original article was written), it was necessary to physicalize the information marts downstream. with today's technology (as of q3-2017), this is no longer the case. with platforms like snowflakedb, oracle exadata, teradata, sqlserver big data edition, kudu, impala, hive, and more, it is no longer necessary to physicalize the marts. we can (and often do) virtualize the downstream mart results without copying the data out of the raw data vault edw.
- old issue: elegant architecture is secondary to business churn.
- answer: true. this will always hold true, but what proves out time and time again in data vault 2.0 is that the elegant architecture as designed requires little to no re-engineering over time! this, in addition to virtual marts, cuts down on delivery time and allows rapid, agile deliveries to occur – sometimes in a matter of 20 to 40 minutes (specifically in day-long sprints). iterations of these outputs can be made within a single hour (including unit testing). this is not specific to data vault! it occurs in any properly built edw.
- old issue: using a dv forces examination of source data processes, and source business processes, some business users don’t want to be accountable, and will fight this notion.
- answer: true. they still fight this notion sometimes, but more and more business users actually have a strong desire to fix their data and their source systems. remember, this list was created in 2010 (a long time ago!!). these days most business users want to find and fix their issues; they want to be transparent and auditable. so this is no longer a problem. and it is not a data vault problem – any properly built data warehouse raises these questions.
- old issue: businesses believe their existing operational reports are “right”, the dv architecture proves this is not always the case.
- answer: true, this still happens. it is part of the gap analysis that occurs during normal, properly auditable data warehousing. this actually is not an issue; it's a source of benefits for the business. it happens in every good data warehouse built from any raw data storage.
- old issue: business users from different units must agree on the elements (scope) they need in the data vault before parts of it can be built.
- answer: this is no longer true. with proper scope, focused on day long sprints, we can achieve wonderful output in extremely rapid turn-around times. we do not need to wait until the business users agree to get output from the data vault. in fact, we can deliver information even when they don’t agree.
- old issue: currently there is only one source of information exchange, there are no books on the data vault (yet).
- answer: no longer true. data vault 2.0 has been published in: building a scalable data warehouse with data vault 2.0 (available on amazon.com worldwide). it is a huge book with 15 chapters and 698 pages; chapters 4 through 15 are hands-on exercises complete with real-world data sets, covering getting data in, getting data out, cubes, and even master data services.
- old issue: some businesses fight the idea of implementing a new architecture, they claim it is yet unproven.
- answer: this is also no longer true. data vault 2.0 has been proven time and time again from the department of defense, to the nsa, to lockheed martin, microsoft, commonwealth bank (australia), pepper financial (australia), and more… it’s been proven all over the world in many different organizations, and many different situations. we have reference clients that you can speak with if you like.
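several of the answers above hinge on virtualizing the information marts rather than copying data downstream. here is a small sketch of that idea (python with sqlite3; the table names and the country-standardization rule are made up for the example): a soft business rule is applied on the way out, inside a view, while the raw satellite keeps the data exactly as received – auditable and uncopied:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# raw satellite: data stored exactly as received, including messy country codes
cur.execute("CREATE TABLE sat_customer (customer_hk TEXT, load_dts TEXT, country TEXT)")
cur.executemany("INSERT INTO sat_customer VALUES (?,?,?)",
                [("hk1", "2017-01-01", "usa"),
                 ("hk2", "2017-01-01", "US")])

# soft business rule applied in a virtual mart view: standardize the country code
# on the way out, without altering or replicating the raw data
cur.execute("""
CREATE VIEW vmart_customer AS
SELECT customer_hk,
       CASE WHEN UPPER(country) IN ('US', 'USA') THEN 'US'
            ELSE UPPER(country) END AS country_std
FROM sat_customer
""")

result = cur.execute("SELECT DISTINCT country_std FROM vmart_customer").fetchall()
print(result)  # both raw variants resolve to a single standardized value
```

when the rule changes, only the view is redefined – no reload, no second copy of the data, which is exactly why the "extra copy" objection no longer holds.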
now that i addressed the business issues, let’s address the technical issues with data vault 2.0 in mind….
- old issue: modelers struggle to grasp the reasons behind “not enforcing relationships” on the data model level.
- answer: no longer true. not enforcing relationships at the data model level is now viewed as the only method for achieving little to no re-engineering of the data model; leveraging many-to-many relationships is the only way to load history and current data, and to future-proof the warehouse model against business-level change impacts.
- old issue: data vault model introduces many many joins
- answer: true, but this has been resolved. mpp also introduces many joins. we built the point-in-time and bridge table structures to overcome this issue, along with handling data co-location, partitioning, snapshots, equi-joins, and high performance for virtualization. the join issues are no longer a problem, and platforms have gotten bigger and stronger as well. anyone struggling with joins either doesn't understand pit and bridge tables properly, or hasn't gotten the right data vault 2.0 training.
- old issue: data vault model is based on mpp computing, not smp computing, and is not necessarily a clustered architecture.
- answer: the data vault model is not mpp dependent! just because it is based on mpp mathematics does not mean it doesn't run exceedingly well on smp computing! in fact, it runs exceptionally well on smp machines. it does not require mpp in order to be performant or successful; it simply means it is enabled to run on mpp when the platform is available. we have plenty of data vault solutions built on smp databases around the world. again, the answer for query performance lies in using point-in-time and bridge tables for getting data out, managing table and row elimination, star join optimization, and buffering output – yielding sub-second query response times…
- old issue: data vault contains all deltas, only houses deletes and updates as status flags on the data itself.
- answer: this is true; any good data warehouse or analytical solution should store deltas only. this is no different than a type 2 dimension, which also delivers deltas. in fact, solutions that are not delta driven often lead to double counting and double loading (especially on failures and restarts of loading jobs). this actually is a function of any properly built edw (data vault, 3rd normal form, or star schema).
- old issue: data must be made into information before delivering to the business.
- answer: half true. again, this is important for any properly built edw and is not specific to data vault. even if you have “just a data dump / data junkyard”, it's just data until it is turned into information and made usable by the business. that said: some data has value in its raw format, and can be delivered directly to the business (again through virtual dimensions and virtual facts) queried against point-in-time and bridge tables.
- old issue: modelers must accept that there is no “snowflaking” in the data vault.
- answer: true. snowflaking hasn't been recommended in dimensional models for years either. the risks are too great, and it causes nested sub-queries which do not perform well under heavy volumes (billions of records in a single table). snowflaking is poor, lazy design, whether in data vault, dimensional, or 3rd normal form modeling. it also breaks the agility of the design, causing problems with the “changes” that happen naturally to hierarchies over time and resulting in full re-engineering of old-style dimensions.
- old issue: stand-alone tables for calendar, geography, and sometimes codes and descriptions are acceptable.
- answer: true, although the proper term is reference data. we teach this in data vault 2.0 certification. this is no different than what we do in dimensional modeling with reference data, and no different than 3rd normal form. again, this is the responsibility of a properly built edw.
- old issue: 60% to 80% of source data typically is not tracked by change, forcing a re-load and delta comparison on the way into the dv.
- answer: not necessarily true. today's edws are usually driven by audit trails and change data capture mechanisms; whether dimensional or data vault, delta tracking is required (see the note on deltas above). these days, most customers with data vaults are loading transactions off message queues in real time, so only the changes / deltas are actually sent to the target for ingestion. again, this is not specific to data vault, but applies to all properly built enterprise data warehouses.
- old issue: tracking queries becomes paramount to charging different user groups for “data utilization rates” and funding new projects.
- answer: well, it is not necessary for success. 90% of edws today still don't track queries or utilization, and only a very few rdbms technologies actually do this for the environment. it literally has zero impact on the success of a data vault, a dimensional model, or a 3rd normal form data warehouse. that said: if tracking is turned on, the results and resulting analytics can be fed to a deep learning / machine learning algorithm to make dynamic changes to a data vault model (this is something the other modeling paradigms will not support). yes, dynamic data vault models can be created, maintained, and generated with neural nets, machine learning, and metadata algorithms – i've done it with government agencies as far back as 2003.
- old issue: businesses must define the metadata on a column based level in order to make sense of the data vault storage paradigm.
- answer: not true. the only business definition required in the data vault is an understanding of the business keys. that said: if a dimensional model is built properly, it too requires the thought process around business key identification! otherwise, how is a type 2 dimension truly defined? through disparate and unconnected metadata – which means the dimensional model will fall down, the same as a data vault model would. in fact, if you read kimball's data warehouse lifecycle toolkit book carefully, you'll see that the definition of a dimension calls for business key identification! after all, the data vault model is built from a hybridized set of best practices across dimensional modeling and normalized-form data modeling.
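the delta-only behavior discussed in the list above can be sketched as follows (python with sqlite3; this is an illustrative satellite load pattern using a hash diff, not a reference implementation – names like sat_product and the md5 choice are assumptions for the example). a hash of the descriptive attributes is compared against the latest stored hash, and a new row is inserted only when something actually changed, which is what prevents double loading on restarts:

```python
import sqlite3
import hashlib

def hash_diff(*attrs):
    # hash of the concatenated descriptive attributes, used to detect true deltas
    return hashlib.md5("|".join(attrs).encode()).hexdigest()

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sat_product (product_hk TEXT, load_dts TEXT, "
            "hash_diff TEXT, name TEXT, price TEXT)")

def load_satellite(product_hk, load_dts, name, price):
    hd = hash_diff(name, price)
    last = cur.execute(
        "SELECT hash_diff FROM sat_product WHERE product_hk = ? "
        "ORDER BY load_dts DESC LIMIT 1", (product_hk,)).fetchone()
    if last is None or last[0] != hd:  # insert only when the attributes changed
        cur.execute("INSERT INTO sat_product VALUES (?,?,?,?,?)",
                    (product_hk, load_dts, hd, name, price))

load_satellite("hk1", "2017-01-01", "widget", "9.99")
load_satellite("hk1", "2017-01-02", "widget", "9.99")   # unchanged: skipped
load_satellite("hk1", "2017-01-03", "widget", "12.49")  # price change: stored

count = cur.execute("SELECT COUNT(*) FROM sat_product").fetchone()[0]
print(count)  # only the two true deltas are stored
```

rerunning the same load is idempotent by construction, which is the point made above about failures and restarts.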
i hope you enjoyed this journey down memory lane. it’s fun to revisit and correct old notions, and to remind ourselves of the reality of true enterprise data warehousing. the data vault is, was, and will continue to be a serious driving force in the success rates of true enterprise data warehouses going forward.
data vault 2.0 brings with it methodology, architecture, modeling, and implementation – best practices, standards, automation and more. the ability to encompass and leverage disciplined agile delivery, and sei/cmmi, six sigma, lean initiatives, cycle time reduction, and proper build practices lead us to one day sprint cycles.
by the way, if you can't get something delivered (in development only) through sourcing, staging, data vault, and all the way to a data mart in a day, then you are not exercising agility properly… and most likely need to brush up on your data vault 2.0 skill set. these are all things that data vault 1.0 (cdvdm) does not and will not teach – it cannot teach these concepts because it does not include them. data vault 1.0 is relegated specifically to data vault modeling (less than 15% of the overall value when separated from the whole dv2 system of business intelligence).
however, when the model is utilized in conjunction with the other components, true enterprise value can be achieved in one day sprint cycles. that discussion will be saved for another day.
as always, comments and questions are welcome, please post them below.
(c) 2017, dan linstedt, all rights reserved