
# datavault 2.0 supports dynamic data warehousing

it’s been a while since i blogged about features and functionality of the data vault, but here is an entry on something i’ve been talking about since 2001: dynamic data warehousing.  this is the next step beyond active data warehousing / operational data vaults.

first some references:

so as you can see, these ideas are not new.  i’ve intentionally designed the data vault model to be adaptable at run-time in production environments.  however, the technology and the capabilities have finally caught up with the designs and intentions.

now, today, you can (if you haven’t already) create an actual dynamic data vault / dynamic data warehouse (ddw).

the parts of a ddw

as a reminder, there are several moving parts to the ddw, and i’ve tried to summarize my previous articles by providing the following list / overview:

  • adaptable data warehouse structures
  • automatic / self-healing structures
  • real-time changes to source structures resulting in real-time changes to data warehouse structures
  • automated generation / addition of data mart structures
  • physical attributes accompanied by business names and metadata definitions – from source all the way to the data marts
  • detection of new arriving attributes
  • some form of “understanding” of where & how to automatically attach attributes to the ddw, and to the data mart layers

the basics are these: the source system is dynamic, whether the physical source changes its physical structure, or it’s a source based on key-value, eav, json, or something else.  the point is, it doesn’t matter how the change originates.  what matters is that the change originates in real-time, in production, and is driven by the customer.
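for a semi-structured source like json, a customer-driven change simply means new keys showing up in the payload.  here is a minimal sketch of detecting them, assuming the warehouse tracks a known-attribute set per source entity (all names below are hypothetical):

```python
import json

def detect_new_attributes(payload: str, known: set) -> set:
    """return attribute names in the payload that the warehouse model has not seen yet."""
    record = json.loads(payload)
    return set(record) - known

# hypothetical customer feed: 'loyalty_tier' is a field the customer just added
known_attributes = {"customer_id", "name", "email"}
incoming = '{"customer_id": 42, "name": "acme", "email": "a@b.com", "loyalty_tier": "gold"}'

new_attrs = detect_new_attributes(incoming, known_attributes)
print(new_attrs)  # {'loyalty_tier'}
```

in a real pipeline this check would sit on the message queue / esb, so the new attribute is caught the moment it first arrives.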

the next basic tenet is that the physical changes are accompanied by sufficient metadata: business meaning, definitions, and logical attribute names.  that metadata is then used during the “automation process” that makes the structures in the warehouse / data vault dynamic.  the automation process also looks at where best to place the attribute(s) downstream in the virtual and/or physical marts – resulting in dynamic information marts downstream of the ddw.
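as a sketch of that placement step, here is a crude rule-based decision driven purely by the accompanying metadata.  the metadata fields (is_business_key, references_entity) are assumptions for illustration; a real routine would lean on statistical profiling or a trained model rather than two if-statements:

```python
def place_attribute(meta: dict) -> str:
    """decide where a newly detected attribute belongs in the data vault."""
    if meta.get("is_business_key"):
        return "hub"        # unique business identifier
    if meta.get("references_entity"):
        return "link"       # relationship to another entity
    return "satellite"      # descriptive attribute, tracked over time

print(place_attribute({"name": "loyalty_tier", "definition": "customer loyalty level"}))
# satellite
```

the point is that the decision is a pure function of metadata – which is exactly why the metadata must travel with the structural change.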

now, the fact that all of this happens in real-time, in production, cannot be overlooked.  these structural changes must be applied in real-time, in production, to the ddw, to the integration and movement processes, and to the downstream data retrieval / data release processes.  automation and generation become key processes in the self-healing notion.
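a minimal sketch of the generation side, assuming the new attribute has already been classified as a satellite column.  the table and column names are illustrative; a production routine would also version the model and regenerate the load processes, not just emit ddl:

```python
def generate_add_column(satellite: str, attribute: str, sql_type: str = "varchar(255)") -> str:
    """emit the ddl that applies a structural change to a satellite table."""
    return f"alter table {satellite} add column {attribute} {sql_type};"

ddl = generate_add_column("sat_customer_details", "loyalty_tier")
print(ddl)  # alter table sat_customer_details add column loyalty_tier varchar(255);
```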

finally, there is understanding how to release all of this into the bi and analytical tooling layers – so that business users can access the changes immediately.  this is the crowning achievement of a great ddw.

can any vendor do this today?

not to my knowledge.  no single vendor is focused on this specific space.  no single vendor addresses structural change detection, much less a neural net or statistical decision-making capability for applying structural changes all the way through to the bi / analytical layers.

but…  there is hope!  i cannot announce anything here yet, except to say that i am currently working with at least one, possibly two vendors on building it at a client site.  more on that once i can release details – so stay tuned.

can i build this today?

yes, to a degree.  there are several ways to make a ddw happen – but for the most part these methods do not include dynamic information marts!  getting the new structures / new attributes picked up (even by managed self-service bi tools) is a huge problem today (as i mentioned above).

so how can i build this today?

well, first – you can hire me.  i’m not kidding.  you really need a guide to get you to the point of making this happen.  i’ve built about 10 of these over the past 12 years, and there are pitfalls, problems, and issues to solve – even at the business level, not just at the technical level.  i can give you the lessons learned and the best practices around building this kind of environment, so that your project stays on track in the most agile fashion.

ok, that said: to get just the technical pieces started, you need:

  1. customer-facing, real-time, production adaptability.  meaning the customer using the source application makes the changes / adds custom fields, etc…  you can make the argument “we in it change those things”, but i will disagree: that does not constitute “in production, dynamic structure modifications”.  yep, your source application (like salesforce, for instance) needs to allow custom field definitions.  or: you ingest xml, json, or web-service based content that is dynamic in nature, yet still semi-structured.
  2. real-time (or near real-time) delivery to a relational database system or a hadoop / hive based system.  real-time delivery using an esb or message queue can / should pick up dynamically created fields and attributes from the source, add them to the payload, and deliver straight to the data vault.  traditional fixed-structure batch tools (etl / elt, etc.) do not cut the mustard here.  i do not consider a tool dynamic when someone must “make the change” to the structure, check it in, and so on.
  3. a data vault based data warehouse – either in a traditional rdbms or in hive / hadoop.  it turns out that in hive and hadoop you can use json, xml, and document satellite structures automatically; they hook right into the hubs and links that are already there.  this is the most dynamic adaptation of structure there can be.  in an rdbms, you will also need an automated routine (smart automation / neural net) that can answer the questions: is this a business key?  is this a relationship?  is this a satellite attribute – and which satellite does it belong in?
  4. an automation / code generation routine to generate structure changes and send e-mails, grading each change green, yellow, or red and responding appropriately.  this automation routine *should* also generate views on top of the data vault, including point-in-time and bridge tables as defined in the dv2 standards.
  5. virtualized information (data) marts.  the more virtual you are, the easier it is to refactor a view and dynamically add attributes – making it possible to release these new attributes through automation.  if you’re stuck with physicalized data marts, then you have additional technical issues (indexing, partitioning, etc…) to worry about.
  6. metadata capture at the source, including business metadata and business meaning.  this is absolutely critical for increasing the confidence levels of the automated routines figuring out where and how to put the new attribute.  without it, placement becomes a manual task (and hence, no longer dynamic).  that metadata needs to pass from the source, all the way through the ddw, into the marts, and off to the bi tool at the end / the consumer.
  7. a bi tool that is dynamically adaptable through an api call or calls.  we *must* be able to dynamically add attributes to the appropriate places in the bi / analytics tools, completing the source-to-target lineage in an automated fashion.  this is not only the icing on the cake, it is the very definition of ddw.
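the grading step in item 4 above can be sketched as a simple confidence threshold.  the thresholds here are arbitrary assumptions, and the confidence score itself would come from whatever neural net or statistical routine classified the attribute:

```python
def grade_change(confidence: float) -> str:
    """grade a proposed structural change for the automation routine."""
    if confidence >= 0.90:
        return "green"   # apply automatically, notify by e-mail
    if confidence >= 0.60:
        return "yellow"  # apply, but flag for human review
    return "red"         # hold the change, require a manual decision

print(grade_change(0.95))  # green
```

red changes are the manual-task fallback – which is exactly why item 6 (rich source metadata) matters: better metadata means higher confidence, fewer reds, and a more dynamic warehouse.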

so you see, there are quite a number of moving components, including a neural net in the middle (which introduces a concept i call metadata mining)…  not something you are used to, i’m sure.  anyhow, i’m happy to help you get here; just drop me a line if you think this is valuable.

by the way, businesses that engage in this effort save tons of time, effort, and money across their entire edw / bi projects.  more on this later…

(c) dan linstedt, 2016
