Thought Provoking Data Vault?

this is a research topic/research entry.  this is not a finalized product, nor an actual working model.  it is however something i have been researching and playing with in the “data vault labs”.  my friends are always asking me: what’s next? what’s after the data vault?  well, it’s not so much what’s after the data vault as it is: “what can you do with the data vault, how can you apply it in unique ways?”  the idea is this: the data vault is patterned off of a simplistic neural network, containing key-nodes, context over time, and associations or links.  i think of the data vault as the information that powers the neural net, and thus – one particularly interesting concept to apply is: how to make it “think”.

now by thinking i don’t mean: it acts like a human… no – too difficult given today’s technology.  by thinking i mean the following:

  • it establishes associations previously unknown and scores them
  • it learns key phrases, key concepts, key ideas – in other words, it begins to recognize business keys on its own
  • it can ingest new content (similar in nature), and establish new neural net nodes, new associations, or integrate the new content with existing content

ok, these are the technical ideas behind the notion… what does it mean to business?  it can mean several different things –

  1. it can allow data warehouse models to “adapt” dynamically to new content.  to business, this means a seamless back-end data storage device that automatically absorbs a new feed, and integrates it with existing concepts.  bi systems would be able to query the new content on-the-fly after ingestion.
  2. it would allow old/dead data (not used, not received anymore, not updated since… etc..) to be backed up, and automatically moved off the warehouse/out of the warehouse – keeping the backend quick and light.
  3. it would provide new insights into context that is “associated” where the business may not think the data sets are related; this is where the neural network kicks in.
  4. it would become a part of the business process chain, more importantly – automated adaptation.
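as a rough illustration of item 2 above, here is a minimal sketch of flagging “dead” keys as candidates for backup and removal.  everything in it is an assumption made for the example: the `find_stale` helper, the one-year idle threshold, and the sample keys and dates are hypothetical, and a real warehouse would read last-load dates from its own metadata.

```python
from datetime import date, timedelta


def find_stale(last_loaded, today, max_idle_days=365):
    """Return the keys whose most recent load is older than the idle threshold."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(key for key, loaded in last_loaded.items() if loaded < cutoff)


# hypothetical business keys mapped to the date each last received data
last_loaded = {
    "CUST-1": date(2009, 6, 1),
    "CUST-2": date(2006, 2, 14),
}

stale = find_stale(last_loaded, today=date(2010, 1, 1))
print(stale)  # candidates to back up and move out of the warehouse
```

the threshold is the governance knob here: nothing gets moved until a person decides how long “not updated since…” has to be.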

now i ran a recent (very informal) survey of people in the tdwi linkedin group about this.  i posed the question: what would you do with dynamic changes to your data warehouse model?  or something like this.  the negative reaction i got from IT folks was not surprising.  many complained that a system like this doesn’t have governance, that they could never implement anything like this because they needed to “own” the changes, test the changes, etc…

well, unfortunately i feel like these folks don’t understand what the future of business holds for them.  systems like this are already available in business, just not in the data warehousing world.  systems that dynamically change with the business can easily be found in the business process execution (bpel) environment, where business users can “change” business process flows of data in a graphical manner, designing their own data flows, source and targets – business to business interactions, etc…  the point is: the business is driving self-sustaining change.  the business needs it and wants it in order to survive, and if the data warehouse doesn’t change, it will die off as an industry.

yes, we need governance, yes, we need to monitor what’s going on, but at some point – we have to perform guided processes on “neural networks” that make automated decisions.  why should the data warehouse industry be any different?

by the way, in case you haven’t guessed it by now…  this pertains to the ingestion of unstructured data into the data vault on an automated basis.  let the neural network decide what the patterns are / should be, let the neural network figure out where the content maps to existing ideas or concepts.  let the neural network figure out how to talk to the results of the unstructured data mining application and tie them together.

now, how about making a data vault “appear” to think?  well, you probably figured it out – but the data sitting in structures is “dead” or lifeless.  it really doesn’t “think” by itself.  we need to couple these components with logic and a bit of understanding, and some algorithms.  this is where the neural network/fuzzy logic/artificial intelligence engines come in.  think about it…  what if you could teach a node that it contains information about cars.  these cars have certain characteristics…. these cars have “identity keys”, these characteristics change over time – but essentially the keys stay the same….
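to make the car node concrete, here is a minimal sketch of a key-node whose identity key stays fixed while its characteristics change over time.  this is an illustration under stated assumptions, not a working data vault model: the `Hub` and `Satellite` classes, the `VIN-0001` key, and the load dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Satellite:
    """One snapshot of descriptive context, stamped with the date it was loaded."""
    load_date: date
    attributes: dict


@dataclass
class Hub:
    """A key-node: the business key never changes, the context accumulates."""
    business_key: str
    satellites: list = field(default_factory=list)

    def add_context(self, load_date, **attributes):
        self.satellites.append(Satellite(load_date, attributes))

    def current(self):
        """Merge snapshots in load-date order so the latest value wins."""
        merged = {}
        for sat in sorted(self.satellites, key=lambda s: s.load_date):
            merged.update(sat.attributes)
        return merged


car = Hub("VIN-0001")  # hypothetical identity key
car.add_context(date(2009, 1, 5), color="blue", body="4 door sedan")
car.add_context(date(2010, 3, 1), mileage=149_999)
print(car.current())
```

the structure itself is still “dead” data; any logic that scans or compares these nodes has to be layered on top.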

now, what if you could take these ideas/concepts and apply characteristic discovery to matching characteristics of a person?  what if you had two nodes (hubs + satellites)….  independent of each other.  the car says it’s blue, 4 door sedan, and somewhere along the line it received a characteristic that it has 149,999 miles on it.  of course, it’s also described by year, make, model, etc…   what if there is a driver/owner of that car (no link between the two yet) – but the owner uses descriptions of a car they drive, characteristics of a car they drive… these descriptors are captured around the person hub to describe the owner.

why is there no link?  because the person doesn’t want to be automatically connected to the car… or because the person’s information comes from a different source that doesn’t identify the car, or any number of reasons.  the question is: can we find the owner of the car by scanning the information?

hmmm…  what if the neural net could scan the context (the satellites) for details about the person and about the car that seem to associate the car with the owner?  things like: where was the car last seen?  what town/city, what area of the city?  the address of the owner?  the proximity of the address to the “general location where the car was seen”?  what about “the color/make/model” of the car, are there any “close associations” found in descriptions of the car given by the owner in a document somewhere?

granted this is a very simplistic example, but imagine the power…  the neural network learns that certain nodes contain certain context, it therefore makes assumptions and creates new pathways (new links) that associate these kinds of information together.  it can then test the node for completeness, accuracy, and score it with a confidence rating… ie: how sure is it that these two, three, or four items (hubs) can be associated together?
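a minimal sketch of scoring a candidate link between a car and a person might look like the following.  to be clear about the assumptions: this is a hand-weighted matching rule standing in for the neural network, not a neural network itself, and the `score_link` function, the weights, the attribute names, and the sample make, model, and cities are all hypothetical.

```python
def score_link(car, person, weights=None):
    """Return a 0..1 confidence that the person is associated with the car."""
    weights = weights or {"color": 0.3, "make": 0.3, "model": 0.2, "city": 0.2}
    description = person.get("car_description", "").lower()
    score = 0.0
    # does the person's own description of their car mention this car's traits?
    for attr in ("color", "make", "model"):
        value = str(car.get(attr, "")).lower()
        if value and value in description:
            score += weights[attr]
    # was the car last seen in the city where the person lives?
    car_city = str(car.get("last_seen_city", "")).lower()
    if car_city and car_city == str(person.get("city", "")).lower():
        score += weights["city"]
    return round(score, 2)


car = {"color": "blue", "make": "honda", "model": "accord",
       "last_seen_city": "denver"}
owner = {"city": "Denver", "car_description": "I drive a blue Honda sedan"}
stranger = {"city": "Boston", "car_description": "red pickup truck"}

candidates = {
    "owner": score_link(car, owner),
    "stranger": score_link(car, stranger),
}
print(candidates)
```

a new link would only be proposed when the score clears some threshold, and the score itself would be stored with the link as its confidence rating.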

the system appears to be “thinking…” doesn’t it?  it changes structure in a fluid manner based on associative data that it finds within the data warehouse, or in some cases, in an external unstructured data source (which means it can accurately interpret language differentiation – totally different knowledge base/engine needed for this piece).  how powerful would the assumptions be to you in business?  what if the system e-mailed you every time it figured out a new association?  you then “taught the neural network/corrected it” by telling it “yes, this is good – or no, drop this association because it’s silly”.  the neural network then learns what your business preferences are, and adapts the structural changes better over time.
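the “yes, this is good / no, drop it” correction step can be sketched as a simple weight adjustment.  this is only an illustration under assumptions: the `apply_feedback` function, the learning rate, and the attribute weights are hypothetical, and a real learning engine would use a proper training procedure rather than a fixed nudge.

```python
def apply_feedback(weights, matched_attrs, approved, rate=0.1):
    """Nudge the weight of each matched attribute up (approved) or down (rejected)."""
    step = rate if approved else -rate
    updated = dict(weights)  # leave the caller's weights untouched
    for attr in matched_attrs:
        # keep every weight inside the 0..1 range
        updated[attr] = min(1.0, max(0.0, updated[attr] + step))
    return updated


weights = {"color": 0.3, "make": 0.3, "city": 0.2}

# the user rejects a link that was proposed only because the colors matched
weights = apply_feedback(weights, ["color"], approved=False)

# the user approves a link proposed on make plus city
weights = apply_feedback(weights, ["make", "city"], approved=True)

print(weights)
```

over many rounds, attributes that keep producing silly associations lose influence, and the ones the business confirms gain it.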

you see, to me this is governance – a round-trip solution.  to me this is a “semi-thinking” system, one that tries to “relate concepts together” based on language ontologies, data sets, key concepts, and context mining.  to me this is the future of data warehousing in an appliance-based component somewhere in the cloud computing landscape.

this is where i believe we are headed.  this is what i believe we can do today! when i said: i’ve found a way to associate new elements to an existing model, i meant it…  maybe there’s more to this than meets the eye?

so, what are the hurdles for getting there?

  1. a person has to build the foundational knowledge (the data vault model)
  2. a person has to teach the initial neural net what kinds of things are good, and give it some basis for associations
  3. a person has to (currently) run the unstructured data integration engine externally to ingest the data into the warehouse
  4. a person has to establish the “process” by which data arrives in the warehouse in real-time
  5. a person has to … you get the idea, there’s a lot that we have to do to set this up, but it is possible.

have some ideas around this topic?  i’d love to hear your comments.

dan linstedt

ps: want to know how to make a system like this?  i’ll share my thoughts with you in the coaching area…
