The notion of “Thinking” Data Vaults?

It’s time for me to jump back into the theoretical aspects, and consider some of the deepest roots of the Data Vault model…  That is: the natural world.  I’ve long held the belief that the Data Vault is modeled (albeit as a poor man’s model) after the neural images of what we believe our brains look like.  I have a hobby, as many of you may know, of reading about the brain and trying to understand the beauty and simplicity of its architecture.  Yet the architecture holds depth and complexity – or is it the function that holds these things?  In case you’re wondering what I’m working on, this is a dive into the theoretical, the unknown (or my unknown as the case may be).

I’ve written about it in the technical book, and I’ve shown images.  From what I understand, the way we think is a combination of the form (the diagrammatic model/form of the neurons, dendrites and synapses) and the function of the brain.  Or, as the case may be, multiple functions of different parts of the brain.  Some parts of the brain are said to house memories, other parts images of parents, and still other parts are said to deduce fight or flight.  Of course there is the major separation that we know of: short-term memory vs long-term memory.

I believe, when we think, or are cognitively aware, we are constantly taking in input from our senses (touch, sight, sound, taste, smell, etc.), grabbing specific images, words, thoughts from the “memory banks” as it were, and then applying context to the memories, using these individual building blocks to form a consistent and cohesive thought, one at a time.  They say the brain is a relatively “slow” computer, but how then can it get to emotions, considerations, and feelings – or even complete thoughts – so fast?  No one really knows…  but one thing is for certain: the brain, in all of its complexity, combines form, function, and content – in parallel.

I think that when we build systems in the Data Warehousing world, we are building primitive (very primitive) content stores.  Just like the brain, the content stores of a data warehouse hold data over time.  The brain tries its best to remember, categorize and index (if you will) information by time.  When you think of your 12th birthday, or your 8th birthday, these are both along the “birthday” index – or content/concept retrieval path.  Now, it’s a matter of “time” – as in WHEN did the event happen?

The next question might be: “Was the weather warm or cold?” or “Did your cake taste good?”

Of course, these questions are the questions that begin to lend context to the data or the information.  But I digress….

I believe that if we can build a system that recombines form (data model), function (retrieval, indexing, parallelism), and context (learning, neural networks, patterns of association, probability scoring) in a self-contained component (like hardware), then we can actually make a machine that begins to “perceive” things about the world around it.  I think the data model must resemble (in some way) the Data Vault, or to be more specific: a neural model, where the data is keyed off of important events, and where it has hundreds of connections (if not millions) to other information around it.  The Data Vault carries Links for these purposes.  I believe that adding function to the mix is critical in order to make use of the data, know where it is, and run the retrievals and updates in 100% parallelism.  Finally, of course, there is context.  This must be a combination of historical data, plus the “learning pattern” that is taught based on a finite world, along with teaching the “learning system” what the model truly is and how to leverage it.

The scientists say that when we learn something new, we form new neurons, dendrites, and synapses.  When we connect or associate memories, the dendrites get thicker – the stronger the memory, the more vivid the memory, the thicker the dendrites.  They say that Alzheimer’s patients suffer from memory loss because these connections (these dendrites) deteriorate, so the patients can no longer connect the proper memories to form context around their ideas.  They also say that neurons die off when not used, or when memory loss occurs.

All of these “features” of the brain make me believe that we can build a prototype of a perception system containing the Data Vault Model, and that the model (because of its nature) is best suited to dynamic alteration.  In other words, when the system learns new things, receives new inputs, etc., it can create Hubs & Links & Satellites on the fly for storage, and the “stronger the indicators” and the “higher the confidence”, the more links can be associated with that information.  In reality, I believe that the Data Vault Model lends itself to the beginnings of a self-optimization pattern, and that the model itself can & should morph automatically, or optimize according to the world around it.
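
To make the idea of creating Hubs, Links and Satellites on the fly a little more concrete, here is a minimal in-memory sketch in Python.  It is only an illustration under stated assumptions: the class name `DynamicVault`, its methods, and the sample keys are all hypothetical, and a simple observation counter stands in for the “strength” of an association.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DynamicVault:
    """Toy in-memory vault: hubs, links and satellites are created on demand."""
    hubs: dict = field(default_factory=dict)        # (hub_name, business_key) -> load time
    links: dict = field(default_factory=dict)       # frozenset of two hub ids -> times observed
    satellites: dict = field(default_factory=dict)  # hub id -> list of (load time, attributes)

    def add_hub(self, hub_name, business_key):
        hub_id = (hub_name, business_key)
        # A hub row is inserted only the first time its business key is seen.
        self.hubs.setdefault(hub_id, datetime.now(timezone.utc))
        return hub_id

    def add_satellite_row(self, hub_id, attributes):
        # Descriptive data is appended, never updated, so history is kept.
        self.satellites.setdefault(hub_id, []).append(
            (datetime.now(timezone.utc), dict(attributes))
        )

    def associate(self, hub_a, hub_b):
        link_id = frozenset((hub_a, hub_b))
        # Repeated observations strengthen the link instead of duplicating it.
        self.links[link_id] = self.links.get(link_id, 0) + 1
        return self.links[link_id]


# Hypothetical usage: two business keys arrive and are seen together twice.
vault = DynamicVault()
customer = vault.add_hub("customer", "C-1001")
product = vault.add_hub("product", "P-42")
vault.add_satellite_row(customer, {"name": "Ada"})
vault.associate(customer, product)
strength = vault.associate(customer, product)
print(len(vault.hubs), strength)
```

Nothing here is declared up front; structures appear as inputs arrive, and the link counter is the crude stand-in for a thickening connection.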

Now before you go jumping off the deep end, or quoting some obscure scientific reference to me, please be aware that this is just a thought experiment.  So in keeping with this tone, if you have contributions or arguments against this, please voice them here by replying or commenting at the bottom of this post.  Also note that I am not a brain surgeon, nor a neurologist, nor am I a cognitive scientist.  I’m just an interested and curious computer scientist who dabbles in the theoretical possibilities of arriving at a dynamic and self-sustaining system.

Just imagine for a minute what it might be like to have a truly back-office self-healing (not self-aware), but self-adapting historical data store or memory, capable of “spotting new associations” for us, presenting those to us as a mechanism for review, and through that review or human interaction, we teach and guide the system to do better the next time…  What would that mean to you?  Is this even interesting?

Curious to hear your theoretical thoughts….

Best Regards,
Dan Linstedt

6 Responses to “The notion of “Thinking” Data Vaults?”

  1. Jonathan Shirey 2011/05/11 at 8:09 pm #

    Dan,

    This is very interesting. I think you are taking us one step closer to the “Singularity” (http://singularity.com/). With this in mind, I find the direction you are going with this very interesting… I am still trying to wrap my mind around how to get a machine to learn with confidence scores. There will still be a point where a “choice” will have to be made, and will a machine have enough factors to make a “good” one against the backdrop of multiple alternatives?

    Jonathan

  2. Victor Geerdink 2011/05/12 at 3:13 am #

    Hi Dan,

    First of all a Disclaimer, I am a computer scientist as well 😉

    I think a fundamental difference between the brain and a data vault is that they have different goals. The data warehouse stores organised links and indexes such as the birthday index you describe, while I think the brain works in a different way. In the brain it is actually another kind of trigger, such as the smell of a birthday cake, which makes us remember our birthdays. So I would say our brain isn’t organised in such a clear way as a data vault but is actually a chaotic system with random connections which are very individualistic. So maybe we should leave more room for chaos in our data warehouses and data vaults, or find a way to introduce it in an effective way.

    Another important factor in the workings of the brain is emotion. When we form memories, emotions are an essential factor in the way they are recollected. The most lively memories are the ones we have a strong emotional bond with. This function has served us very well in the past, so it is interesting to see that we are trying to ban it when it comes to Business Intelligence.

    As Business Intelligence specialists we might rely too much on order and reason. But in the end it is not about the data; it is about the insights an organisation gets out of it. It might be that organizing it in a different way would actually produce better and more actionable insights.

    I think if we are going to move to more self-learning systems we can’t ignore learning more about how people make decisions and how emotions and uncertainty play a role in these decisions. Only then can we find a way to present the data in an effective way that really supports the decision process. The way the data is presented might even have to adapt to the person who is interacting with it. Different users might be sensitive to different kinds of arguments and connections. Besides that, the same information might produce different responses in different people, so we might even want to present certain types of information to people who can use it in the most effective way and not show it to people who might react in an unproductive way.

    But just as you say these are just some theoretical thoughts ;).

    Cheers,
    Victor

  3. dlinstedt 2011/05/12 at 4:48 am #

    Hi Jonathan,

    The simplest form that I can think of for automated adaptation can be done as follows:

    a) a new data element arrives during the night, and is attached to an existing feed or transaction
    b) the engine has no clue about the structure, but the nature of the element (non-unique, some null values, and possibly a string field over 500 characters in length) suggests that this field is descriptive
    c) the business keys of this feed have already been identified in the model, the “ETL mappings” have already been identified
    d) the new element’s field name is abbreviated with constructs similar to those of a group of descriptions existing in a Satellite

    The result? The probability model can easily determine that this new element is most likely part of an existing Satellite.
    The neural net analysis (of the METADATA coupled with a small mining activity in the DATA SET) yields a confidence score of 90% that it should be in the existing Satellite.
    The STRENGTH rating of this metadata, in relationship to the other Satellite data (based on ontology, model structure and existing ETL), tells the engine that the strength of this relationship to the other metadata is 95%.

    The neural net engine can then (with confidence) place the new element dynamically/automatically into the downstream Data Vault.
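
A rough Python sketch of steps (b) and (d) follows. It deliberately swaps the neural net for two plain heuristics (a column profile and a string-similarity check on field names), so it illustrates the shape of the decision rather than the engine described here. All function names, the equal 0.5 weighting, and the sample satellite and column names are assumptions made up for the example.

```python
from difflib import SequenceMatcher


def looks_descriptive(values):
    """Heuristic from step (b): some nulls, non-unique values, or long strings."""
    non_null = [v for v in values if v is not None]
    has_nulls = len(non_null) < len(values)
    non_unique = len(set(non_null)) < len(non_null)
    long_text = any(isinstance(v, str) and len(v) > 500 for v in non_null)
    # Fraction of the three indicators that fired, between 0.0 and 1.0.
    return (has_nulls + non_unique + long_text) / 3


def name_similarity(new_name, satellite_columns):
    """Step (d): best string similarity between the new name and existing names."""
    return max(
        SequenceMatcher(None, new_name.lower(), col.lower()).ratio()
        for col in satellite_columns
    )


def best_satellite(new_name, values, satellites):
    """Rank candidate satellites; returns (satellite_name, score)."""
    descriptive = looks_descriptive(values)
    scored = {
        sat: 0.5 * descriptive + 0.5 * name_similarity(new_name, cols)
        for sat, cols in satellites.items()
    }
    winner = max(scored, key=scored.get)
    return winner, scored[winner]


# Hypothetical metadata: two existing satellites and one new field.
satellites = {
    "sat_customer_desc": ["cust_desc_short", "cust_desc_long"],
    "sat_customer_address": ["addr_line1", "addr_city"],
}
values = ["prefers email", None, "prefers email", "call after 5pm"]
print(best_satellite("cust_desc_notes", values, satellites))
```

The score that comes out is only as meaningful as the heuristics behind it; a trained model would replace `best_satellite` while keeping the same inputs and outputs.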

    This is what I call Dynamic Data Warehousing, or Dynamic Data Vault Modeling – where the score & ranking & confidence are placed into ranges: GREEN, YELLOW, RED.

    GREEN indicates “safe to do, high confidence, high probability, high strength” – the engine does it automatically, but still sends an email to the human for guidance. The YELLOW range indicates that it thinks it’s safe to do, but the confidence has dropped off, or the probability it has computed has dropped off (strength remains high). The RED range indicates that all scores have dropped out of range, and it is making a best guess.
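
The three ranges can be sketched as a small classifier. This is a minimal illustration, assuming a single cut-off of 0.85 for “high” (no numeric ranges are given here, so that value is hypothetical) and assuming that any combination not covered by the description falls into RED.

```python
from enum import Enum


class Band(Enum):
    GREEN = "high confidence, probability and strength"
    YELLOW = "strength high, confidence or probability reduced"
    RED = "scores out of range, best guess"


HIGH = 0.85  # hypothetical cut-off; no numeric ranges are given in the text


def classify(confidence, probability, strength, high=HIGH):
    if strength >= high and confidence >= high and probability >= high:
        return Band.GREEN
    if strength >= high:
        # Strength is still high, but confidence or probability has dropped off.
        return Band.YELLOW
    # Everything else is treated as RED here, which is a conservative assumption.
    return Band.RED


print(classify(0.90, 0.92, 0.95).name)  # GREEN
print(classify(0.70, 0.92, 0.95).name)  # YELLOW
print(classify(0.40, 0.35, 0.30).name)  # RED
```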

    Undoubtedly the human, after receiving the email, will have to make a decision to allow, or not allow the action. By not allowing the action, they will be TEACHING the engine what to do with a similar situation next time it is encountered, thereby improving the next decision it makes.

    Given a FINITE set of knowledge (such as within a line of business), this engine will eventually learn nearly all situations. It can then begin to re-focus its efforts on “watching the model itself” and “mining all the time” both the metadata (structures) and the data set (coupling the results), and then morphing the model: proposing new links, new scores, and optimized model changes. Potentially it can also “watch” the queries fired at the databases, and mine the query metadata for joins and usage patterns – thus “morphing the model even faster” according to the questions it’s being asked.
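
As a small illustration of “watching” the queries, the Python sketch below counts which tables show up together in logged SQL and proposes pairs that have no link yet. It is an assumption-heavy toy: the regular expression only catches simple `FROM`/`JOIN` table references, and the function names, the `min_count` threshold, and the sample queries are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Matches the table name after FROM or JOIN; a real SQL parser would be more robust.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def join_usage(queries):
    """Count how often each pair of tables appears together in one query."""
    pairs = Counter()
    for sql in queries:
        tables = sorted({name.lower() for name in TABLE_REF.findall(sql)})
        pairs.update(combinations(tables, 2))
    return pairs


def propose_links(queries, existing_links, min_count=2):
    """Suggest frequently co-queried table pairs that have no link yet."""
    return [
        pair for pair, count in join_usage(queries).most_common()
        if count >= min_count and pair not in existing_links
    ]


# Hypothetical query log.
query_log = [
    "SELECT * FROM customer c JOIN product p ON c.id = p.cust_id",
    "SELECT * FROM customer JOIN product ON 1 = 1",
    "SELECT * FROM customer JOIN store ON 1 = 1",
]
print(propose_links(query_log, existing_links=set()))
# [('customer', 'product')]
```

The proposals would still go to a human (or to the scoring step) before the model is changed.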

    This is a function of the form and content of the structure of the Data Vault.

    There is always a “choice”, but introducing a learning system to make those choices means it is teachable.

    Cheers,
    Dan L

  4. dlinstedt 2011/05/12 at 5:05 am #

    Hi Victor,

    Thank you for the comments. But I must respectfully disagree. Different goals? How can “memories in the brain” be different from the storage of facts in a warehouse? Is a “smell” really truly different from “data of the chemical makeup of an atomic structure that makes up that smell”? Is an image seen by the eyes not equivalent to a digital replication that a camera would take? Is the sound different than the MP3 file that stores the data to recreate that sound? I personally believe that the influences of my hobbies have impacted my design significantly, and on purpose. The data warehouse does not have to be “organized” per se (it is my belief), but I think where I disagree with you is that I believe the data / memories in the brain are truly organized, hierarchically organized – and to boot – the brain uses keys in an ontology for searching in parallel. I believe we think in categories and hierarchies (in parallel of course). I think statements about context are what make the business keys so valuable.

    I think these ontologies and categorizations are what make the brain able to “apply context quickly, as well as search for what we need.” I believe however that technology has a LONG way to go to catch up – we have millions of connections in the brain between neurons, and yet, in the “database world” we complain constantly about the number of JOINS! (This really makes me laugh.) Off-topic: when was the last time anyone needed to query the ENTIRE data warehouse joining ALL the tables to answer ONE question? Come on… The number of joins is superfluous and doesn’t matter. In the human brain, I can’t think of any one thing that “lights up” the ENTIRE brain all at once except death (at least from the Scientific American studies that I’ve read).

    However, we are all entitled to our beliefs, and your proposal of non-organization is interesting. I would be curious to “pit the two approaches” against each other in the learning / dynamic environment that I have described, and see which one fares better. Perhaps closer to chaos might be the Anchor Model – less depth, less organization, and more focus on attribution. I think the Data Vault is geared to be organized and contextualized by the hierarchies that naturally exist according to content and topic.

    I believe emotions are important, but they are descriptive data, and “how you felt” at that time would be well suited to storing in a Satellite. Possibly there is even a case for “reference data”, seeing as there is a finite list of emotions we have, yet it describes how we felt, and assists in determining what to do when we retrieve that data.

    I do agree with you on order and reason. Insights are there to be gained, and yet again – this is where not enough work has been done on USING ontologies WITHIN a neural network to apply context and meaning to the data sets in our data warehouses. I believe there are probably 50 or more GREAT ontologies that can be used to organize any one single line of business; this is where the knowledge is, this is where the different insights and different discoveries will take place.

    No, we definitely can’t ignore learning more about how people make decisions, but I think in the data warehousing world, we need to start by paying attention to classification and categorization FIRST – that will be the key to the method that unlocks some of the secrets of dynamic restructuring. I also believe that it will take more than one ontology at the same time in order to get it right… For instance, one hierarchy of “emotional content” (data of interest), and one hierarchy of metadata categorization – line of business knowledge (let’s just say naming conventions), and possibly one hierarchy of datatypes – these three ontologies together would make the learning engine fairly bright to spot answers of importance to us. NO – I AM NOT SAYING IT’S A THINKING MACHINE, I AM SUGGESTING THAT IT CAN COMPUTATIONALLY DECIDE WHAT WE SHOULD BE LOOKING AT BECAUSE WE EXPRESSED INTEREST IN SPECIFIC AREAS.

    Very good comments, I hope there are more to come.

    Cheers,
    Dan L

  5. Victor Geerdink 2011/05/12 at 6:26 am #

    Well, I think this comes back to a philosophical discussion of whether the world is subjective or objective. I think a memory is different than a fact in a table in the way that we reconstruct memories when we remember them. That is why memories can be easily changed and manipulated with hypnosis and other recall methods, while a fact in a table (such as an atomic structure) is something that is registered once and stays the same over time (when you recall it 100 years later it should still contain the same information).

    I do believe that people think that their brain is hierarchically organized and that their thought processes are rational and causal. But I believe that it is not because it actually is that way, but that we as humans just can’t accept illogical and irrational thought processes and try to fill in the blanks ourselves to make our story and memory coherent even if it wouldn’t fit the ‘objective’ reality.

    Next to that, the digital image might objectively be the same as what the eyes see, but I might be focusing on the lovely view of the sun in the picture while somebody else would look at the people who are smiling in the picture. We would both have completely different memories when asked what was in the picture.

    When I think about joins in the brain I don’t think about joins on a specific key value (such as primary and foreign keys), but more in the sense of a neural network where we have loose associations between all kinds of information, where one piece of information might have an association with everything else in your data set; this connection might even be with variables you haven’t collected yet. The main challenge here would be to identify what kind of connections are there and how you would value them. Some connections might even be a 1-1 connection, but I think the changing context such as time would probably affect the certainty of a lot of the data (is the relationship between the sales of hot dogs and beer still valid in 10 years or do we degenerate this relationship over time; how does new data affect this relationship, does it further strengthen it or does it weaken the relationship?). This of course could be one of the 50 ontologies that you describe. But I do think we should probably look further than just the hierarchical method of classifying data. The hierarchical method has been very valuable, and still is, but if we want to make the next step in information retrieval from large data sets, we should probably look at different ways to classify ontologies as well (let’s not forget the poor platypus, who has been done a great injustice over time).
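
One way to make the hot dogs and beer question concrete is to let a relationship’s strength decay with age and be nudged by new evidence. The Python sketch below is only one possible choice, assuming exponential decay; the five-year half-life, the step size of 0.1, and both function names are made-up values for illustration.

```python
import math


def decayed_strength(strength, years_elapsed, half_life_years=5.0):
    """Exponential decay: the strength halves every `half_life_years`."""
    return strength * math.pow(0.5, years_elapsed / half_life_years)


def update_strength(strength, years_elapsed, supports, step=0.1, half_life_years=5.0):
    """Decay the old strength, then nudge it up or down with new evidence."""
    current = decayed_strength(strength, years_elapsed, half_life_years)
    current += step if supports else -step
    # Keep the result inside the 0..1 range.
    return min(1.0, max(0.0, current))


# Ten years is two half-lives, so a quarter of the original 0.8 remains.
print(decayed_strength(0.8, 10))
# New supporting data after those ten years pushes the strength back up a little.
print(update_strength(0.8, 10, supports=True))
```

Whether relationships should decay at all, and how fast, is exactly the open question raised above; the half-life simply turns it into a tunable parameter.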

    I have no idea how this should be modeled, but it certainly is a radically different approach from the structured table approach. This would also require a whole different approach than the relational database approach that has served us so very well for the last 40 years. I do believe it is time for a paradigm shift, some radically different way of thinking about this. But to be honest I have no idea where to start searching for different solutions ;-).

    Cheers
    Victor

  6. dlinstedt 2011/05/12 at 6:46 am #

    Hi Victor,

    Great ideas, and cool explanation. I agree with you, there is no “sure-fire known way” to start any of these efforts, which leaves the ground un-touched as it were. My efforts are to take baby steps in that direction, and if something doesn’t work, tear it down and try something else. Now, simply on a point of conjecture, I would argue that “interpretation and conceptualization” is what we know to be thinking, and that the data (as it is facts) is basically stored in the brain much the same way facts are stored in the Data Vault. My belief stems from the following cases:

    1) When someone “is considered crazy”, or has “lost the grasp of the real-world around them”, or “can’t reason within the boundaries of what we call reasonable”, they still have memories, images, thoughts, and facts floating around in their heads; it’s just that their brain is mis-wired in some way. Scientific studies have proven this to be the case: the mis-wiring leads their thought process to pull together data that doesn’t make sense to the rest of us, and then to apply context based on their own life experiences, and sometimes it’s made-up. I think the Data Vault is just the structure, and it is clearly lacking in the functionality of thought processing. Form, function, and conceptualization are all consolidated within the brain, which is one of the reasons why no-one understands HOW we think. On the other hand, when these types of individuals think of the “same thing” over and over again, the same areas of the brain light up again and again, which leads me to believe that it is NOT a random storage of information.
    2) I think we have illogical and irrational thoughts all the time; it’s a matter of the function and conceptualization of those thoughts (generally called thought capture or guarding your thoughts) that keeps them from springing forth. I think a prime example of this is when I play with my kids, particularly when they were zero to 6 months old, or when I make up stories when they’re 3 and 4 years old, not only to teach them, but to interact with them. I used to tell them “the sky is purple”, after they learned the sky is blue, to see if they would “correct me”; it took them roughly 1.5 years to begin to correct me, to figure out what “sarcasm is”… Further studies and research have shown that when a subject is shown a “picture” or told to think about a specific thing, it doesn’t matter WHO the subject is; the same area of the brain lights up to think about the “same subject” across different people. This leads me to believe that our brain is more structured and more organized than we actually think it is.

    Sure there are random connections made and broken all the time. Random connections are a result of the learning process, breaking those connections is a thought or contextualization process – this is where our brains are different from person to person. The connections tell the story of “perception based on experience”, where the same area of my brain would light up (as yours) thinking about a car for instance, the way I think about the car itself is different than you, so in that regard, the dendrites and synapses (connections or links) that I have between my neurons are different than yours, lending to differences in opinion.

    This is where I begin to “step off the edge” and say: I believe that if CONTEXT (meaning/association, place in at least ONE hierarchy) can be tied directly to linking structures, we can make some serious advancements in the nature of “learning systems” and dynamic data models. Each new “governing or overriding” ontology/hierarchy would lend to new linking structures, and not necessarily new Hubs or new Sats…

    But then again, this is just my interpretation of the facts (as it were, all puns intended),
    Dan L
    PS: Great conversation….
