for the past 2+ years i’ve been working with sanjay pande. we’ve been addressing the question: what happens to a data vault model when nosql is involved as the platform? in this entry i will address this question at a management level. if you want to know more about the details, you’ll have to attend this year’s conference: http://wwdvc.com
first, let’s understand how i define nosql
nosql, in my definition, really means: not only sql. but there is another definition out there which is also valid: truly “no sql interface available”.
i once wrote c++ libraries to access xerox documentum servers (arguably one of the first nosql data stores). the interface had several basic function calls, among them get and put, plus a few others. the point is that technology which removes common accessibility standards like sql (i.e., structured query language) makes it difficult for business users to access the raw data, let alone the historized data, or even the information (if there is any to be had).
anyhow, most nosql systems are adapting or writing some form of sql hybrid engine on top of their solutions to allow “common language” access to the data sets inside, without writing programmatic code (aside from sql itself).
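to make the get and put point concrete, here is a minimal python sketch. the KeyValueStore class, the key format, and the customer records are all hypothetical and are not taken from any real product. it only illustrates why a store without a query language pushes the work onto whoever writes the code.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """A toy store exposing only get and put, with no query language."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


store = KeyValueStore()
store.put("customer:1001", {"name": "acme corp", "country": "US"})
store.put("customer:1002", {"name": "globex", "country": "DE"})
store.put("customer:1003", {"name": "initech", "country": "US"})

# Looking up one record by its key is easy.
acme = store.get("customer:1001")

# "Which customers are in the US?" has no declarative form here.
# The caller must already know the keys and hand-write the filter.
known_keys = ["customer:1001", "customer:1002", "customer:1003"]
us_customers = sorted(
    store.get(key)["name"]
    for key in known_keys
    if store.get(key)["country"] == "US"
)
```

with a sql layer on top, the second question would be a one-line select that a business user could write. without it, somebody has to program the loop.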
second, the true business problems:
but before we get on to this notion, i’d like to take a minute and say this: what are the business problems we are always trying to solve?
- data integration – data sets from multiple disparate systems, integrated by some common thread.
- data historization – the storage of the raw data as-it-arrived (for auditability’s sake), stamped with at least one timestamp of arrival.
- information delivery – how to take data and turn it into information from which business (people & sometimes artificial intelligence or statistical algorithms) can make better decisions.
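the first two items in the list above can be sketched in a few lines of python. everything here is an assumption made for illustration: the two source systems, the customer_number business key, and the record_source and load_ts field names. rows are stored exactly as they arrived, stamped with an arrival time, and then grouped by the common thread they share.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Two hypothetical source systems describing overlapping customers.
crm_rows = [{"customer_number": "C-100", "name": "acme corp"}]
billing_rows = [
    {"customer_number": "C-100", "balance": 250.0},
    {"customer_number": "C-200", "balance": 75.5},
]


def historize(rows: List[Dict[str, Any]], source: str,
              load_ts: datetime) -> List[Dict[str, Any]]:
    """Keep each row exactly as it arrived, adding only arrival metadata."""
    return [
        {"record_source": source, "load_ts": load_ts, "raw": dict(row)}
        for row in rows
    ]


def integrate(history: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group historized rows by the business key the systems share."""
    by_key: Dict[str, List[Dict[str, Any]]] = {}
    for entry in history:
        by_key.setdefault(entry["raw"]["customer_number"], []).append(entry)
    return by_key


# A fixed timestamp keeps the example repeatable; a real load would use now().
load_ts = datetime(2015, 1, 1, tzinfo=timezone.utc)
history = historize(crm_rows, "crm", load_ts) + historize(billing_rows, "billing", load_ts)
integrated = integrate(history)
```

note that nothing in the raw payload is changed or cleansed at this stage. that is what keeps the stored history auditable.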
that said, under each one of these major headings – businesses and information technology teams alike are constantly attempting to improve the landscape. that is to say: tools, storage mechanisms, ingestion rates, management principles, analytics capabilities, and so on. i would go on to say that there are, in fact, two additional business problems that we are attempting to solve today:
- information historization – this is a critical function of business going forward, and it is different from data historization. the question for you (the reader) here is: “what do you do with the data that the business user has altered, and then wants to share within the enterprise?”
- information integration – this is different (again) from data integration. once the data has been re-aligned, corrected, massaged, altered, and turned into information – how should it be integrated, where in the organizational hierarchy should it be integrated, and how can it be coalesced for a master data view of the enterprise world?
why is this important?
well, to be quite frank, people offer all kinds of hoopla and hype about nosql as well as bigdata. in all honesty, the following should apply to all business leaders who have an interest in utilizing these labels:
- nosql is just a categorization for platforms that store data sets. it is not a system of business intelligence nor is it a data warehouse.
- big data – oh come on, really? it’s just volumes of data in motion. ok, variety is a driver for volume, but let’s face it… big data is simply more data in motion (volume and velocity combined). one without the other is quite simply lots of data at rest, or lots of tiny transactions in flow.
at this time, i want to segue into a discussion of the new buzzword: internet of things (iot)
in reality, data is just data – and the iot is simply bringing more data to us from devices. to put it quite bluntly, iot is boring… why? because it’s just more devices producing more device logs, which then have to be consumed by what?
oh wait, that’s the top three problems i mentioned above… yes, it’s the same old business problems all over again, only this time, we have to absorb more data… faster. i’ll write more about the iot in upcoming posts. for now, let’s get back to nosql.
underneath the nosql category there are three basic sub-categorizations of data storage and retrieval engines: a) relational & semi-relational, b) hybridized (relational & non-relational), c) completely non-relational.
note: what am i doing in this article? hint… turning unstructured data (words and text) into hopefully meaningful information (concepts and organizations of ideas)…
below the covers of the nosql label
underneath each nosql label are the data storage mechanisms. again, each entry / platform in this space can be divided into the three categories mentioned above. let’s explore the relational nosql category for a minute…
conceptually: relational nosql basically says – there is a relational method for accessing and changing the data stored underneath. for instance: hiveql on top of hdfs and a key value physical storage mechanism is an example of a semi-relational access pattern on top of a nosql store.
what this means is: you (the business user) need to know and understand what data is stored. you need to decide (just like always) what you want to do with the data that you retrieve, and lastly – you are the data mining engine that turns the data into information, whether by categorizing it, aggregating it, cleansing it (performing business rules on it), filtering it, joining it, enriching it and so on.
it becomes information when a business user or pre-programmed business rules perform operations on the data, munging it and changing it into something understandable and usable by business.
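as a small illustration of that last point, the python sketch below applies three of the operations just named (cleansing, filtering, aggregating) to some raw rows. the trade records, the field names, and the rules themselves are invented for this example. they stand in for whatever business rules your organization actually defines.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

# Hypothetical raw rows, stored exactly as they arrived.
raw_trades = [
    {"ticker": " ibm ", "quantity": 100, "price": 150.0},
    {"ticker": "IBM", "quantity": 50, "price": 152.0},
    {"ticker": "msft", "quantity": 0, "price": 45.0},
    {"ticker": "MSFT", "quantity": 200, "price": 46.0},
]


def cleanse(trade: Dict[str, Any]) -> Dict[str, Any]:
    """Business rule: ticker symbols are trimmed and upper-cased."""
    return {**trade, "ticker": trade["ticker"].strip().upper()}


def is_valid(trade: Dict[str, Any]) -> bool:
    """Business rule: a trade with no quantity is not a trade."""
    return trade["quantity"] > 0


def total_value_by_ticker(trades: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate traded value per ticker symbol."""
    totals: Dict[str, float] = defaultdict(float)
    for trade in trades:
        totals[trade["ticker"]] += trade["quantity"] * trade["price"]
    return dict(totals)


information = total_value_by_ticker(
    cleanse(trade) for trade in raw_trades if is_valid(trade)
)
```

the raw rows are untouched. the information is a separate, derived result, which is why it may need its own historization.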
conceptually: non-relational nosql basically says: there is no relational paradigm for accessing or representing the data that is stored within it. for instance, mongodb collections (its “tables”) are like a single worksheet in excel – with thousands of columns and, at the time of writing, no way to join the data sets together. any data that needs to be combined needs to be flattened and replicated across the storage mechanism. a “single query” accesses only a “single table” to produce results.
in some of these particular cases, flat & wide are the only way to go. it’s conceptually similar to a cobol copybook defining a single file store with occurs clauses and repeating groups, without any of the conditional logic for overlays and redefines.
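here is a rough python sketch of what “flattened and replicated” looks like in practice. the customers, the orders, and the field names are hypothetical, and plain dictionaries stand in for documents. no real database driver is used.

```python
from typing import Any, Dict, List

# Hypothetical normalized inputs: one customer, two of its orders.
customers = {"C-100": {"name": "acme corp", "country": "US"}}
orders = [
    {"order_id": "O-1", "customer_number": "C-100", "amount": 40.0},
    {"order_id": "O-2", "customer_number": "C-100", "amount": 60.0},
]


def flatten(orders: List[Dict[str, Any]],
            customers: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy the customer attributes into every order document."""
    documents = []
    for order in orders:
        customer = customers[order["customer_number"]]
        documents.append({
            **order,
            "customer_name": customer["name"],
            "customer_country": customer["country"],
        })
    return documents


order_docs = flatten(orders, customers)

# The "single query" now touches a single flat, wide collection.
us_total = sum(doc["amount"] for doc in order_docs if doc["customer_country"] == "US")
```

the price of this layout is that the customer name and country are now stored once per order, so any change to the customer has to be replicated to every copy.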
conceptually: hybrids offer the best of both worlds through libraries and access points. this means the technology or the platform is deciding what to do with the data you pass in.
what does this all mean to data vaults?
well, remember, this is a high level discussion – to find out more about the details, you need to attend the conference: http://wwdvc.com
anyhow, data vault models are canonically organized, which means they adhere to a common set of principles. the modeling constructs are focused on business keys, but more than that – they are focused on creating a hierarchy of business keys – where the hierarchy can constantly change and be adapted to the business needs, without re-engineering the rest of the solution underneath.
in other words, data vault modeling principles are logically based and founded in conceptual constructs.
ok – too much gibberish. data vault models are close to the lowest level of a concept model that you could construct based on a set of ideas in your business that are inter-related by business keys (key business terms you use to describe the data in your business uniquely). for example: customer account, portfolio number, stock ticker symbol, company name, and so on.
data vault models can be leveraged in nosql platforms easily, as long as they remain at the logical level. the physical data model needs to change depending on the physical storage mechanisms housed within the nosql environment. some nosql platforms will actually change the model to accommodate the physical storage mechanisms under the covers.
it means, even if you are putting your data vault models on nosql platforms, you should still focus your efforts on understanding the business and therefore understanding the business keys, along with organizing business concepts by hierarchies.
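the split between a stable logical model and a changing physical model can be sketched as follows. this is only an illustration under stated assumptions: Hub and Satellite borrow the names of data vault constructs (a hub holds a business key, a satellite holds descriptive data over time), but the three layout functions, the key format, and the sample customer are hypothetical and are not the physical designs of any particular platform.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Hub:
    """Logical construct: a business concept identified by a business key."""
    name: str
    business_key: str


@dataclass
class Satellite:
    """Logical construct: descriptive attributes as of a load time."""
    load_ts: str
    attributes: Dict[str, Any]


def to_relational_rows(hub: Hub, sats: List[Satellite]) -> Tuple[List[dict], List[dict]]:
    """Relational layout: one hub row, one satellite row per change."""
    hub_rows = [{"business_key": hub.business_key}]
    sat_rows = [
        {"business_key": hub.business_key, "load_ts": s.load_ts, **s.attributes}
        for s in sats
    ]
    return hub_rows, sat_rows


def to_document(hub: Hub, sats: List[Satellite]) -> dict:
    """Document layout: one document per business key, history nested inside."""
    return {
        "_id": hub.business_key,
        "history": [{"load_ts": s.load_ts, **s.attributes} for s in sats],
    }


def to_key_value(hub: Hub, sats: List[Satellite]) -> Dict[str, dict]:
    """Key-value layout: one entry per change, keyed by concept, key and time."""
    return {f"{hub.name}:{hub.business_key}:{s.load_ts}": s.attributes for s in sats}


customer = Hub(name="customer", business_key="C-100")
changes = [
    Satellite("2015-01-01", {"name": "acme"}),
    Satellite("2015-02-01", {"name": "acme corp"}),
]

hub_rows, sat_rows = to_relational_rows(customer, changes)
document = to_document(customer, changes)
key_value = to_key_value(customer, changes)
```

in all three layouts the business key C-100 is what ties the data together. only the physical shape around it differs, which is why the effort spent on understanding business keys carries over from platform to platform.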
why should i as a business invest my time in modeling at all?
again, claudia imhoff will address this issue in her keynote address at http://wwdvc.com
my two cents (to wrap up this article) would be as follows:
- you can’t understand what you don’t define. unstructured data is just that – unstructured and “useless” until you ask a question / mine it, and run correlation on it – resulting in structured result sets. you need to look for patterns! define the context of what you want to work with.
- you need to organize the unstructured data / nosql data into contextual hierarchies in order to understand what it is you have. you need to attach it to constructs in business and the business processes, so that the “data” can be turned into information by enriching it with context.
the data vault modeling concepts help you turn data into information while assisting in the solution to the major problems listed at the top of this article. i would encourage you to check out the data vault 2.0 principles and best practices, or at least to read more about the data vault modeling constructs by looking at my book: super charge your data warehouse (available on amazon.com, or here: http://datavaultalliance.com )
finally, key-value stores will change the way physical data is stored, and what your physical model will look like. same with document stores, graph databases, columnar databases, and bigtable implementations. do not discount the value of building a good solid business model based on data vault principles and centered around business key integration.
i hope this helps clear the air a bit with regard to data vault modeling & nosql. as always, please feel free to add your thoughts and comments below.
(c) dan linstedt, 2015 all rights reserved