
Canonical Modeling, Ontologies, #datavault 2.0 and #noSQL

In this entry, I will explore (at a management level) what these components are, why they are important to you, and how they can help you with your NoSQL and big data implementations.

What are these components?

Let's explore each as we go.

Canonical modeling:

A canonical model is a design pattern used to communicate between different data formats. As a form of enterprise application integration (EAI), it is intended to reduce costs and to standardize on agreed data definitions associated with integrating business systems. In other words, a canonical model is any model reduced to the simplest form possible, based on a standard, for use in application integration.

How about a definition specific to canonical data modeling?

A canonical data model is the definition of a standard organizational view of a particular subject, plus the mapping back to each application's view of that same subject. The standard organizational view is traditionally built using simple yet useful structures.


Ontologies:

In computer science and information science, an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse. It is thus a practical application of philosophical ontology, and is typically organized around a taxonomy.

Data Vault 2.0 (data model):

The Data Vault 2.0 model is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. The Data Vault 2.0 data model exchanges sequences for hash keys (as primary keys) in order to increase parallelism and to allow linking to external data stores (NoSQL included).

Why are they important?

We build data warehouses, not information warehouses. The reason is simple: data needs to be auditable, and therefore stored in raw format, organized by historical dates (temporal). While raw data can be important for business decisions (e.g., discovering bad data and outliers), it needs to transition into information to be useful to the business.

What is today's most common platform for ingesting raw data in a historical sense, without regard to format or model? Enter NoSQL… NoSQL platforms (relational, semi-relational, and non-relational) have wonderful ingestion capabilities (semi- and non-relational stores can ingest schema-less data, meaning without modeling). They can accept almost any data in any format, stamp it with a date/time of arrival for historization, and split the data across parallel computing devices to make use of parallelism and high scalability.
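As a minimal sketch of that ingestion pattern (the function and field names here are hypothetical, not any vendor's API): wrap each incoming record in an envelope, stamp it with a load timestamp for historization, and store it without requiring any schema for the payload itself.

```python
from datetime import datetime, timezone

def ingest(record: dict, store: list) -> dict:
    """Wrap an incoming record with an arrival timestamp.
    The payload can be any shape: no schema is required up front."""
    envelope = {
        "load_dts": datetime.now(timezone.utc).isoformat(),  # historization stamp
        "payload": record,                                   # raw data, untouched
    }
    store.append(envelope)
    return envelope

store = []
ingest({"sensor": "a1", "reading": 42}, store)
ingest({"free_text": "any format at all"}, store)
print(len(store))  # 2
```

In a real cluster the `store.append` would be a write distributed across nodes; the point is that nothing about the payload had to be modeled before landing it.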

Before you can "make sense" of any data, you (or your team) need to organize it. That means arranging, categorizing, and classifying it: what is generally known as a data model. The next step is applying business rules or context in order to alter, change, correlate, and condition the data, turning it into information. Which of course means your "target model" needs to be focused on the business.

Why is this important?

Because you cannot use, apply, or even understand data without associating it with, and tying it to, context or meaning. In other words, without applying a model to it, along with additional processing rules. For example, can you tell me what this data means? ff 2a 99 1e 55 42 6b ac? Even if I were to render it as an ASCII string, it still wouldn't make sense. That's because there is no "model," no context through which to view this data.

Now, if I were to say to you: this is the first set of bytes leading to a JPEG (it's not, but humor me)… then you would say: ahhh! Now I understand, it's an image of some sort. If I were to say: it's the first set of bytes of a ZIP file, you would say: ahh! OK, it's a compressed binary file of some sort. All of a sudden, when I apply a model, you have some level of context for understanding the data set.
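This "apply a model to raw bytes" step is exactly how file-type detection works in practice: well-known formats begin with fixed "magic number" prefixes. A small sketch (the magic values for JPEG, ZIP, and PNG are real; everything else here is illustrative):

```python
# Known "magic number" prefixes for a few common file formats.
MAGIC = {
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP archive",
    b"\x89PNG": "PNG image",
}

def identify(data: bytes) -> str:
    """Return a coarse classification for a byte string, or 'unknown'."""
    for prefix, kind in MAGIC.items():
        if data.startswith(prefix):
            return kind
    return "unknown"

print(identify(bytes.fromhex("ffd8ffe000104a46")))  # JPEG image
print(identify(bytes.fromhex("ff2a991e55426bac")))  # unknown: no model fits
```

Note that the article's example bytes come back as "unknown": without a model that matches, the data stays meaningless.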

Now that you know it's an image (for instance), can you tell me what it is an image of? No. Now the real work begins: taking the data and interpreting it, changing it to a format (in this case) consistent with a visual display. Let's say it's an image of a frog. Great, I just gave it context.

Why is canonical modeling important?

Can you tell me what class of vertebrates the frog belongs to? Hmmm… amphibians, perhaps, might be one possibility for the "parent" in the hierarchy. But to be honest, there are different species and sub-species of frogs, and without additional visual identifying characteristics (think: data mining or statistical analysis here), further classification in a canonical model is next to impossible.

OK – a canonical model can help us represent this class of vertebrates inside the context of our business. That's right: it's a move from data to information by applying a canonical model. What do you need to do to the "data" once you've identified it? Yep, in the case of images or video, we generally tag them. Ahh, add metadata? Yep. Definitional context, or metadata tags, so that repeated searches will work and we don't have to "parse" the image again to find out what kind of frog it is.
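To make the tagging idea concrete, here is a tiny sketch of a canonical hierarchy and the metadata it produces (the term names and structure are purely illustrative, not a Data Vault artifact): each term points to its parent, so classifying an item once yields its whole lineage as searchable tags.

```python
# A minimal canonical hierarchy: each term maps to its parent term.
TAXONOMY = {
    "frog": "amphibian",
    "amphibian": "vertebrate",
    "vertebrate": "animal",
}

def lineage(term: str) -> list:
    """Walk the hierarchy from a term up to its root."""
    chain = [term]
    while chain[-1] in TAXONOMY:
        chain.append(TAXONOMY[chain[-1]])
    return chain

# Tag the image once, so repeated searches never re-parse it.
image_metadata = {"file": "img_001.jpg", "tags": lineage("frog")}
print(image_metadata["tags"])  # ['frog', 'amphibian', 'vertebrate', 'animal']
```

A search for "vertebrate" now finds the frog image via its tags, with no image processing required.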

Wait a minute, how do we identify this "thing"?

Most of the time, the next step is uniquely identifying the "file" or set of information with a meaningful business key. The business key "tag," as long as it's unique, will allow us to track the data as it passes through the business processes, associating value and enrichment with the data and information as the business performs "work."

Why are ontologies important?

Ontologies provide us (in IT) with a method for formally identifying metadata. OK – in English, please… We can specify a Word-document outline consisting of key business terms, their parents, peers, and relationships/associations, resulting in categorization and hierarchies of the data that we need to deal with or visualize for the business.

An ontology is a working map of the data landscape, providing context to the developers, designers, and business users on how the data fits into the business. You may have missed this earlier point, so I will repeat myself:

You must classify data in order to turn it into information and make it usable by the business. This means that in order to get value from your NoSQL stores, you must take the time to build some sort of canonical model or ontology, so that you can build requirements (business process expectations) for turning data into information through business rule processing. (Think analytics here.)

Why is Data Vault 2.0 important?

Data Vault 2.0 (as described above) is a form of canonical modeling technique. The Data Vault methodology provides the IT rules and procedures for building a hierarchical model that makes sense to the business, follows business terms, and establishes a working ontology (think: hierarchy of business terms). It provides the formal specification for building a map to all that "unstructured" data sitting in your NoSQL platform, so that you and every other business user can begin extracting value from these platforms.

For business definition purposes: the hash keys are a technical solution that allows us in IT to tie relational (traditional RDBMS) systems to NoSQL systems easily, without re-engineering and without re-building the entire warehouse. That saves you time and money!
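A brief sketch of why hash keys enable this cross-platform linking: deriving the key deterministically from the business key (Data Vault 2.0 commonly hashes the concatenated business key; the delimiter and normalization rules shown here are illustrative choices) means both the RDBMS and the NoSQL store compute the same key independently, with no shared sequence generator required.

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic hash key from a business key.
    Normalization (trim + uppercase) keeps minor input
    variations from producing different keys."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key yields the same hash key on every platform,
# so records can be joined across RDBMS and NoSQL without re-engineering.
print(hash_key("ACME-0042") == hash_key("acme-0042 "))  # True
```

Because the key is computed rather than assigned, loads into separate platforms can also run fully in parallel, which is the parallelism benefit mentioned in the model definition above.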

How they can help you with your NoSQL and big data environments…

To summarize… if you have, want, or need a NoSQL environment, then you need to investigate, use, and apply canonical/conceptual modeling techniques to turn data into information. Data Vault 2.0 modeling is a formalized process that will help you get there quickly. The Data Vault 2.0 methodology provides the standards, best practices, and automation recommendations that can allow you to seamlessly integrate NoSQL platforms with your existing relational DBMS investments. The methodology also provides IT with the rules and procedures for building and managing the data-to-information processes.

It's not enough to simply have a NoSQL environment and continue to throw data at it as it arrives (while helpful for audit purposes, it has zero business value until you can classify the data inside). Yes, data mining (a.k.a. deep analytics) and statistical engines will assist you in the discovery process, but to truly take advantage of data as an asset on the books, you will need to classify it, categorize it, and assign it context.

Using the Data Vault 2.0 modeling techniques, you can focus on the business context via metadata tagging and business keys, which fit easily into the hierarchical format (ontology model).

If you are interested in digging into this further, you can attend my conference: – where I will present much of this information (along with others, including Claudia Imhoff). Or you can contact me directly (on this site), and we can arrange on-site training in Data Vault 2.0 boot camp & private certification.

There is also on-line training available at:

(You can always leave me a comment.)

(c) Dan Linstedt, 2015. All rights reserved. May not be duplicated, copied, or re-posted in any form without express written consent from Dan Linstedt.


2 Responses to “Canonical Modeling, Ontologies, #datavault 2.0 and #noSQL”

  1. Siva Janamanchi 2015/04/03 at 12:34 pm #

    Hi Dan,
    Great to see your recent posts on DV2.0 and NoSQL. Though I am new to DV, from what I have been reading so far, on DV2, I want to understand

    a. Some examples or case studies to understand how DV2.0 has helped enterprises extend the EDW by integrating NoSQL and Hadoop data sets (DW2.0?). I mean, what aspects of NoSQL and Hadoop data sets (or result sets?) have been made part of DV models? This is very important for articulating the value proposition of DV2.0 with (such) Big Data Analytics platforms.

    b. I see some DV experts, like Roelant Vos, strongly advocating and exploring the case for virtualizing the DV and Data Mart layers. Many 'Big Data Analytics' vendors today have tools that let business users/BI tools seamlessly (?) access data from any and/or across RDBMS / Hadoop / NoSQL systems – like Teradata's 'Query Grid' or Oracle's 'Big Data SQL' – just with the help of SQL. This is like providing a 'Data Virtualization' layer (through views).

    Does this mean the DV2.0 model can be a virtualized EDW layer? And can it also be used to quickly build agile and resilient sandpit data models for different user groups?

    I know you will be talking about this at WWDVC, but I cannot make it, and I am in urgent need of some direction and guidance on this.

    Please comment and provide details from real cases, if any.

    Thanks and Regards
    Siva Janamanchi

  2. Dan Linstedt 2015/04/17 at 2:49 pm #

    Hi Siva,

    Thank you for your interest in DV2. I am pleased to hear from you about these things. I will be publishing a lot more about DV2 soon, in July 2015, as my new book is released (which will be the standards reference & documentation to use for DV2).

    Regarding NoSQL & Hadoop – there is more information coming, but at the moment I am pressed for time and simply don't have the time to publish answers to the questions you are asking. Again, I expect to be publishing sometime this summer, probably in my subscription layers (which will be new) here on the site.

    Regarding Roelant Vos, you will have to contact him directly. But in the discussions I've had with him, we both agree that virtualizing your EDW layer only seems to work when the data set is relatively small, OR when in-memory technology can be used for all the data in the warehouse or staging areas. The DV2 model can be virtualized; Roelant will be discussing this at WWDVC. I am sorry to hear you cannot make it. I will be recording *some* of the presentations (those that I am allowed to) and selling the presentations on-line at

    Again, real case studies are coming, but in order to gain a full understanding, I highly recommend you contact my partner, Sanjay Pande – he now lives in Mysore, India, and can assist in teaching classes in that region.

    Hope this helps,
    Dan Linstedt
