Column Based Data Vaults #datavault

column based data stores have been around for a long time.  this blog will talk about what you need to do to make the data vault successful on a column based data store.  it’s quite simple really, and in the end – some parts of physical data modeling don‘t matter in a column based data store.

if you want a full on-line lesson on this and other topics, you will soon be able to take the class on – sign up today to be the first to be notified wen my on-line classes go live.

what is a column based data store?

a “database engine” which is vertically partitioned.  in other words, each column is it’s own table (for a simple definition anyhow).  you can think of this as “close to” or “relative to” 6th normal form.  the internal surrogate key attached to the column is used for join purposes, and is generally not seen / not viewed by the end user.

is a data model necessary in a column based data store?

yes, a logical data model is always necessary – so that you can understand how your data is managed, stored, and tied to the business.  we (humans) require classification systems (ontologies and hierarchies) to understand data sets – this is where data modeling (domain modeling) comes from.  that said: physical data modeling for a column based data store does not matter other than the data type and null/not null specification (perhaps also a default value and / or a range constraint).  what i’m referring to that doesn’t matter – is the structure and the foreign keys.

whether a column is or is not in a particular table, and whether or not it has foreign keys, makes no difference to the manner in which the column based data store stores it’s information.

ok – yes, foreign keys are important (implementation wise – the column based data store does enforce them), but a column based data store automatically vertically divides each table in to it’s constituent parts.

what matters in a physical data model and a column based data store?

the following items matter:

  • data type, length, precision, scale
  • null or not null specification
  • default values
  • constraints
  • function based columns
  • foreign keys

what’s this got to do with data vault?

the point is this:  whether you use data vault, 3rd normal form, 1st normal form, 6th normal form, anchor modeling, key-value pair modeling, or dimensional modeling, ultimately the only thing that matters in the column based data store is how you logically manage your data sets.  the data modeling method you choose literally, doesn’t matter beyond the benefits of flexibility, scalability, maintainability at the logical level.

if the data vault model suits your purpose for managing data, then use it.  remember this: with a column based data store underneath, what matters most in the project is the methodology chosen for implementation.  the data vault methodology provides you with a set of guidelines for standards, re-work, scalability, measurement, quality, project planning/tracking/oversight, and automation that you simply won’t get any where else.

in the end…

the two biggest problems that customers of mine have (why the “move off column based data stores in the future”) are as follows:

  • lack of understanding / proper organization of data – people over time have “just added” another column to the database, without taking the time to understand what the column represents, why it’s there, how it’s loaded, and if it’s a duplicate.  this leads to a data junkyard – tables with 1000 columns or more exist, with 150 or more duplicate columns, or slightly different columns where the business doesn’t understand it’s data asset anymore (column based databases lead to a laziness for managing the data assets properly).
  • scalability – when the column based databases have reached 80 terabytes in a single node, they tend to break down.  we’re not talking 80 tb raw here, but 80 tb max capacity, compressed data and everything else.  the customers give up, and have to move back to a traditional rdbms in order to scale.

so, if your customer can “mitigate” the data understanding by rigorously following data modeling standards, and they won’t reach 80 tb (about 300 to 350tb raw) data, then you can choose to use a column based data appliance.

but regardless of the technology, the data vault model and methodology still stand, and still provide benefits to the overall project because of the standardization, automation, and lower maintenance costs that it brings to the table.

what are your thoughts?  please reply below.
dan linstedt


Tags: , , , , , ,

No comments yet.

Leave a Reply