#datavault #datawarehouse 6th normal form & DV

A few days ago I received some questions about Data Vault models, normalization, 6th normal form and so on.  In this post I will share with you the questions I received, as well as the answers I think apply to these questions.  If you have additional answers, please feel free to add a comment at the bottom of the post.

The questions I received:

A couple of weeks ago, a data warehousing expert said to me words to the effect of “Data Vault needs less storage because it’s hyper-normalised, it’s sixth normal form, and so needs less space than a third normal form warehouse”.
This has been niggling at me ever since, because it feels like gobbledegook but has a grain of plausibility and I don’t really understand what sixth normal form is.
In practical terms it’s irrelevant – it’s relatively easy to test storage requirements, IO bandwidth and CPU load by testing on real data, and I’ve done so.

There are two questions arising:

(a) Is sixth normal form smaller than third normal form?

(b) Is Data Vault sixth normal form?

There is a quick answer to question (a). Since all sixth normal form relations are also third normal form relations [Date 2003, quoting Date 2002] then it is clearly not true. Perhaps a better question is:?

(c) Is sixth normal form smaller than third normal form where the third normal form relation is not sixth normal form?

We can examine using the example given by Tim Warnock on his blog [Warnock 2011]:
Users
UserName
Department
Status
John Smith [10]
Sales and Marketing [19]
Active [6]
Paula Jones [11]
Shipping [8]
Disabled [8]

In this fifth normal form version we’ve used 62 characters.
The sixth normal form version:
Users
UserName
Status
John Smith [10]
Active [6]
Paula Jones [11]
Disabled [8]
Dept_map
UserName
Department
John Smith [10]
Sales and Marketing [19]
Paula Jones [11]
Shipping [8]

uses 83 characters. More generally, for this example, we use LEN(username) + LEN(Department) + LEN(Status) in the fifth normal form version and 2 * LEN(username) + LEN(Department) + LEN(Status) in the sixth normal form version. So, in this example, sixth normal form will always be larger (provided LEN(username) > 0, which it must be to be useful) regardless of encoding.

To answer the second question, is Data Vault sixth-normal form, we need to look at the modelling rules in Data Vault to see if it requires any particular design.

The way Data Vault would ensure no join-dependency inside a satellite would be to split the satellite into two, or to model additional hubs and associated satellites.

Hubs in Data Vault are [Linstedt 2011 p.58] “a business key recording device” and business keys are [p.59] “something that the business uses to track, locate and identify information”. That’s not a particularly helpful definition, because there are many things a business uses to identify information, for example time, and Linstedt’s examples do not create hubs for all of these.

Turning to satellites, Linstedt [p.112] says “How normalised a Satellite gets is a function of the design and a choice made by the designer”. It would appear the answer to whether Data Vault is sixth normal form is – it depends.
References
Date, C (2003) http://www.dbdebunk.com/page/page/621935.htm
Date, C (2002) C.J. Date, Hugh Darwen, and Nikos A. Lorentzos, Temporal Data and The Relational Model
Linstedt (2011), Super Charge Your Data Warehouse
Warnock, T (2011) Normalisation, 6NF http://www.anattatechnologies.com/q/2011/07/normalization-6nf/ accessed 26/5/2012

My thoughts:

for starters: 6th normal form is basically a “columnar based data store” – that’s an over-simplification, but it translates to mean: there are zero repeating data values in the entire model.

Is the Data Vault 6th normal form?

No.  The Data Vault if implemented in it’s physical form factor on a traditional table based database system (RDBMS), without column compression turned on, then no – the Data Vault contains repeating values, therefore breaks the rules of 6th normal form.  Especially due to the historical nature of data in the Satellites.

now, the whole thing is this:

  1. 1. if you run on a columnar database, then in effect, the data model and it’s normalization form don’t matter one bit.  The data is stored in 6th normal form physically with no repeating values – so therefore the argument or statement made for this case is false
  2. if you run on a traditional RDBMS (table based store), AND you turn on column compression for all tables, for all columns, then, in effect, the normalization of the data model once again, doesn’t matter.  Therefore in this scenario, the argument or statement made would again, be false.
  3. If you run on a key-value pair data store (like Hadoop, or vertica, or paraccel or green plum, then the data, in effect stores NO repeating values.  the Model becomes Logical in nature, so therefore in this scenario, the argument or statement made would again, be false.
  4. All of that said: IF you run a star schema data model on a relational database (RDBMS) without columnar compression turned on, AND you are using very wide type 2 dimensions, then it would take the same or more space than the Data Vault model with the same data.
    Why?
    because the star schema type 2 dimensions typically have greater numbers of repeating values held in a single dimension – especially IF the dimension contains hierarchies.
    NOW:
    if you put EITHER model (Data Vault OR star schema) on one of these columnar, NoSQL, or other appliances – then most likely they BOTH receive the benefits of “6th normal form” through compression of the data in the tables, without being 6th normal form models.

So, is the Data Vault 6th normal form?

No.  Not in its native state.  In order for the Data Vault to be 6th Normal Form, you would have to 1) remove all duplicate values from the columns – which means removing load dates, load-end-dates, record sources.  2) split every column in satellites in to their own table.

At that point, the Data Vault begins to look similar to an Anchor model.

Just remember this: any “data over time” with history, without some form of de-duplication (be it column compression, or storing single values) means that the model cannot reach 6th normal form.

Now, is a normalized model “smaller” than a denormalized model in terms of space?  Not necessarily.  again, it depends on the physical storage underneath, and if it uses repeating values, or houses data over time.

Once we get in to compression & columnar approaches, then data models become logical – eliminating the need to try to answer this question in the first place.

These are just my thoughts, hope this helps,

Dan Linstedt

Tags: , , , , , , , , ,

No comments yet.

Leave a Reply

*