i’ve heard a lot of rumblings in the market recently about “hyper-normalization”, or at least that’s what some are calling it. i’ve also heard that classes are offered at tdwi on hyper-normalization techniques, which are really just normalized data modeling based on data vault. in this entry, i will discuss my viewpoints on hyper-normalization.
some in the industry are using a term called “hyper-normalization”. before going any further, i went looking for a good definition of what it truly means. unfortunately, none exists. so i asked my friend cj date, and he said: “no such thing exists.” in my opinion, if the man who helped establish normalized formats for data modeling says no such thing exists, then i tend to believe him.
i agree with cj date: i don’t believe that there is any such thing as “hyper-normalization”.
to understand where i am coming from, let’s examine a definition of the prefix “hyper”: in common dictionary usage it means over, beyond, or excessive.
now, the term hyper-normalization has been used to describe data vault modeling. why? data vault modeling really sits between 3rd normal form and 4th normal form. if i were to correctly apply the label “hyper” to data vault modeling, i would be saying that data vault modeling goes beyond 6th normal form, and it clearly doesn’t come close to that level of abstraction or exaggeration.
so, is the data vault “hyper-normalization”? no. definitely not. are forms of the data vault model “hyper-normalized”? no, definitely not.
let’s go back to the definitions of normalized data modeling formats:
- first normal form (1nf) is a property of a relation in a relational database. a relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.
- second normal form (2nf) is a normal form used in database normalization. 2nf was originally defined by e.f. codd in 1971. a table that is in first normal form (1nf) must meet an additional criterion to qualify for second normal form: no non-prime attribute may depend on any proper subset of any candidate key.
- the third normal form (3nf) is a normal form used in database normalization. 3nf was originally defined by e.f. codd in 1971. codd’s definition states that a table is in 3nf if and only if both of the following conditions hold: the relation r (table) is in second normal form (2nf), and every non-prime attribute of r is non-transitively dependent on every superkey of r.
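- fourth normal form (4nf) is a normal form used in database normalization. introduced by ronald fagin in 1977, 4nf is the next level of normalization after boyce–codd normal form (bcnf). where 2nf, 3nf, and bcnf are concerned with functional dependencies, 4nf is concerned with a more general type of dependency known as a multivalued dependency.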
- fifth normal form (5nf), also known as project-join normal form (pj/nf), is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships.
- a relvar r [table] is in sixth normal form (abbreviated 6nf) if and only if it satisfies no nontrivial join dependencies at all, where, as before, a join dependency is trivial if and only if at least one of the projections (possibly u_projections) involved is taken over the set of all attributes of the relvar [table] concerned. [date et al.] sixth normal form is intended to decompose relation variables to irreducible components.
** all definitions provided by wikipedia **
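to make the 2nf and 3nf definitions a little more concrete, here is a minimal python sketch that tests whether a functional dependency holds in a small sample of rows. the table, the column names, and the helper `fd_holds` are all hypothetical, made up for illustration. note that a dependency holding in sample data only suggests it holds in the model; it does not prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # the same left-hand side must always map to the same right-hand side
        if seen.setdefault(key, value) != value:
            return False
    return True


# hypothetical sample table: order_id is the key
orders = [
    {"order_id": 1, "customer_id": "c1", "customer_city": "denver"},
    {"order_id": 2, "customer_id": "c1", "customer_city": "denver"},
    {"order_id": 3, "customer_id": "c2", "customer_city": "boston"},
]

# the key determines every other column
key_to_city = fd_holds(orders, ["order_id"], ["customer_city"])        # True
# but customer_city also depends on the non-key column customer_id
customer_to_city = fd_holds(orders, ["customer_id"], ["customer_city"])  # True
# the reverse does not hold: denver maps to two different orders
city_to_order = fd_holds(orders, ["customer_city"], ["order_id"])      # False
```

here customer_city depends on the key only through customer_id, which is the kind of transitive dependency that 3nf rules out. moving customer_city to its own customer table would remove it.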
now, the hub entity:
if we remove all the system fields (sequence, load date, record source, last seen date), then the hub is just a unique list of business keys, which means it qualifies as 6nf (sixth normal form): irreducible components.
the link entity:
if we remove all the system fields (sequence, load date, record source, last seen date), then the link is just a unique list of relationships, which means it qualifies as 5nf (fifth normal form): multi-valued facts, isolated as semantically related multiple relationships.
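as a rough illustration of the hub and link descriptions, the python sketch below strips the system fields from a few sample rows and keeps what is left. the snake_case field names, the sample rows, and the helper `strip_system_fields` are assumptions made for this example. a real link would normally reference its hubs by their keys; business keys are used here only to keep the output readable.

```python
# system fields named in the text; the snake_case spellings are assumptions
SYSTEM_FIELDS = {"sequence", "load_date", "record_source", "last_seen_date"}


def strip_system_fields(rows):
    """Drop system fields and return the distinct remaining rows, sorted."""
    distinct = {
        tuple(sorted((k, v) for k, v in row.items() if k not in SYSTEM_FIELDS))
        for row in rows
    }
    return [dict(items) for items in sorted(distinct)]


# hypothetical hub rows: one business key plus system fields
hub_customer = [
    {"sequence": 1, "customer_number": "c-100",
     "load_date": "2011-01-01", "record_source": "crm"},
    {"sequence": 2, "customer_number": "c-200",
     "load_date": "2011-01-02", "record_source": "billing"},
]

# hypothetical link rows: one relationship per row plus system fields
link_customer_order = [
    {"sequence": 1, "customer_number": "c-100", "order_number": "o-1",
     "load_date": "2011-01-03", "record_source": "orders"},
    {"sequence": 2, "customer_number": "c-100", "order_number": "o-2",
     "load_date": "2011-01-04", "record_source": "orders"},
]

hub_keys = strip_system_fields(hub_customer)
link_pairs = strip_system_fields(link_customer_order)

print(hub_keys)    # a unique list of business keys
print(link_pairs)  # a unique list of relationships
```

what remains of the hub is a single column of business keys, and what remains of the link is only the combination of keys that forms the relationship.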
the satellite entity:
we cannot remove the system fields and still maintain data over time; hence, the satellite is descriptive data over time, which means it is 3nf (third normal form), where every non-prime attribute of r is non-transitively dependent on every superkey of r.
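a small python sketch of why the satellite needs its load date: with history in the table, the parent key alone no longer identifies a row. the column names, the sample rows, and the helper `is_unique_key` are hypothetical, chosen only for illustration.

```python
# hypothetical satellite rows: customer c-100 changes city over time
sat_customer = [
    {"customer_number": "c-100", "load_date": "2011-01-01",
     "name": "acme", "city": "denver"},
    {"customer_number": "c-100", "load_date": "2011-03-15",
     "name": "acme", "city": "boston"},
    {"customer_number": "c-200", "load_date": "2011-01-02",
     "name": "globex", "city": "austin"},
]


def is_unique_key(rows, columns):
    """Return True if the given columns identify each row uniquely."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


# parent key plus load date identifies each version of the descriptive data
with_load_date = is_unique_key(sat_customer, ["customer_number", "load_date"])  # True
# the parent key alone does not, because c-100 has two versions
without_load_date = is_unique_key(sat_customer, ["customer_number"])            # False
```

in this sketch the load date is part of the key, so removing it would leave no way to tell the two versions of c-100 apart.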
so, where does that leave us? if we add the system fields back to the hub, the hub then moves to be 3nf. if we add the system fields back to the link, then the link moves to be 3nf.
none of these data modeling components are “hyper-normalized”. sorry, but that term simply cannot be expressed mathematically, and quite frankly it makes no sense mathematically.
the next question is: what exactly is hyper-generalization? well, i disagree with that term as well.
hyper-generalization (as defined by some in the industry) is the notion of “super-super type modeling”. again, i think this is just a marketing term that should be thrown away. here i tend to agree with another well-known friend of mine, david c hay, who states that the proper term is “universal data model”, which really is the “thing” model.
therefore, in conclusion:
1) hyper-normalization? no such thing exists.
2) hyper-generalization? no such thing exists.
“industry conferences” are perpetuating incorrect instruction in the data modeling space by allowing presentations that continue to reinforce or suggest the existence of these terms within the data warehousing industry.
furthermore, what some individuals are calling “hyper-normalized data warehousing” is in fact just “data vault patterns” based on standard normalized formats.
i’d be curious to know what you think.
hope this helps to clear the air,