welcome back to part 3 in the series. in this entry i will begin comparing and contrasting star schema modeling with data vault modeling – from a data warehouse perspective, not from the perspective of an end-user-accessible data mart model. i've said it before, and i'll say it again: star schemas, cubes, and flat-wide tables are the best for delivering data directly to end-users. the data vault is the best for your data warehouse environment.
let me back up…
first and foremost, i separate the data warehouse from the data marts. my definition of data warehouse follows bill inmon's: reliable, time-variant, consistent, integrated, etc… where i differ from the industry is this: i believe a data warehouse should have the following additional characteristics:
- easy to architect
- easy to build
- easy to extend (flexible)
- easy to scale (in number of tables, in data size, in loading patterns, and in querying patterns)
- easy to load – simple, pattern based, repeatable loading processes
- non-queryable by end-users
so where are the differences between dr. kimball and you?
well hey, something's got to give somewhere, and today it is near impossible to build a model that is both easy for end-users to query and has all the other characteristics a warehouse needs! there are two fundamental differences between my definition of "data warehouse" and dr. kimball's:
- dr. kimball believes that the data warehouse is a logical collection which includes the staging area and a multitude of star schemas or data marts (federated or not).
i believe the data warehouse is a physical element that stands on its own, has its own architecture, and has its own data model between the staging area and the data marts.
- dr. kimball believes that the data warehouse should contain consistent, quality cleansed, and business aligned data. i believe that is not the job of the data warehouse, but rather the job of the data marts.
i believe that the data warehouse is a raw system of record that is integrated by business key, but that is all.
so why the split? doesn’t it introduce yet another layer of storage (yals)?
some might see it this way, and physically speaking, yes, of course it does. most people automatically jump to the wrong conclusions at this point. unfortunately, because of these false beliefs and incorrect assumptions, the data warehouse never "grows up", never makes it to the "standardization" area, never reaches scalable patterns, and so on. why? because people insist on banging their heads against a brick wall!
- they know that a certain level of star schemas (federated) won’t grow or scale beyond a certain point.
- they understand that adding true real-time feeds alongside the batch feeds causes problems with star schemas (empty dimension records, updated dimension data, missed key structures, mismatched key structures, etc…)
- they feel the pain of trying to change a non-standard process 3 months into production
- they've been in a position where they have been outsourced
- they see the business users create their own solutions
and yet, even with all this (and more), they still insist on creating federated star schemas as their data warehouse. they swear that it's easier, cheaper, and faster! then, 6 or 9 months into production, they smash headlong into the brick walls above.
- they can no longer maintain the growth of the data warehouse
- they can no longer make changes to already too complex loading routines
- they can no longer add new systems, or new data to the already too wide dimensions
- they can no longer be agile enough or responsive enough to the business needs for new information
- they can no longer grow the system because of query performance problems
with all these problems, why do they keep insisting that star schemas make a good data warehouse?
i really can't answer that question, because i'm not sure why. but i will say this: if you still haven't tried the data vault model, then you truly are missing out. i would encourage you to build 2 hubs, 1 link, and 3 satellites (at a minimum), and give it a shot! what have you got to lose? nothing. only the world to gain, and if you do it right, it should bring you ease of use, ease of scalability, flexibility, and rapid build-out of data marts along the way. if you end up not liking it after 6 months, then simply drop the tables – no harm done!
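if you want something concrete to start from, here is a minimal sketch of that starter vault (2 hubs, 1 link, 3 satellites) built in sqlite via python. every table and column name here is illustrative, not prescriptive – your business keys and attributes will differ:

```python
import sqlite3

# Minimal data vault starter: 2 hubs, 1 link, 3 satellites.
# All names below are illustrative examples, not a prescribed model.
ddl = """
-- hubs: one row per unique business key, plus load metadata -- nothing else
CREATE TABLE hub_customer (
    customer_hk   INTEGER PRIMARY KEY,   -- surrogate hub key
    customer_bk   TEXT NOT NULL UNIQUE,  -- business key from the source
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hk    INTEGER PRIMARY KEY,
    product_bk    TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- link: records the relationship between business keys, and only that
CREATE TABLE link_customer_product (
    link_hk       INTEGER PRIMARY KEY,
    customer_hk   INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk    INTEGER NOT NULL REFERENCES hub_product(product_hk),
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_hk, product_hk)
);

-- satellites: descriptive attributes, historized by load date
CREATE TABLE sat_customer (
    customer_hk   INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    load_dts      TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
CREATE TABLE sat_product (
    product_hk    INTEGER NOT NULL REFERENCES hub_product(product_hk),
    load_dts      TEXT NOT NULL,
    description   TEXT,
    list_price    REAL,
    record_source TEXT NOT NULL,
    PRIMARY KEY (product_hk, load_dts)
);
CREATE TABLE sat_customer_product (
    link_hk       INTEGER NOT NULL REFERENCES link_customer_product(link_hk),
    load_dts      TEXT NOT NULL,
    quantity      INTEGER,
    record_source TEXT NOT NULL,
    PRIMARY KEY (link_hk, load_dts)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

notice the pattern: every hub looks like every other hub, every satellite like every other satellite – which is exactly why the loading processes stay simple and repeatable.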
of course, if you choose not to go the data vault route, i'd like to hear why you chose differently – what caused you pain or problems?
all that said, i still don't understand why i should have yals (yet another layer of storage)?
fair enough. let me ask you this:
in life, when you're faced with a huge problem to solve, what's the first thing you do in order to begin constructing a solution? if you're like most people, you said: i break it down into bite-sized, manageable chunks, so i can solve small pieces of the puzzle – in order to solve the entire problem.
why, then, do you insist on solving the huge problem of data warehousing in one step? generally, everyone i've met who builds the data warehouse according to the old school of thought likes to solve all of the following problems, all in a single etl loading routine, before the data even hits the staging area:
- source system timing and availability (logging on to multiple systems at once to fetch all the data in one read cycle)
- cross-system dependency and sourcing (filters, joins, aggregates)
- business rules – applied mid-stream to change, fix, and align the data before landing it in stage
- quality cleansing and merging – mid-stream, before landing it in stage
- fast load to staging areas
- parallelism & partitioning for speed
- indexing – which, oftentimes, you can't change on the source systems!
- restartability – the ability to “have the load pick up where it left off” simply by restarting it
- scalability – the ability to have the load run with hundreds of millions of rows, and do it fast
- scalability – the ability to “add another system” in to the existing process with little to no impact
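by way of contrast, the "bite-sized chunks" idea above can be sketched as a handful of small, independent, restartable steps: land the raw data first with no rules applied, integrate by business key second, and push the business rules downstream toward the marts. this python sketch is purely hypothetical – the function names, the source rows, and the single cleansing rule are made up for illustration:

```python
from datetime import datetime, timezone

# Hypothetical staged pipeline: each step solves ONE problem and can be
# rerun independently, instead of one ETL job trying to do everything.

def land_to_stage(source_rows):
    """Step 1: fast, rule-free landing. No joins, no cleansing --
    just copy the source rows and stamp load metadata."""
    now = datetime.now(timezone.utc).isoformat()
    return [dict(row, load_dts=now, record_source="crm") for row in source_rows]

def load_hub(stage_rows, hub, key_field):
    """Step 2: integrate by business key only -- insert keys not yet seen.
    The same pattern repeats for every hub, so it is trivially repeatable."""
    for row in stage_rows:
        hub.setdefault(row[key_field], {"load_dts": row["load_dts"],
                                        "record_source": row["record_source"]})
    return hub

def apply_business_rules(stage_rows):
    """Step 3: cleansing and alignment happen downstream, on the way to
    the marts -- never before the raw data is safely landed."""
    return [dict(row, name=row["name"].strip().title()) for row in stage_rows]

# made-up source data for the demonstration
source = [{"customer_bk": "C-001", "name": " alice smith "},
          {"customer_bk": "C-002", "name": "BOB JONES"}]

staged = land_to_stage(source)                    # step 1: land raw
hub_customer = load_hub(staged, {}, "customer_bk")  # step 2: integrate by key
cleansed = apply_business_rules(staged)           # step 3: rules, downstream
print(sorted(hub_customer), [r["name"] for r in cleansed])
```

the point of the sketch is the separation: a failure in the cleansing step never forces you to re-extract from the source, because the raw data is already landed and keyed.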
why? why? why do people insist on solving all of these problems – from data, to architecture, to systems design – in a single step?
the end result is: massive complexity. ok, maybe not with the first star you build, but when you add a second source system, a third, a fourth, a fifth – or you begin adding 10 to 15 more conformed dimensions and a few more source systems – this is when it hits you like a ton of bricks!
this is when your original estimates of 3 months turn into 6- and 9-month turnaround times for the business. this is when your system breaks every time you touch a loading process upstream, and every time you change the data model there is a huge impact requiring major re-engineering. this is when the business finally gives up and says: enough, i've had enough, you cost too much, and you take too long. i will now outsource the data warehouse, or i will shut it down and restart it from scratch, or i will build it myself…
you’ve all been there, i know! if you’re not there, and you’ve experienced successes with the star schema in terabyte environments, or with a lot of source systems, or with real-time, then i’d love to hear from you. please enlighten me as to what you’ve done that works, so that i may learn how you succeeded.
ok, back to the data vault… in the next entry, i’ll cover what it brings to the table to help you solve these problems, and break the cycle!
ps: inside the coaching area, i teach you how to solve these problems, step-by-step.