i’m often asked about the data vault, and the staging area – when to use it, why to use it, how to use it – and what the best practices are around using it. those of you who’ve been through my training, understand that there is a lot of ground to cover, and i cover all of this (and more) with examples, inside my one-on-one coaching area. that said, i will answer some of the above questions here in this brief post. this post is focused on batch processing and micro batch processing. this post does not answer the real-time feed questions.
what is a staging area?
a staging area is used in batch load situations. a staging area (typically) is a location on the same server as the data warehouse in order to eliminate network traffic between the etl from the staging area and the data warehouse. notice that i didn’t restrict the definition of staging area to database only… this is absolutely critical. sometimes staging areas are also called landing zones for flat files, xml files, cobol files and the like.
what is in the architecture of the staging area?
the architecture is: independent tables / files that “arrive” when the data is ready on the source. the whole point to staging batch data is: data ready –> logon to source –> get data in parallel process as fast as possible –> put data in staging area as fast as possible –> logout of source. by having them be independent, (no foreign keys, no referential integrity, no lookups, no matches, no checks of any kind) our it team can be agile and fast when new systems are due to be added to the edw, our processes are optimized to scale at high speed (whether or not we have big data to deal with). truncate and re-load makes the processes completely restartable.
what are the reasons for using a staging area?
- scalability – able to partition the structures as needed, able to add new parallel processes as needed
- flexibility – able to absorb new feeds, new columns, new systems quickly
- dynamic – able to load *optional* feeds that have undetermined arrival dates/times (inconsistent feeds, or on-demand feeds)
- restartable – able to restart the part of the load that failed, after the problem is fixed – without affecting all the other feeds
- schedule – able to schedule the stage load whenever the data on the source is ready, this removes serious timing depenendancies across the system (which hinder scalability and flexibility)
- backup and restore – after the entire staging area is “loaded” for a load cycle, i like to back it up, zip it, timestamp it, and compress it – then check it in to a version control system. this provides me with auditability along the way, as well as full restoration capabilities.
why split the staging area from the data vault?
the data vault is meant to have data warehousing abilities. the structures contain indicies, primary and foreign keys. the data being loaded to the data vault generally has a specific order (in batch only) – in order to tell the story of which system is actually delivering the data at what date/time. when you add these structural components to the data model, you push “processing sequences” up-stream. if you were to use a data vault (empty or truncated every load cycle) as the staging area, you would lose all the performance benefits, and flexibility benefits, and scheduling benefits from the list above.
restartability, backup and restore are still built in to the data vault. don’t get me wrong…. performance benefits are in the data vault model too – they just are viewed slightly differently.
using a data vault model as a staging area causes you to have to “worry” about availability and timing of the data set from the multiple source systems **** always try to think in terms of multiple source systems *** data warehousing folks sometimes make the mistake of thinking of only one source system when architecting their solutions.
what about etl performance and ease of load?
well, i’ve blogged on this many many times, in many places around the web. i’m currently building best-practices, explaining why, how, and the mathematics behind it all – inside the coaching area. but i’ll give you some information here that might make sense.
despite what you might think (ie: it’s easier to go from source straight to data vault…) – this is not true. especially in systems of scale or size. source to data vault means you push all the multi-system dependencies and availability of data issues *up-stream* back to the base loading cycle. the problem is: you have one source feed ready at 10pm the previous night, and another source feed ready at 4am the following morning. well, if they both load the satellite, or link, or hub – they have to be synchronized. especially if the 4am feed is the master system. now, you end up “waiting” to run the 10pm load cycle until after the 4am (next morning) feed is done.
well, what happens if the source system for the first feed is only available from 10pm to 12am, and then the window closes?
cross-feed dependencies (based on timing/availability) are one of the major reasons why edw/bi projects become in-flexible and non-agile as time goes on. these cross-feed dependencies (as described) are based on timing and availability of the data. which the system admins of those sources are constantly wanting to change the schedules around – re-prioritize the jobs. this by itself, usually leads to huge re-engineering efforts, super high costs for maintenance, and the beginning of the downfall of the edw/bi system!! this is where the source of the problems for business users funding the bi effort start to occur.
it’s eventually what forces a business to shut-down and re-build (from the ground up) the entire data warehouse. it’s the cause of the problem.
i’m sorry to be so adamant about this, but i’ve seen it in hundreds of business intelligence projects. i’ve also helped some of these companies avoid this pain in the future by moving to the data vault and following the standard and best practices that i’ve setup and defined.
etl performance: etl or el, or elt (doesn’t matter) is 4x to 10x slower when loading a table with primary/foreign keys, and indexes on when compared to an empty, truncated table with no primary/foreign keys, and no indexes. these factors matter. especially when the data set grows from 40 million to 400 million or to 1 billion rows to load per process.
etl performance to the data vault is fast, if the staging area is in the database to begin with. there are ways to get the database to “bypass logging”, shut-down indexes, delay foreign key checks, etc… if the operations are executed in database.
etl performance slows down if you go from source to data vault because of the timing issues, as well as the remote connection and ip transfer restrictions. note: this is *not* a problem of real-time or burst data arrival.
ease of load: well, it’s easier to load a standard staging table (to me) that mimics the source system, than it is to try and normalize the data (if i were to go from the source to the data vault directly). once all the data is sql accessible, has been aligned (datatypes only!!), and duplicates have been removed – then it becomes easy to load from stage to vault.
if you try to combine data type alignment, accessibility (direct to source), timing/availability, duplicate removal, source system ordering of load processes, and normalization in to a single etl process, you then are increasing (dramatically) the complexity factor of your source-to-warehouse load. when you increase the complexity ratings, your ability to remain agile as a solutions team begins to drop. your ability to maintain the system in an efficient and timely manner begins to drop. your ability to add new systems fast begins to drop.
there are too many benefits dropped (including scalability and flexibility) to use the data vault model as a staging area. but this is just my opinion.
i posted a blog entry on sequences recently, but here’s the gist: sequence numbers in the staging area should stay in the staging area. they are good for one thing only in the staging area: identification and removal of true duplicate rows. they should always start at 1 for new batch load cycles. they always need to be unique, but don’t need to be in order. this allows you to use this standard approach on any database system (columnar, appliance, mpp, smp, clustered… whatever). sequences in the data vault are there to stay, and are to be used as fast-join placeholders in the vault – they represent a 1:1 relationship with the business keys.
do any of you have a different opinion? do you agree with me? what do you think about this subject? pros and cons? register for the blog for free, then add your comment.
ps: don’t forget – there is a wide array of lessons about agility, flexibility, scalability, and architecture inside my on-line lessons at: http://datavaultalliance.com