In a recent post, I described loading patterns for Data Vault and how this might work going forward. There was a question asked about comparing the loading patterns for loading source->staging->DV per system versus for all systems simultaneously. Within the coaching area I go through these topics in great detail, and discuss the pros and cons, the different options, and also teach how-to make this a success. In this blog entry, I will highlight some of the issues, and provide a few hints and techniques on making each one work.
What do I need to know?
First, you need to know that both patterns accomplish the same goal. Then you should probably know that each one provides a potentially different result when querying the data (depending on design choices). Finally, you should probably also be aware that the closer you get to real-time information processing, the more likely you are to be forced to load data in parallel all the time (regardless of system).
So, then. Tell me about PER SYSTEM style.
When you load on a per-system basis, you are making a conscious choice to dictate that the “most recent system loaded within a given batch cycle – is in fact the Master system.” This is especially true if you are loading multiple systems to a single Satellite rather than splitting each Satellite by source system. *Note: this is discussed at length in the technical modeling book.
Per system loads, also indicate that the data from the entire system is 100% ready to be sourced at the same time. This can be difficult, and in some cases can cause the loads to wait far too long to even get started. Now, you could split the load into components, like: load all of single system to stage when the data is ready. Once all of stage is loaded for a system, continue downstream loading stage to Data Vault, etc… But this may lead to all kinds of interesting problems in the link tables – where associations are “expected” to be there, but arent, or don’t arrive until later when another part of the source is loaded.
Another statement you are making when you load per-system, is that: ALL DATA is self-contained. Meaning that none of the incoming data uses data from other systems during the loading cycle. But this is OK in most cases, as the Link tables are late-binding data driven anyhow.
The only time this “model” would cause a problem, is if the loading cycle took the data set from the Data Vault and moved it to Data Marts downstream without waiting for a complete view of the data to be loaded to the warehouse. Which brings up an interesting point in real-time feeds… When is it “safe” to move data downstream, regardless of the load style? (Which I’ll answer in another blog post later).
Ok – their are several issues with this style of loading and managing components, the best practice or recommended rules are NOT to follow this, but rather to follow a different rule. (provided further down the post).
Loading All Systems Simultaneously.
This method is the end-game. This is the point we arrive at, especially when loading the Data Vault in real-time. Real-time loading is different than Batch Loading. Today the lines are blurred, especially when discussing micro batch or mini batch. I put real-time loading at the 3 second interval arrival latency or less. Now, in general: if you are following this particular method (batch or not), you realize quickly that collisions happen in the Hubs and Links, and possibly in the Satellites. Well, the best practice is to split Satellites by source system (I explain why this is in the Technical book, and in the coaching area).
Ultimately the collisions in the Hubs and Links are Insert Only collisions, and if they can’t be managed by a read-only not-exists query, then they must be managed by the database deadlock mechanism, and two-phase commit. The ultimate rule there, is: whomever gets the lock wins the insert, the other is rejected as duplicate business key. This is one of the many reasons I also prefer database engines where the IDENTITY or SEQUENCE column is built-in to the tables, and isn’t a separate object.
Conclusions on this topic:
When thinking about batch loading, especially in big numbers, I prefer to:
a) create a batch load cycle
b) load ALL systems as soon as they are ready in massive parallelism to the staging area
c) once all the data has reached the staging area, begin loading all Hubs in Parallel
d) Then load all the Links in parallel
e) then load all the Satellites in parallel.
For me this drives out natural MASTER system selection PER SOURCE SYSTEM, as one Hub’s master is anothers’ child. This allows different master systems to dictate origination points for different keys. I guess you could say I use a bit of both techniques.
After all the data is loaded to the Data Vault, I then proceed to load ALL DIMENSIONS in parallel, followed by all FACT tables. This achieves highest parallelism possible for large scale batch loads. Real-time loads compete for resources, and ultimately use the database two-phase commit and locking system to decide who wins out.
Again, HOW and WHY are covered in detail in the coaching area.