There is a lot of talk about Data Vault automation. In this post I discuss the easiest method I know of to automate the loading of a Data Vault.
When you build ETL/ELT load routines for a Data Vault, you generally end up with hundreds of processes: at least one per source field, or set of source fields, tracked to each target.
For a single source table (here, one holding firm data), you end up with a minimum of the following load objects:
- Load Hub Firm
- Load Sat Firm
- Update Sat Firm End Date
That is a minimum of 3 ETL objects (data flow designs) for a single source table. With 25 staging tables, you end up with a minimum of 75 ETL objects.
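The arithmetic is simple enough to sketch in a few lines of Python. The function name and the default of three objects per table are illustrative assumptions taken from the single-table example above; a table that also feeds links or several satellites would need more.

```python
def min_etl_objects(staging_tables: int, objects_per_table: int = 3) -> int:
    """Lower bound on hand-built ETL objects: one set of objects per staging table."""
    return staging_tables * objects_per_table


print(min_etl_objects(1))   # 3
print(min_etl_objects(25))  # 75
```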
So how can we reduce the number of ETL objects and still maintain the loading cycle?
It’s really quite easy if you think about it. The solution is parameter-driven ETL routines. Yes, very easy.
You end up with the following patterns, each requiring only one load map:
- Load Hub
- Load Link
- Load Satellite
- Update Satellite End Date
- Update Last Seen Date Hub
- Update Last Seen Date Link
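As a rough illustration of "one load map per pattern", the sketch below keeps a single SQL template for each of two of the patterns above and reuses it for different targets. Everything named here is an assumption made for the example: the table and column names (`hub_firm`, `stg_firm`, `firm_number`, `hub_account`, and so on), the `last_seen_date` column, and the use of Python's `string.Template` in place of an ETL tool's own parameter mechanism.

```python
from string import Template

# One template per load pattern; the table and column names are filled in later.
# The :load_date and :record_source markers are left as bind placeholders.
PATTERNS = {
    "load_hub": Template(
        "INSERT INTO $target ($key_col, load_date, record_source) "
        "SELECT DISTINCT s.$key_col, :load_date, :record_source "
        "FROM $source s "
        "WHERE NOT EXISTS "
        "(SELECT 1 FROM $target t WHERE t.$key_col = s.$key_col)"
    ),
    "update_last_seen_hub": Template(
        "UPDATE $target SET last_seen_date = :load_date "
        "WHERE $key_col IN (SELECT $key_col FROM $source)"
    ),
}


def render(pattern: str, **params: str) -> str:
    """Fill one pattern template; raises KeyError if a parameter is missing."""
    return PATTERNS[pattern].substitute(**params)


# The same "Load Hub" template serves two hypothetical targets.
firm_sql = render("load_hub", target="hub_firm", source="stg_firm", key_col="firm_number")
account_sql = render("load_hub", target="hub_account", source="stg_account", key_col="account_number")
print(firm_sql)
print(account_sql)
```

Adding another hub then means adding another set of parameters, not another data flow design.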
Now, by parameters I mean very simple parameters. The following are necessary:
- the SQL statement to get the columns out of the source staging table
- the cross join parent lookup SQL statement (to find the parent Sequence Number)
- the INSERT statement or a dynamic target name, so that the target table can be changed on the fly
- in the case of updates, the dynamic UPDATE statement
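To make the four parameters concrete, here is a minimal sketch of a generic "Load Satellite" routine driven entirely by SQL passed in as parameters, using Python's built-in `sqlite3` module so that it runs anywhere. The schema (`stg_firm`, `hub_firm`, `sat_firm`), the parameter key names, and the `run_pattern` function are all hypothetical. The parent lookup is done row by row here for readability, where an ETL tool would do it as a set-based join, and change detection is left out entirely.

```python
import sqlite3


def run_pattern(conn, params, load_date):
    """Generic satellite load: every table-specific detail arrives in params."""
    inserted = 0
    for business_key, *attributes in conn.execute(params["source_sql"]).fetchall():
        # Find the parent sequence number for this business key.
        parent = conn.execute(params["parent_lookup_sql"], (business_key,)).fetchone()
        if parent is None:
            continue  # no parent hub row; a real load would log or reject this
        conn.execute(params["insert_sql"], (parent[0], load_date, *attributes))
        inserted += 1
    if params.get("update_sql"):
        conn.execute(params["update_sql"])  # e.g. close out superseded rows
    conn.commit()
    return inserted


# Hypothetical parameter set for one target; another satellite gets another dict.
SAT_FIRM_PARAMS = {
    "source_sql": "SELECT firm_number, firm_name FROM stg_firm",
    "parent_lookup_sql": "SELECT hub_firm_seq FROM hub_firm WHERE firm_number = ?",
    "insert_sql": "INSERT INTO sat_firm (hub_firm_seq, load_date, firm_name) "
                  "VALUES (?, ?, ?)",
    "update_sql": "UPDATE sat_firm SET load_end_date = ("
                  "SELECT MIN(n.load_date) FROM sat_firm n "
                  "WHERE n.hub_firm_seq = sat_firm.hub_firm_seq "
                  "AND n.load_date > sat_firm.load_date) "
                  "WHERE load_end_date IS NULL AND EXISTS ("
                  "SELECT 1 FROM sat_firm n "
                  "WHERE n.hub_firm_seq = sat_firm.hub_firm_seq "
                  "AND n.load_date > sat_firm.load_date)",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_firm (firm_number TEXT, firm_name TEXT);
CREATE TABLE hub_firm (hub_firm_seq INTEGER PRIMARY KEY, firm_number TEXT UNIQUE);
CREATE TABLE sat_firm (hub_firm_seq INTEGER, load_date TEXT,
                       load_end_date TEXT, firm_name TEXT);
INSERT INTO hub_firm (firm_number) VALUES ('F001');
INSERT INTO stg_firm VALUES ('F001', 'Acme'), ('F999', 'Orphan');
""")

first = run_pattern(conn, SAT_FIRM_PARAMS, "2012-12-01")
conn.execute("UPDATE stg_firm SET firm_name = 'Acme Ltd' WHERE firm_number = 'F001'")
second = run_pattern(conn, SAT_FIRM_PARAMS, "2012-12-02")
history = conn.execute(
    "SELECT load_date, load_end_date, firm_name FROM sat_firm ORDER BY load_date"
).fetchall()
print(first, second)  # 1 1
print(history)
# [('2012-12-01', '2012-12-02', 'Acme'), ('2012-12-02', None, 'Acme Ltd')]
```

The routine itself never mentions a firm. Pointing it at a different satellite only requires a different set of four statements.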
There are a few more pieces to get this going properly, but it can be done in Informatica, Kettle, SSIS, and SQL procedures, and I’ve been told it should work in DataStage too.
At the end of this, I end up with maybe 20 or 25 ETL “template” objects, and as many parameter files as I need to move the data from source to target. Each parameter file may have up to 4 SQL-based statements.
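A parameter file can be as plain as a handful of named SQL strings. The sketch below assumes a JSON layout and four key names of my own choosing for illustration; each ETL tool has its own parameter file format, so treat this as a picture of the idea and not as any tool's actual syntax.

```python
import json

# Hypothetical key names: at most four SQL statements per parameter file.
ALLOWED_KEYS = {"source_sql", "parent_lookup_sql", "insert_sql", "update_sql"}


def parse_parameter_file(text: str) -> dict:
    """Parse one parameter file and reject anything outside the four statements."""
    params = json.loads(text)
    unknown = set(params) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    if "source_sql" not in params:
        raise ValueError("source_sql is required")
    return params


# Example file contents for a hub load, which needs no parent lookup or update.
EXAMPLE = """
{
  "source_sql": "SELECT DISTINCT firm_number FROM stg_firm",
  "insert_sql": "INSERT INTO hub_firm (firm_number, load_date) VALUES (?, ?)"
}
"""

params = parse_parameter_file(EXAMPLE)
print(sorted(params))  # ['insert_sql', 'source_sql']
```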
If you are interested in downloading a working example of how to make this work, reply here and let me know. My newest online class at http://LearnDataVault.com/training/ will cover this in detail.
PS: The Data Vault 2.0 (DV2.0) specifications help with performance and consistency, and they make these techniques work much better. I hope to release them in January 2013.