#datavault and #automation – reduce and reuse

There is a lot of talk about Data Vault automation.  In this post I discuss the easiest method possible to automate the loading of Data Vault.

The problem…

When you build ETL / ELT load routines for loading a Data Vault, you generally end up with hundreds of processes: at least one per source field (or set of source fields) tracked to each target.  So if you have the following source table:

NDC_FIRMS
firm_name
firm_address
firm_city
firm_state
firm_country

You end up with a minimum of the following load objects:

  • Load Hub Firm
  • Load Sat Firm
  • Update Sat Firm End Date

That is a minimum of 3 ETL objects (data flow designs) for one single source table.  So if you have 25 staging tables, you end up with a minimum of 75 ETL objects.
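
For the NDC_FIRMS example above, those three objects boil down to SQL along these lines. This is only a rough sketch: the hub and satellite names, the sequence, and the audit columns are assumptions I have made for illustration (the post only names the staging columns), and the satellite delta check is left out.

    # Rough sketch (Python strings holding SQL) of the three load objects
    # for NDC_FIRMS. Hub/satellite names, the sequence, and the audit
    # columns are assumed for illustration; only the staging columns come
    # from the post. Satellite delta detection is omitted for brevity.

    LOAD_HUB_FIRM = """
        INSERT INTO HUB_FIRM (FIRM_SQN, FIRM_NAME, LOAD_DTS, RECORD_SRC)
        SELECT NEXT VALUE FOR SEQ_FIRM, s.firm_name, CURRENT_TIMESTAMP, 'NDC_FIRMS'
        FROM   STG_NDC_FIRMS s
        WHERE  NOT EXISTS (SELECT 1 FROM HUB_FIRM h
                           WHERE  h.FIRM_NAME = s.firm_name)
    """

    LOAD_SAT_FIRM = """
        INSERT INTO SAT_FIRM (FIRM_SQN, LOAD_DTS, FIRM_ADDRESS, FIRM_CITY,
                              FIRM_STATE, FIRM_COUNTRY)
        SELECT h.FIRM_SQN, CURRENT_TIMESTAMP, s.firm_address, s.firm_city,
               s.firm_state, s.firm_country
        FROM   STG_NDC_FIRMS s
        JOIN   HUB_FIRM h ON h.FIRM_NAME = s.firm_name
    """

    END_DATE_SAT_FIRM = """
        UPDATE SAT_FIRM
        SET    END_DTS = CURRENT_TIMESTAMP
        WHERE  END_DTS IS NULL
        AND    EXISTS (SELECT 1 FROM SAT_FIRM newer
                       WHERE  newer.FIRM_SQN = SAT_FIRM.FIRM_SQN
                       AND    newer.LOAD_DTS > SAT_FIRM.LOAD_DTS)
    """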

So how can we reduce the number of ETL objects and still maintain the loading cycle?

The solution…

It’s really quite easy if you think about it.  The solution: parameter-driven ETL routines. Yes, very easy.

So, you end up with the following patterns (requiring only one load map per pattern):

  • Load Hub
  • Load Link
  • Load Satellite
  • Update Satellite End Date
  • Update Last Seen Date Hub
  • Update Last Seen Date Link

Now, by parameters I mean very simple parameters.  The following parameters are necessary (a sketch of one such parameter set appears after the list):

  1. the SQL statement to get the columns out of the source staging table
  2. the cross join parent lookup SQL statement (to find the parent Sequence Number)
  3. the Insert statement or a dynamic target name – so that the target table can be changed on the fly
  4. in the case of updates – the dynamic update statement.
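
To make that concrete, here is a minimal sketch of what one parameter set for the Load Satellite pattern might contain. The key names, placeholder style, and exact SQL are my own illustrative assumptions; the post does not prescribe a particular file format.

    # Hypothetical parameter set driving a generic "Load Satellite" template
    # for SAT_FIRM. Key names, table names, and placeholder style are assumed.
    SAT_FIRM_PARAMS = {
        # 1. pull the tracked columns out of the source staging table
        "source_sql": """SELECT firm_name, firm_address, firm_city,
                                firm_state, firm_country
                         FROM   STG_NDC_FIRMS""",

        # 2. parent lookup to resolve the hub sequence number
        "parent_lookup_sql": """SELECT FIRM_SQN FROM HUB_FIRM
                                WHERE  FIRM_NAME = ?""",

        # 3. dynamic target: the insert statement for this satellite
        "insert_sql": """INSERT INTO SAT_FIRM (FIRM_SQN, LOAD_DTS, FIRM_ADDRESS,
                                               FIRM_CITY, FIRM_STATE, FIRM_COUNTRY)
                         VALUES (?, CURRENT_TIMESTAMP, ?, ?, ?, ?)""",

        # 4. dynamic update statement, used by the end-date pattern
        "update_sql": """UPDATE SAT_FIRM
                         SET    END_DTS = CURRENT_TIMESTAMP
                         WHERE  FIRM_SQN = ? AND END_DTS IS NULL""",
    }

In practice you would keep one such parameter file per source-to-target mapping, while the ETL template itself never changes.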

There are a few more pieces to get this going properly, but it can be done in Informatica, Kettle, SSIS, SQL Procedures, and I’ve been told it should work in Data Stage too.

At the end of this, I end up with maybe 20 or 25 ETL “template” objects, and as many parameter files as I need to move the data from source to target.  Each parameter file may have up to four SQL-based statements.
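
As a sketch of the idea only (not the actual template shipped with the class), a generic Load Satellite driver can be reduced to a few lines once the SQL lives in a parameter set like the one above; hubs, links, and the end-date updates follow the same shape with different parameters.

    import sqlite3  # any DB-API style connection works; sqlite3 keeps the sketch self-contained

    def load_satellite(conn, params):
        """One generic template: executes whatever SQL the parameter set supplies."""
        rows = conn.execute(params["source_sql"]).fetchall()
        for business_key, *attributes in rows:
            # resolve the parent sequence number via the lookup SQL
            parent = conn.execute(params["parent_lookup_sql"], (business_key,)).fetchone()
            if parent is None:
                continue  # hub load runs first; orphan rows are skipped here
            # insert into whatever target the parameter set names
            conn.execute(params["insert_sql"], (parent[0], *attributes))
        conn.commit()

    # usage: load_satellite(sqlite3.connect("dv.db"), SAT_FIRM_PARAMS)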

In case you are interested in downloading a working example of how to make this work, you can reply here (let me know you are interested).  My newest on-line class over at http://LearnDataVault.com/training/ will cover this in detail.

PS: The Data Vault 2.0 specifications help with performance and consistency, and they make these techniques work much better.  I hope to release them in January 2013.

11 Responses to “#datavault and #automation – reduce and reuse”

  1. ramces 2012/11/26 at 7:59 am #

    Hello,

    I’m interested in this part.

    Thank you for all

  2. Pierre 2012/11/26 at 10:21 am #

Hi Dan, I’m interested in downloading an SSIS working example. How can I get the code downloaded?

    Thank You,
    Pierre.

  3. dlinstedt 2012/11/26 at 10:34 am #

    Hi Pierre,

    The code / working examples will be included as part of the on-line class. Once I have the class / lessons posted, you can purchase access to the on-line lesson, which will give you the working example (along with the video that shows you how to actually make it work).

I hope to post the first class (using Informatica PowerCenter 9.x) + examples by the 15th of December. Shortly thereafter (in January) I should be able to post the SSIS edition of the class.

    Thanks,
    Dan L

  4. dlinstedt 2012/11/26 at 10:35 am #

    Hi Ram,

    Thanks for your feedback. You can watch the release (or register for the mailing list) by going to http://LearnDataVault.com

    The class will be for sale (for a reasonably priced fee) very very shortly. I hope to have one class for PowerCenter, one for SSIS, and one for Pentaho Kettle on the site.

    Thanks,
    Dan Linstedt

  5. Jonathan 2012/11/26 at 10:36 am #

    Hey Dan,

    I am very interested in reviewing a working example of this concept. Penu and I have been floating around ideas on how to do this, and would like to see your approach.

    Thanks,
    Jonathan

  6. dlinstedt 2012/11/26 at 10:38 am #

    Hi Jonathan,

Basically (for all intents and purposes), I will be posting this as an on-line class with working examples.

    In reality this makes “nearly all code generators for loading Data Vault” – obsolete.

    Instead of “generating the code” for each of the ETL tools, now all you need to generate are the parameter files.

    Makes it REALLY simple, and easy to maintain. EXCEPT in one particular area: loading Satellites. This can become difficult, but there are several solutions to this that you might like. I will be covering all of this in detail (pros/cons/risks, etc..) in my upcoming on-line class.

    Thanks,
    Dan Linstedt

  7. Stephen Schleicher 2017/11/08 at 10:55 am #

    Hello Dan,

    I’m a software engineer and recently read your Data Vault 2.0 book (awesome read btw). We currently have a 100% metadata driven ETL engine that generates star schemas. After reading your book (and specifically the chapter on metadata manager), I quickly began to see the same possibility of automating the ETL/ELT loads into the vault based on the data in the metadata manager. We would like to modify our existing engine to support the patterns related to data vault so that we can achieve the equivalent metadata driven process for the vault as we did for our star schema.

I do have a question though as it relates to AnalytiX DS. I’m a little confused about the mapping manager they speak of that drives the ETL. Is this mapping manager the equivalent of the meta mart mentioned in your book? The meta mart certainly seems to have all the needed information to drive an automated ETL process. Or does the mapping manager extend the meta mart with additional schemas to support the product?

Also, I would be very interested in the working sample mentioned in this article. Would love a link to any sample you may have.

    Cheers!

  8. Dan Linstedt 2017/11/09 at 8:37 am #

    Hi Stephen,

    Thank you for reading our book, we really appreciate it. I will be sure to pass along the message to Michael. Glad to hear you have developed a product, happy to have more automation products in the market place.

    Regarding AnalytixDS – have you looked at their web-site? They have a product called Mapping Manager, which starts with Source To Target mappings, and then, from there – has a ton of functionality. Regarding the metadata that we have in the book – No. We did not use AnalytixDS or any other product to write the book. The Meta-Mart is one of your own design. Meta Data must be kept somewhere globally, and there are many different products that do this.

    To be clear: Mapping Manager generates ELT and ETL code for other implementation engines: SQL, Map/Reduce, Informatica, Data Stage, SSIS, Talend, and so on…. Mapping Manager is extremely capable, and can also do visual drag and drop source to target, visual lineage, reference data management, testing automation, and more…. As a side note: I partner with both WhereScape and AnalytixDS for automation functionality in the market place.

Regarding the sample specially developed by me: I built the patterns into Informatica and SQL, and put them into the hands-on classes that are currently listed on http://LearnDataVault.com

    You have to take the classes to download the samples.

    Thank you kindly,
    Dan Linstedt

  9. Stephen Schleicher 2017/11/09 at 10:29 am #

Thanks for responding, Dan. I did visit their website (went through the high level stuff and signed up for their seminar happening soon). And I do realize the mapping manager generates the ELT/ETL and that you didn’t base the book on that. I was actually hoping for the opposite (that they kind of built it to leverage a meta mart). In order to create the automated ETL, it would seem like that product would need, at a minimum, all the information that you lay out in the samples from the book regarding the metadata manager (mappings, source system precedence, hub info, etc…). So if we already made an effort to build our meta mart, would buying that product leverage all the information in the meta mart, or would we have to migrate all of our meta mart data into mapping manager and have duplicate places to maintain this information concerning our vault?

Does that make more sense as to what I’m asking (sorry if it’s not clear), or am I completely missing the point of the metadata mart? I thought three of the biggest benefits of having it were to describe the vault by showing lineage, support self-service BI efforts, and provide the foundation for building automation.

  10. Dan Linstedt 2017/11/09 at 10:39 am #

    Hi Stephen,

Actually, for AnalytixDS – you CAN import tab-delimited / CSV formats of metadata, so all you would really need to do is build a view that provides the information in the format that ADS accepts or imports…. That is an easy thing; then export to a file, and away you go…. Yes, you can do that – no problem!

    However, it would be two separate sets of metadata at that point…. ADS would “not” synchronize with your meta-mart that you have built. On the other hand you CAN write your own “extension” to plug in to the environment, and import those definitions on demand. OR you can write something command line driven to import the file (I believe their latest edition does have a command line available?? you’d have to check with them).

    But yes – it would keep the imported metadata separate from what you have in your meta-mart.

    Dan

  11. Stephen Schleicher 2017/11/09 at 11:00 am #

    That makes sense. Thanks again!
