Trials Of DV Code Generation

this entry is a brief walk-through of the issues around generating loading code for the data vault.  some of these you may have thought of; others, maybe not.  in any case, i’ve been working hard on rapidace v2, trying to get the code generation up as saas, but it’s been very difficult to get the templates right and accommodate most (if not all) of the special things that have to happen.  if you are writing or have written a code generator, you might find this interesting – or maybe you’ve already solved these issues, and if so, congrats…  it’s a tough challenge.

what are the issues?

the issues stem from metadata – all the different possible combinations of cross-reference metadata that can be used to “guide” the code-generation process.  i’ve run across hundreds of issues (so far), and they are especially thorny because the code generator i’ve written uses cross-references and templates exclusively to generate the code.  a few of the worst offenders are listed below:

  • composite hub keys
  • multiple hierarchical lookups for dependent children (link tables needing two parent hub lookups on the same field)
  • ok – composite hub keys coupled with multiple hierarchical lookups for dependent children.
  • multi-layered link tables (shouldn’t have this, but can!)
  • detecting and identifying sequence fields (by naming convention, data type, etc.)
  • utilizing customized satellite keys (how do you generate a working load pattern for a customized sat?)
  • degenerate hubs (not-physically implemented, except for a key in the link table)
  • field operations, filters, joiners, expressions, conditions
  • field name changes (from input to output fields)
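to make the first item concrete, here is a minimal sketch of what template-driven generation of a hub load looks like when the business key is composite.  the metadata shape, table names, and sql dialect below are purely illustrative – this is not rapidace internals, just the general technique: a composite key simply means more than one business-key column flows through the cross-reference metadata into the key-match predicate.

```python
# hypothetical sketch: rendering a hub-load INSERT from cross-reference
# metadata.  names and sql dialect are illustrative only.

HUB_TEMPLATE = """INSERT INTO {hub} ({seq}, {bk_cols}, load_dts, rec_src)
SELECT nextval('{hub}_seq'), {stage_cols}, current_timestamp, '{rec_src}'
FROM {stage} stg
WHERE NOT EXISTS (
    SELECT 1 FROM {hub} h WHERE {key_match}
)"""

def generate_hub_load(meta):
    """render a hub loader; a composite hub key is just more than one
    (stage_col, hub_col) pair in the business_keys cross-reference."""
    bks = meta["business_keys"]  # list of (stage_col, hub_col) pairs
    return HUB_TEMPLATE.format(
        hub=meta["hub"],
        seq=meta["sequence_field"],
        bk_cols=", ".join(h for _, h in bks),
        stage_cols=", ".join(f"stg.{s}" for s, _ in bks),
        key_match=" AND ".join(f"h.{h} = stg.{s}" for s, h in bks),
        stage=meta["stage"],
        rec_src=meta["record_source"],
    )

print(generate_hub_load({
    "hub": "hub_customer",
    "sequence_field": "customer_sqn",
    "stage": "stg_customer",
    "record_source": "crm",
    "business_keys": [("cust_no", "customer_no"),
                      ("region_cd", "region_code")],
}))
```

the hard part, of course, is that each additional list item above (dependent-child lookups, degenerate hubs, renamed fields) adds another dimension to that metadata dict, and the template count multiplies.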

the system i’ve constructed for rapidace v2 handles all of these requirements.  now, couple this with the ability to “generate generic metadata” for any etl template to “use” to output mappings, and that’s where some of the complications come in.  for instance, some etl engines handle filters differently, just as they handle expressions differently (pretty much everything is handled differently).
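the “generic metadata” idea can be sketched as an engine-neutral intermediate form with one renderer per target engine.  the two output formats below are invented examples (no real tool’s persistence format), but they show why the same filter needs a different emitter per engine:

```python
# sketch: one engine-neutral filter definition, two engine-specific
# renderings.  both output formats are made up for illustration.

generic_filter = {"op": "and", "terms": [("status", "=", "'A'"),
                                         ("amount", ">", "0")]}

def render_sql(f):
    # a sql-pushdown engine might want the filter as a where clause
    return " AND ".join(f"{c} {o} {v}" for c, o, v in f["terms"])

def render_xml(f):
    # a gui etl tool might persist the same logic as an xml component
    terms = "".join(f'<cond field="{c}" op="{o}" value={v!r}/>'
                    for c, o, v in f["terms"])
    return f'<filter logic="{f["op"]}">{terms}</filter>'

print(render_sql(generic_filter))   # status = 'A' AND amount > 0
print(render_xml(generic_filter))
```

multiply this by every construct an etl engine supports – joiners, expressions, lookups, conditions – and the renderer layer quickly dwarfs the generic model.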

i saw a generic approach at the #dvseminar in the netherlands recently: produce a single hub load, a single link load, and a single satellite load.  this is very intriguing, given that these etl engines are so heavily structure-driven.  i’m beginning to wonder if we should be converting all our data sets to xml and calling it good?

what have you run into?

what types of issues have you had to solve in your code generators?   i’d be curious to hear about it.

dan linstedt


One Response to “Trials Of DV Code Generation”

  1. Eric Steele 2011/06/03 at 10:04 am

    I can only imagine the complexity of trying to handle this many nuances for multiple ETL engines.

    I wrote a set of Python scripts that completely regenerate our loaders. They pull from one system, Consensus, and populate one ETL tool, Pentaho, and I had plenty of hurdles to figure out just doing that.

    The Pentaho transformations and jobs are XML files, so I have a Python module that keeps an XML template for each kind of component that I would need to use. I then populate the templates with the specific data that is needed for the specific loader and stack the filled templates together to generate the completed files.
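    The template-stacking approach described above can be sketched roughly like this. To be clear, the tag names and step structure below are invented placeholders, not real Pentaho transformation XML; only the technique (fill one XML fragment per component, concatenate into a file) reflects the comment:

```python
# sketch of "fill per-step xml templates, stack into one file".
# tag names are invented, not real pentaho .ktr markup.
from string import Template

STEP_TEMPLATES = {
    "table_input": Template(
        "<step><name>$name</name><type>TableInput</type>"
        "<sql>$sql</sql></step>"),
    "table_output": Template(
        "<step><name>$name</name><type>TableOutput</type>"
        "<table>$table</table></step>"),
}

def build_transformation(steps):
    # steps: list of (step_kind, params) pairs, rendered in order
    body = "".join(STEP_TEMPLATES[kind].substitute(params)
                   for kind, params in steps)
    return f"<transformation>{body}</transformation>"

ktr = build_transformation([
    ("table_input",  {"name": "read stage",
                      "sql": "SELECT * FROM stg_customer"}),
    ("table_output", {"name": "write hub", "table": "hub_customer"}),
])
print(ktr)
```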
