this entry is a brief walk-through of issues around generating loading code for the data vault. some of these you may have thought of, others maybe not. in any case, i’ve been working hard on rapidace v2, trying to get the code generation up as saas, but it’s been very difficult to get the templates right and accommodate most (if not all) of the special things that have to happen. if you are writing, or have written, a code generator, you might find this interesting – or maybe you’ve already solved these issues, and if so, congrats… it’s a tough challenge.
what are the issues?
the issues stem from metadata: all the different possible combinations of cross-reference metadata that can be used to “guide” the code-generation process. i’ve run across hundreds of issues (so far); a few of the particularly thorny ones are listed below. they bite all the harder because the code generator i’ve written uses cross-references and templates exclusively to generate the code.
- composite hub keys
- multiple hierarchical lookups for dependent children (link tables needing two parent hub lookups on the same field)
- ok – composite hub keys coupled with multiple hierarchical lookups for dependent children.
- multi-layered link tables (shouldn’t have this, but can!)
- detecting and identifying sequence fields (by naming convention, data type, etc.)
- utilizing customized satellite keys (how do you generate a working load pattern for a customized sat?)
- degenerate hubs (not physically implemented, except for a key in the link table)
- field operations, filters, joiners, expressions, conditions
- field name changes (from input to output fields)
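to make the composite-key issue concrete, here’s a minimal sketch of a template-style generator producing a hub-load statement from cross-reference metadata. the function and metadata names are mine for illustration – this is not rapidace code – but it shows why a template hard-wired to a single key column breaks on composite keys: every key column has to appear in the lookup join.

```python
# illustrative sketch only; all names (render_hub_load, the metadata
# keys, the table names) are invented for the example.

def render_hub_load(meta: dict) -> str:
    """build a set-based hub load statement from hub metadata."""
    bk_cols = meta["business_keys"]  # one column, or several for a composite key
    select_bks = ", ".join("stg." + c for c in bk_cols)
    # a composite key forces every key column into the join that
    # detects new keys -- this is where single-column templates fail
    join_pred = " and ".join(f"stg.{c} = hub.{c}" for c in bk_cols)
    return (
        f"insert into {meta['hub']} ({meta['sequence']}, {', '.join(bk_cols)}, load_dts, rec_src)\n"
        f"select {meta['sequence_source']}, {select_bks}, stg.load_dts, stg.rec_src\n"
        f"from {meta['staging']} stg\n"
        f"left join {meta['hub']} hub on {join_pred}\n"
        f"where hub.{bk_cols[0]} is null"
    )

sql = render_hub_load({
    "hub": "hub_customer",
    "staging": "stg_customer",
    "sequence": "customer_sqn",
    "sequence_source": "seq_customer.nextval",
    "business_keys": ["customer_no", "source_system"],  # composite business key
})
print(sql)
```

the same metadata record drives the column list, the select list, and the join predicate, so adding a second key column changes all three at once – exactly the coordination a naive template misses.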
the system i’ve constructed for rapidace v2 handles all of these requirements. now, couple this with the ability to “generate generic metadata” for any etl template to “use” to output mappings, and that’s where some of the complications come in. for instance, some etl engines handle filters differently, just as they handle expressions differently (nearly everything is handled differently).
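as a toy illustration of the “everything is handled differently” problem, here’s one generic filter definition rendered for two hypothetical targets: a sql-generating template, and an expression-based engine. the operator mappings and function names are assumptions for the example, not any real etl product’s syntax.

```python
# a single engine-neutral filter definition (invented structure)
filter_meta = {"field": "order_status", "op": "not_equals", "value": "CANCELLED"}

def render_sql_filter(f: dict) -> str:
    # a sql-generating template folds the filter into a where clause
    ops = {"equals": "=", "not_equals": "<>"}
    return f"where {f['field']} {ops[f['op']]} '{f['value']}'"

def render_expression_filter(f: dict) -> str:
    # an expression-based engine wants a row-level boolean expression instead
    ops = {"equals": "==", "not_equals": "!="}
    return f"{f['field']} {ops[f['op']]} '{f['value']}'"

print(render_sql_filter(filter_meta))
print(render_expression_filter(filter_meta))
```

the generic metadata stays identical; only the renderer changes per engine – which is the whole point, and also the whole difficulty, since each engine needs its own renderer for filters, expressions, joiners, and so on.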
i saw a generic approach in the netherlands at the #dvseminar recently: produce a single hub load, a single link load, and a single satellite load. this is very intriguing, given that these etl engines are so heavily structure-driven. i’m beginning to wonder if we should be converting all our data sets to xml and calling it good.
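the single-load idea can be sketched as one generic routine that loads any hub at runtime from metadata, instead of generating a separate mapping per hub. below is a minimal sketch using sqlite as a stand-in for the warehouse; the table and key names are invented for the example, and a real version would also carry load date and record source.

```python
# sketch only: a metadata-driven generic hub loader, assuming the
# metadata supplies hub table, staging table, and business-key columns.
import sqlite3

def load_hub(conn, meta):
    bks = meta["business_keys"]
    join = " and ".join(f"stg.{c} = hub.{c}" for c in bks)
    # insert only business keys not already present in the hub
    conn.execute(
        f"insert into {meta['hub']} ({', '.join(bks)}) "
        f"select distinct {', '.join('stg.' + c for c in bks)} "
        f"from {meta['staging']} stg "
        f"left join {meta['hub']} hub on {join} "
        f"where hub.{bks[0]} is null"
    )

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_customer (customer_no, source_system)")
conn.execute("create table hub_customer (customer_no, source_system)")
conn.executemany("insert into stg_customer values (?, ?)",
                 [("C1", "CRM"), ("C1", "CRM"), ("C2", "ERP")])
load_hub(conn, {"hub": "hub_customer", "staging": "stg_customer",
                "business_keys": ["customer_no", "source_system"]})
print(conn.execute("select count(*) from hub_customer").fetchone()[0])  # -> 2
```

one routine, parameterized entirely by metadata, replaces n generated hub mappings – the trade-off being that the structure now lives in data at runtime instead of in generated, inspectable code, which is exactly what makes it awkward for structure-driven etl engines.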
what have you run into?
what types of issues have you had to solve in your code generators? i’d be curious to hear about it.