Code Generation for Data Vault, not as easy as you think!

Please note: RapidACE has been decommissioned and removed from the marketplace – it is no longer accessible.

Sorry for my long silence; I’ve been hard at work completely rebuilding the core of RapidACE, and will soon make it available as SaaS. Anyhow, this entry is about code generation for the Data Vault: what you should consider, what you need to demand from the vendors of tools that generate ETL/ELT code, and what you should come to expect in this market.

It’s a well-known fact that I’ve been working on RapidACE. The acronym stands for Rapid Architecture Consolidation Enablement (or Engine), whichever you prefer. The whole notion of “maturing” your EDW is to get the entire architecture, from ETL to data model to project, under control – basically, to formalize it using standards and best practices. Once it has been assigned, designed, and prescribed, usually with a bit of brow sweat and up-front hard work, the proper solution can be generated with near 90% accuracy.

Of course, your mileage may vary depending on the vendor you choose… especially if that vendor has not elected to work with me. However, back to the point: there are certain things you should ask of any vendor providing a tool that generates Data Vault models. Hopefully I can shed some light on these considerations.

The inputs required by RapidACE (new version) are full and complete. They require up-front work on the definition and understanding of business keys, particularly since RapidACE works to consolidate multiple disparate models together. The inputs you should be providing your tool include the following:

  • Source data model – preferably in DDL format, but could be XML, XSD, DTD, COBOL copybooks, or standard table structures.
  • Target data model – the same as above.
  • Ontology – RDF2 or OWL format – used to describe abbreviations, shared definitions, business keys, and a number of other things.
  • Mapping design preference – what good is a generator that doesn’t use a template system to produce its mappings for you?
  • ETL naming conventions – for whatever ETL, ELT, or SQL you are generating, you should be able to specify the naming conventions you’d like to use.
  • Data model naming conventions – these should specify the prefixes and suffixes used to classify and identify table structures in the DDL.
  • Normalization preferences – applied when reading XML, XSD, DTD, or COBOL copybooks – the model should be normalized first, before consolidation.
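To make the list above concrete, here is a rough sketch, in Python, of what such a set of generator inputs might look like. All of the names and the structure below are invented for illustration – this is not RapidACE’s actual interface, just one plausible shape for it.

```python
# Hypothetical shape for the inputs a Data Vault code generator might require.
# Every key and file name here is illustrative, not any vendor's real API.
generator_inputs = {
    "source_model": "source_erp.ddl",          # DDL, XML, XSD, DTD, or COBOL copybook
    "target_model": "enterprise_dv.ddl",
    "ontology": "business_keys.owl",           # RDF/OWL: abbreviations, definitions, business keys
    "mapping_template": "templates/hub_load.tmpl",
    "etl_naming": {"load_job": "LOAD_{target_table}"},
    "model_naming": {"hub_prefix": "HUB_", "sat_prefix": "SAT_", "link_prefix": "LNK_"},
    "normalize_hierarchical_sources": True,    # normalize XML/XSD/DTD/copybooks before consolidation
}

def validate_inputs(inputs: dict) -> list:
    """Return the sorted list of required generator inputs that are missing."""
    required = {"source_model", "target_model", "ontology",
                "mapping_template", "etl_naming", "model_naming"}
    return sorted(required - inputs.keys())

print(validate_inputs(generator_inputs))  # -> []
```

The point of the validation step is exactly the point of this entry: if any of these inputs is missing, the generator is guessing, and the output quality drops accordingly.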

And there are a few more that I can think of… However, one of the points of this entry is to show you that generating ETL/ELT code for your environment, particularly for multiple-model situations, is not necessarily an easy process. Nor is it cut and dried. The largest issue I’ve heard about is the lack of up-front work done to identify business keys. It’s the business keys in the Data Vault that make all the difference.

One more thought on the subject…

Most tools, even if they generate just ETL, should use the above inputs at a minimum. If they don’t, then you should be asking them: why not?

At the end of the day, if nothing else, you should remember this: you can’t (or shouldn’t) forward engineer (automatically generate) a Data Vault model directly, 1-for-1, from the source model. This is not the proper way to do business, and actually following this line of thinking can turn your Data Vault model into mush. If you or your vendor wants to generate Data Vault models, you (or your vendor) must undertake the task of identifying and using business keys in the generation. This is the proper, and only, way to truly auto-generate a Data Vault model, along with the ETL code underneath.

A lot of engineering thought must go into the proper preparations, or the model will not fit the bill, nor rise to the occasion.

Thoughts? What are your experiences?

Dan Linstedt


7 Responses to “Code Generation for Data Vault, not as easy as you think!”

  1. Roland Bouman 2011/04/27 at 3:38 pm #

    Hi Dan,

    Forgive my ignorance, but why can’t or shouldn’t tools infer the business keys from database metadata?

    Perhaps I am underestimating the task, but it seems to me the data dictionaries of many RDBMSes offer a lot of opportunities to discover the primary key and unique constraints. With some SQL query smarts it seems doable to discover whether a particular key is an artificial key (in which case I would figure it is a poor choice as business key) or a natural key (in which case it seems a reasonable default to pick the smallest one in terms of number of columns and/or amount of data).

    I can imagine that such a technical approach is not satisfying in all cases, so I agree that a tool should provide control over the choice of business key. But – and again, this may be ignorance on my part – why would you require full-blown RDF2 or OWL support to build an ontology?

    kind regards,


  2. dlinstedt 2011/04/27 at 4:01 pm #

    Hi Roland,

    No forgiveness necessary. It is a valid question, and I’ll do my best to answer it.

    There is only a limited amount of “intelligence and context” about determining the columns / candidates for “business keys” in the database metadata. Databases do offer discovery opportunities through profiling and unique constraints, along with primary and foreign keys. However, they lack the necessary algorithms and metadata to find out “what the business is truly using”. Especially when it comes to “composite business key use”, or when the RDBMS disagrees with the application logic, or the constraints are broken and not enforced. I’ve seen many systems that say: “Foreign Key”, yet when the data is inspected, there are missing values, and the FK is invalid (broken) due to poor loading practices. I’ve also seen unique constraints which are broken, and I’ve seen data that is “NULL” in columns that are supposed to be the business key. Routines and algorithms can only get you there 20% to 30% of the way.
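    A quick sketch of the kind of profiling check described above – a column the metadata declares as a key, contradicted by the data itself. The table, column names, and rows are invented for illustration:

```python
import sqlite3

# In-memory table where customer_num is the *supposed* business key.
# The sample data deliberately contains a duplicate and a NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_num TEXT, name TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [("C001", "Acme"), ("C002", "Globex"),
                 ("C002", "Globex Corp"), (None, "Unknown")])

def profile_candidate_key(con, table, column):
    """Check whether a declared key column actually behaves like a business key."""
    total, = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    nulls, = con.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()
    distinct, = con.execute(f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()
    return {"nulls": nulls, "duplicates": (total - nulls) - distinct}

print(profile_candidate_key(con, "customer", "customer_num"))
# A trustworthy business key would report zero nulls and zero duplicates;
# here both checks fail, which is exactly what algorithms alone cannot repair.
```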

    This is especially true when trying to “build or forward engineer” a Data Vault model from 2 or more source systems. The complexity of “finding business keys” automagically, without help, tends to increase on an exponential scale for every new model that is “introduced”.

    The ontology is a real-world application of accurately building a “corrected foundation” from which algorithms can glean “intelligence” about business keys, raising the success percentages all the way up to 60% to 90% correct, and in some cases, 99% correct.

    The RDF or OWL is merely a storage/declaration method for the ontology, which has become widely accepted as the semantic data exchange preference. Anyhow, the largest point I can make here is this: the Ontology does NOT need to be a full ontology; it can be a sparse representation of just the interactions/connections of the business keys.
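    To illustrate just how sparse that representation can be, here is a sketch in plain Python. In practice this would be declared in RDF/OWL as described above; the dictionary below is only a stand-in, and every name in it is invented:

```python
# A "sparse ontology": nothing but the business-key connections across systems.
# All concept, system, and column names are hypothetical.
sparse_ontology = {
    "Customer": {
        "business_key": "customer_number",
        "synonyms": {              # how each source system spells the same key
            "crm": "cust_no",
            "billing": "CUSTNUM",
        },
    },
}

def resolve_business_key(concept, system):
    """Return the column name a given source system uses for a shared business key."""
    entry = sparse_ontology[concept]
    return entry["synonyms"].get(system, entry["business_key"])

print(resolve_business_key("Customer", "billing"))  # -> CUSTNUM
```

    Even this tiny mapping gives a generator the one fact the database metadata cannot: that these differently named columns are the same business key.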

    More on this later, I’ve got to run out right now, but this is definitely worth discussing.

    Dan L

  3. Roland Bouman 2011/04/27 at 4:14 pm #

    Dan, thanks for your reply. Looking forward to discussing this more in the future.

  4. Kent Graziano 2011/04/28 at 8:40 am #

    Granted, most source systems I have looked at over the years end up being woefully lacking in proper design and normalization with minimal referential integrity being enforced in the database. That said, I am thinking if the following were true, then I might be able to get 80-90% generation of a DV model from forward engineering:

    1. I have only one source system
    2. Every table has a declared and enforced primary key (PK)
    3. If the PK is a surrogate key, the table also has a declared and enforced unique key (UK) that represents the natural or business key. (This could be a composite key using multiple columns in the table)
    4. All relationships are enforced by foreign keys (FK).
    5. The forward engineering tool has a preference setting to let me state whether to use the PK or UK for the business key. (Better still by default if a table has a PK and UK it assumes the UK is the business key).
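    Rule 5 can be sketched as code – the shape of the table metadata here is hypothetical, but the decision rule is the one described above:

```python
# Prefer a declared unique key (natural/business key) over a surrogate
# primary key when choosing the business key. Metadata shape is invented.
def choose_business_key(table_meta, prefer_uk=True):
    pk = table_meta.get("primary_key", [])
    uk = table_meta.get("unique_key", [])
    if prefer_uk and uk:
        return uk              # the natural/business key declared as a UK
    return pk                  # fall back to the PK (possibly a surrogate)

order_table = {
    "primary_key": ["order_id"],                 # surrogate key
    "unique_key": ["order_number", "company"],   # composite natural key
}
print(choose_business_key(order_table))          # -> ['order_number', 'company']
```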

    I know this is ideal, but it helps me to think in these terms to set the stage for a DV effort and evaluate from a baseline how much work there really is. (I can also use this list with clients to help them understand why I need time to analyze their current systems more thoroughly and not just start loading data.)

    Did I miss anything?

  5. dlinstedt 2011/04/28 at 9:38 am #

    Hi Kent,

    No, you didn’t miss anything. The biggest issue is coverage, and that’s what a sparsely populated Ontology provides. It only needs to fill in the gaps – to bridge the places where the technical data is either missing or incorrect.
    With an Ontology, you can extend your reach beyond a single source system, into one or more domains of knowledge that you can represent in the Ontology. With an ontology, you no longer have to require PK/FK enforcement. Business keys, unique fields (fields that should be unique), and possible candidates for generation and consolidation are well within the ability of an Ontology to define.

    This is what will make the next generation of RapidACE so much more powerful than anything on the market today.

    Cheers mate,
    Dan Linstedt

  6. Marco Schreuder 2011/04/28 at 5:59 pm #

    Hi Dan,

    After reading Kent’s comment, I wondered about the scope of the required up-front analysis.
    Do you have to identify all business keys in every system before you can start a first increment?

  7. dlinstedt 2011/04/28 at 6:53 pm #

    Hi Marco,

    I suppose I should qualify and set the scope of my comments. When I think, write, and speak about data warehousing, I generally consider the enterprise view – soup to nuts – where a company generally has anywhere from 3 to 20 disparate source systems they wish to integrate into a single warehouse model. I am usually also thinking about projects between 3 and 9 months in length, with 3 to 5 people involved full-time. Therefore, my thoughts and my articles are generally directed at how to handle complex solutions for organizations that face huge challenges. I figure if those can be solved, then solving the simpler cases is that much easier (most of the time, anyway).

    However, to answer your question:
    No. This is the beauty of the design / architecture of the Data Vault Model. You CAN scope it way down, because you can incrementally build it up as you go along.

    I’ve always stated (and hold the belief) that a true Data Vault can be as small as 1 Hub and 1 Satellite. Meaning you’ve only identified a single business key, but even this holds value when integrated across lines of business.
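    That minimal one-Hub, one-Satellite Data Vault can be sketched as tables – here built in SQLite via Python. The column names and the deliberately simplified structure are illustrative, not a prescribed standard:

```python
import sqlite3

# The smallest possible Data Vault: one Hub (the business key) and
# one Satellite (its descriptive, historized attributes).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_hkey   INTEGER PRIMARY KEY,   -- surrogate key
    customer_number TEXT NOT NULL UNIQUE,  -- the one identified business key
    load_date       TEXT NOT NULL,
    record_source   TEXT NOT NULL
);
CREATE TABLE sat_customer (
    customer_hkey   INTEGER NOT NULL REFERENCES hub_customer(customer_hkey),
    load_date       TEXT NOT NULL,
    name            TEXT,                  -- descriptive attributes live here
    record_source   TEXT NOT NULL,
    PRIMARY KEY (customer_hkey, load_date)
);
""")
con.execute("INSERT INTO hub_customer VALUES (1, 'C001', '2011-04-28', 'crm')")
con.execute("INSERT INTO sat_customer VALUES (1, '2011-04-28', 'Acme', 'crm')")
row = con.execute("""SELECT h.customer_number, s.name
                     FROM hub_customer h
                     JOIN sat_customer s USING (customer_hkey)""").fetchone()
print(row)  # -> ('C001', 'Acme')
```

    New Hubs, Links, and Satellites can be added alongside these two tables later without touching them, which is the incremental build-up described above.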

    The flip side of this is: if you only have 3 to 5 tables in your Data Vault, then you probably can “code the loads by hand” as fast or faster than setting up the metadata for a code-generation system. Only when your Data Vault model has grown to 15 to 20 tables, then 30 and upwards, do you begin to need a solution that helps you a) generate ETL/ELT code (due to changes coming in, or new systems being integrated) and b) potentially consolidate disparate source system models for you.

    So in regards to business keys: Always choose the appropriate scope of the project first, identify the keys second, then fill in the information in the Data Vault third, then build the cross-reference and follow with the ETL.

    Ultimately, remember this:
    a) the Data Vault is flexible for a reason, the design can and should change as you learn new things
    b) adding new Business Keys later is a no-brainer, of little to no consequence/impact to the existing historical data, the existing model, and the existing ETL.

    So: start small, build up….

    It’s only when you have “big problems with big models” to tackle up-front, or in a short period of time, that you might need a solution like RapidACE to assist you.

    Dan L
