Data Vault Loading Specification v1.2

1.0 goals and objectives
the following elements are part of the goals and objectives for loading the data vault. the data warehouse should truly represent the source systems in their original formats, and the only thing that should be done to the data is basic defaults, and error reporting.

1.1 consistent process architecture
each process must be consistent. loading every hub, link, and satellite should be exactly the same process template. there are 3 process templates today, and each can be represented in sql as the “t” in elt, or represented as the mapping design in etl. the eai process has the same process set as well. the queries must also follow a similar process template. getting data in and out should be consistent formats – to make the data vault fully maintainable, and provide the possibility for automating the sql build-out. (see the data vault wizards on this site for further information). a process should only break the template when the end-user has signed off on a special requirement, allowing documentation to be created. (see 1.2 below).

1.2 restartable process without requiring “change” to the process itself.
the processes that are built should contain cdc components so they can recognize what they’ve already loaded (should the process break in the middle). the fully restartable process means: when it does break, the only change is re-loading or re-setting the source/stage data set. the process itself does not change, and is simply restarted to reload the new data. you may (with sign-off) actually delete data from the data vault due to a processing error, but this should occur only in the alpha and beta release phases of the warehouse project. the restartable and automation requirements go hand-in-hand with six-sigma and s.e.i. compliance.

1.3 99.999999% up-time of the process – failing only on hard-errors.
a process should not fail because of data errors. a process should handle all possibilities of data issues by default values and inserts to error reporting/error logging tables. a process should only fail when: it runs out of disk space, it dead-locks, the network goes down, the server goes down, the process has no privelages, the passwords are changed, etc. the end-user owns all (100%) of the data set. (see spec 1.9). we should only be getting up at 2:00am to handle a busted process, not to handle data breaks or soft-errors. those should be architected out of the processing layers.

1.4 sei level 5 compliant process, with documented metrics and repeatability.

each process is rigidly defined by templates to contain restartability, and some layer of fault-tolerance. each process should be tracked in run-time (start to finish), row movement, run-cycle inclusion, and number of rows inserted, delta processed, and number of data errors. to be sei level 5 compliant, these metrics must be available on a run-by-run basis, and each of the processes should be fully restartable and fault-tolerant, as well as re-generated from an automated mechanism based on changes.

1.5 sarbanes-oxley and basil ii compliant processing.
based on government requirements, sox, and basil ii, our processes must now track – what data changed, when it was changed, what it was before the change, and what it changed to. this also means that no more manual fixes to the data sets at 2:00am are allowed by the i.t. team just to get the data loaded. this means, all processes must be fault-tolerant, completely restartable, and the data (if changed) must be traceable. it also means, that any data in the warehouse can be audited by (your) government or international accounting agencies looking for fraud. this means, that even if we “default” values going into the warehouse, we must track the “before changed” value somewhere in order to be compliant. it is also suggested that we track dual-date entry on the data, (load date and extract date and modified and applied date if possible). this allows aggregations to be created that roll by modified date (to see all the adjustments), or by applied date (to see totals for a specific month), and have both calculations of information pulled from the warehouse.

1.6 maintain full traceability of data
as discussed above, sox and basil ii,, basil iii along with government restrictions now require full traceability of the data. when it changed, what it changed from/to, and whom changed it.

1.7 data driven design
(see fully restartable designation above). this means that the process template does not change, that the loading mechanisms are fault-tolerant and handle all and any kind of data that hit it, that none of the data “drops out” of the load to the data warehouse (data vault). all data makes it in to the data vault no matter how bad it is. yes – it may be recorded in an error mart, but it still makes it to the warehouse. (see the 100% rule below). it also means that changing things like the load-date are data driven, and that the queries / views that pull data from the warehouse are also user-date driven without re-coding.

1.8 real-time provisions
all data vault loading templates should be applicable to eai or real-time template use as well. the real-time load stream should handle data as it arrives in a mini-batch, or burst. it should be grouped as a burst with a burst-datetime. it should be checked (delta’d) against what exists, and defaults values should be applied. errors in the data set should be recorded, and the data should be loaded. obviously, once a table is being loaded in real-time, it cannot be loaded by batch at the same time, (due to db restrictions and deadlocks). this means if batch has to load the table too, then real-time needs to re-direct it’s load temporarily to a staging table (even though the data may decay). the only tables that may run into contention is the hub and the link. the satellites should be unique to batch or real-time, and the data should be separated by rate-of-change (see the data vault specification).

1.9 100% of the data loaded to the data vault 100% of the time.
finally, all data arriving at the warehouse must be loaded into the data vault all the time. just because data formats or values are bad, doesn’t mean it should fall-out of the process to an error flat file. also doesn’t mean it should not be loaded to the warehouse. processing errors upstream, source system data errors – all must be recorded in the warehouse as to the fact it happened, and this is what the data vault saw. we are in the business of recording everything as a fact – establishing a system-of-record is of utmost importance. in order to get the data in, apply default values (to correct data type mismatches), and zero keys (to correct null key values), and record which records in what tables were changed (in an error logging mechanism).

1.10 large data scalability.
the loading processes should be linearly scalable in volume and timing. they should be able to scale into the terabyte/pedabyte and beyond levels. because of the architectural rules and the load rules about grain, we have effectively removed all the barriers (dependencies) normally seen in the data load cycle on other loads, and other data or even transformed data. because of this, the loads should be designed to be highly efficient and highly restartable at both large volumes and low arrival latencies.

2.0 batch – or strategic pre-requisits
the pre-requisits for batch loading are: using a set of staging tables, load data (truncate and re-load) to refresh, do *not* keep history in these staging tables, if you do, then your staging area is acting as a data warehouse (be warned of this danger). in batch loads, all processes must be consistent, and repeatable – highly parallel and scalable. the objective is “basic” data manipulation, such as: datatype cleanliness, and other issues (listed below). make use of partitioning (as needed by rdbms platform).

2.1 batch – staging definitions
the staging definitions should populate required fields upon load. the definitions should follow the rules below in order to sufficiently populate the data vault. it is possible to have multiple lavels of staging tables to assist with cross-table integration, and additional data set heuristics/statistical analysis.

it is suggested that these multiple layers assist with the integration of data, and not with the alteration (unless the data vault will be storing both the before and after pictures).

please be aware, with the advent of 64 bit systems – staging may actually be done in ram. also – if the source systems have cdc installed upstream, staging may not be necessary at all, it may be possible to load directly to the data vault structures. less storage requirements = less maintenance and less room for error.

2.1.1 required fields (structure):
these fields should be populated on the way in to the staging tables, keep transformation to a minimum during load of the stages in order to make the loads as parallel and independent (from other stage loads) as possible.
the structure should be as close to the source system as possible, so that the data isn’t completely altered going into the data vault staging area.
a: staging sequence number
b: load date
c: record source
d: extract date (if one is available)
e: optional: delta flag/new row flag
f: optional: target table sequence number

2.1.2 goals for stage table loads
the following goals and objectives should be met when loading the staging tables:
a: parallel – all stage loads are completely independent of any other stage load, sometimes relying on “small” lookups to add descriptions for codes.
b: run when the data is ready. whenever the data is ready on the source, that’s when the process should run, it shouldn’t have to wait until other data is ready.
c: big network bandwidth, lots of memory, lots of disk space – there should be a lot of available resources for the parallel batch loads so they can utilize as much of the machine as possible (loading data as quick as possible).

2.1.3 datatype changes going into staging tables
the following data type changes are suggested when loading the staging area:
a: character/string to datetime
b: character/string to numeric
c: split of over-loaded fields (if rules permit) into respective parts. there are cases where no amount of rules can be established to parse the field content, in that case, the fields should be loaded as-is.
d: character to var-character (varchar/varchar2) – trim all spaces (leading/trailing).
e: optional: upper case all strings (it is suggested that either upper or lower-case) be signed off on by the user – this makes processing much fasster when going into the satellites, and comparing keys in hubs. otherwise, the upper function must be executed across all satellite data when running compares (unless the user cares about tracking case-changes).

if charcaters cannot be converted to date/time then it’s recommended that they be defaulted, preferrably to 1970 (this way, no future-dating problems occur). it is required to get sign-off on any data that is to be changed/altered/defaulted.

if characters cannot be converted to numeric, it is recommended that they be defaulted to null. in this way, the numerics will not affect sums, counts, averages, and other mathematical algorithms. again, sign-off is required for default values.

the best practice is: if the disk space is available, store both the original value in character mode (prior to conversion), and store the new value in the “domain aligned field”. if the disk space is not available, then it is recommended to backup the staging area post-load, and zip it, and name it by date/time – then save it away. domain conformity is usually expected due to the way rdbms handles values within queries. when the disk is not available to store both values and the data is being aligned to the domain (datatype) of the field, then please refer to the default values section for further information on how to resolve values outside the domains.

there are new devices appearing on the market which “claim” to load the data while profiling it. this is still under investigation as to how helpful this can be.

2.1.4 default values
the suggested default values are as follows: pick characters that are readable/recognizable as a non-value by the end-user. remember that null’s in the data vault can always be changed on the way into a data mart. always have the end-users/business sign off on any default values you choose, and be consistent with the values you use.
a: defaults for datetime: 1/1/1970 00:00:00.0000
b: defaults for numeric: null
c: defaults for character: (1) ? (2+) ? or (3+) ?
just suggestions here, you can use whatever values you want, so long as they are signed off on.
alternately you can choose to leave non-critical date times null. you can also choose to change date-times to utc across the board for consistency.

2.2 batch – hub loading
watch for duplicates across staging tables if you do this in one sql statement, otherwise as separate load processes, duplicates will be naturally eliminated. just make sure the case chosen is consistent (upper/lower, etc.)
a: gather a unique union of all distinct keys in the staging tables that are targeted to that hub. make sure the “master” record source selected system is run first, either process wise or in the union sql statement.
b: check to see if they exist in the target table
c: if they do exist, filter them from the load.
d: if they don’t exist, assign the next sequence number and insert.
during the hub load process, it is possible to store the “business key to sequence id matchup” in a staging table area. this staging table is a second target from the load process (primary or first target is the hub itself). these staging tables are truncated and re-loaded, they serve as a fast-join for inserted and updated rows when going to load the links.

2.3 batch – link loading
the goal of loading the links is to have no duplicates across the business key structures (sequence id’s). each representation of an intersection or relationship must be unique. links represent any and all relationships that the source systems bring in. unique sequences for the links are generated when a particular relationship is seen. if the above “second level staging” tables have been built to hold the keys from the newly updated/inserted hub rows, then these are used to speed up the link generation process.
a. gather the business keys
b. default the “null” business keys to an unknown value or “zero record” in the hub.
c. go get the hub sequence id for that individual business key.
d. repeat “c” for all hub keys.
e. check to see if this collection of sequences exist in the link already, if so – filter out of the feed.
f. if it doesn’t exist, generate a new sequence for the link and insert it.
it is also possible (just like the hubs second level stage) to insert these results to a second level link stage, this makes attaching satellite rows to link keys much easier.

link loading can take a different tack as of late. there are two new styles for link loading to save time and complexity of the etl / elt components.

the first is: embed copies of the hubs/links (parent) business keys in the link – duplication – but remember this only works if you don’t have “dueling source systems” with same key, different record sources. this works in very specific cases. this allows the load routine to bypass the parents altogether and go directly after the link table. a variant of this is to use a computed hash key based on the business key such as crc32 or crc64, or md5 as the physical key – this allows computational “location” of the link record based on the business value in-stream. however be careful, crc32 is unique 1 out of every 10m rows. md5 and crc64 are much more unique than that, but also twice as large to store.

the second, and more probable, more scalable is to “record” the hub keys that were just looked up, or inserted (based on the instream flow that builds the hubs) – insert the surrogate keys to business key matches in a truncated and re-loaded staging table. this provides a “reference table” that can be joined directly back to the incomming source on a 1 for 1 match, and in fact if the table looks like the link table as far as keys are concerned, can then become a candidate as the “source” for the link table itself. it also keeps the size of the data set down to matching the size of the incoming data, so the link processing (and link satellites) no longer have to ask the questions of spreading themselves across the hubs to get the surrogate key matches.

by far, the second routine is easier to maintain, and doesn’t change the base-architecture, remember: anytime the base architecture is altered, flexibility may be lost…

2.4 batch – satellite loading
the goal of satellite loading is two fold: 1) to only load delta changes, those rows that actually have a change (from the last most recent load), even if the most recent load broke in the middle, and this is a re-start. this load should pick up where it left off 2) split the work, by type of data and rate of change.

sometimes satellites require link pk’s (link sequence numbers), and unfortunately, the fine-grained links are all dependent on many different hub key joins. to overcome this problem, run nested selects against the hubs/links when pulling the data out, or use the pit table, or maintain a second level staging table which “bridges” natural keys to surrogate keys for all the data existing currently in the stage (it may be another sequence to pull and load the 2nd level stage, but it will save tremendous loads of time going forward).

the process for batch loading on a satellite looks like this:
a. gather the link or hub surrogate key for the dependent satellite
b. join to a “current load-date” table, and compare the “most recent” data in the staging area to the satellite (as of the date-driven/data driven timestamp).
c. select only those rows that have changed
d. place the new load-date into the pk of the satellite
e. track the rows to be inserted, to make it easy to end-date old rows in the next pass, as well as easy to update the pit and bridge tables.
f. insert the new satellite records.

3.0 real-time – or tactical pre-requisits
the pre-requisits are that
a: you have the proper technology to get data in, in rapid succession,
b: your data is available in near-real time – whatever the business defines to be the “right-time” feeds, be it 5 minute refreshes, or 5 millisecond refreshes.
c: the business has guaranteed the money to solve the business problem, and can justify the exponential cost increase to getting data fed at lower intervals of latency.
d: your tool to load is most likely not an etl tool set, it is either elt, or eii, or eai (messaging/queueing).

please keep in mind that batch etl tools (even though they sometimes claim they can) cannot do a good job (today – october, 2005) of real-time loading under 1 to 3 minute intervals, moving at least 10,000 rows per cycle. also keep in mind that moving 1 row per minute is not the same as moving 100,000 rows per minute, which is not the same as moving 100,000 rows per second or milli-second.
technical pre-requisits:
a: the data model must support whatever latency requirements defined by the business.
b: snapshot tables refreshed at “regular intervals” must be defined to govern the results of queries, if data must be examined below the regular interval then it must be delivered to the end-user via alerting and thresholding.
c: data arrival must not be impeeded by data retrieval (table locking, index locking, mart refreshes, snapshot table updates, etc).
d: throughput is vital for scalability, meaning the least amount of complexity going in to the vault as possible.
e: data “decays” or get’s old when it’s staged, or landed in a near-real time system. data is considered “dead” (from a tactical standpoint) to some degree if it’s decayed past the next point of refresh. from a strategic standpoint, it’s still very much alive.
f: true real-time data feeds usually feed a “live” data mining engine in parallel with feeding the data vault. the resulting knowledge is then pumped back into the operational systems for immediate feedback to the end-user. (both f & g are true at the same time).
g: true real-time/right-time systems use a combination of bit flags, and triggers to execute code. the data is “regulated” or applied to a series of pre-built thresholds to see if it breaks the “standards” – then it is promoted to mining to judge the impact.
h: data (usually) cannot be staged, it cannot be “setup” to wait or be dependant on another systems feeds in order to make it into the warehouse / data vault. see #e on data decay rates, dead data is useless.

3.1 rt – staging definitions (if used)
staging definitions and staging tables are usually not used within a real-time system, unless the refresh time limit is at or above 3 minutes. staging tables cause data to decay. the longer it “sits” in the staging buckets, the less applicable to the current situation it is. the current situation is the state of the current operational system. if staging tables are elected for loading, time-limits must be set before another automated routine kicks in and deletes “old” or decayed data that has landed.

in most situations this amounts to a truncate and re-load of the staging tables, a complete wipe of data. in some situations real-time data is allowed to decay for up to 2 days before it is deemed unusable, or dead. but just because it’s sitting in a staging table doesn’t mean it hasn’t been “graded” or put through alternate processing downstream. it may mean that it has been seen, reviewed, and considered a non-impact therefore it is dropped from the insertion stream.
this is very dangerous and doesn’t play well with compliance unless records are kept as to what was deleted, how long it decayed before it was deleted, and when it was deleted. at that point, the record should be able to be used to “reload” an image of that data if necessary.

3.2 rt – hub loading
hub loading – in a real-time system the business keys nearly always arrive with the transactions, otherwise the source systems wouldn’t know how to attach the data to it’s own business processes. if you have a system in real time that isn’t providing the business keys, that’s a cause for serious red flags in which the sourcing business process needs to be changed or altered before proceeding. in the real-time case, it’s exactly the same cycle as the batch case for hub-loading, insert the key if it doesn’t exist.

please note that architectures that handle millisecond feeds use a millisecond time-stamp hub as their key, and sometimes we multiply the hubs out to avoid contention across the feeds. the millisecond timestamps capture just a time-stamp, from which a “computed surrogate” based on the millisecond can be generated – which means no lookups for existence, and no lookups for finding the hub key when inserting the satellites. computational logic is executed to “create” the surrogate for the satellites. the linking logic in millisecond time-frames can only be done by “lazy processes” that kick up every second to score/grade and associate data by context decisions – usually a data mining algorithm of sorts.

3.3 rt – link loading
link loading in real-time is quite easy, if the batch load link option from above is utilized (where the new keys/matched surrogates for the incomming data) are kept in a stage level 2 table. in other words, as hubs match keys to business keys, the matches are built into a new staging surrogate table – this will reduce the time it takes to load links. remember, the link usually is the transaction, which forces a hub insert or several, and usually becomes the date-time stamped satellite off the link immediately.

millisecond feed transactions are kept differently, again with a millisecond hub stamp (or several), and are joined across links via a “lazy” process that runs once a second, or once every few seconds to build the associations – in this case all we’re concerned with is: how fast can we roll data in, and when the data is “dead”, where can we back it up to?
so you might ask the question: the same data architecture, cool – but why two different data loading architectures? well, if we don’t load the data differently, we can’t match the speed of arrival of information. sometimes the information arrives so quickly that we have to change the manner in which it’s recorded. in other words, introduce parallelism to the processes, and move some of the complexity out of the direct loading path. hence the multiple copies of the same hub, and hence the lazy process for linking things together downstream.

furthermore, in these situations – triggering based on mathematical formulas is quite common, which means the math algorithms compute specific weighting factors and measurements by which the program can immediately determine: is this “fact” out of synch with the rest? is it a cause for an alert? and then raise the level of the fact as a potential issue, from there a data mining engine can further analyze it across the board and either downgrade it again, or raise it up to the operational system. all of this (usually) happens within 5 minutes of the arrival of the information.

3.4 rt – satellite loading
loading satellites in real-time is a bit of a challenge, unless the “most current” data is found easily marked. the last thing you want technically is to have to surf through the entire satellite looking for the current row. this becomes a problem in some real-time systems. but remember: satellites are split by type of data and rate of change, which gives us the ability to actually create a “current only” satellite, and again move the data from current to historical with a “slightly lazy process” at most, the current satellite will have only x numbers of records (usually 2 or below) for any given key. again, this is if the system is loading records every second or less.

if the system is loading one second or more intervals, then the data must be indexed properly to detect the “current record” quickly, from there comparing columns and inserting deltas. keep in mind that the issue of comparing columns in the satellite disappears as the arrival latency gets shorter. in other words, there are some systems in which the data arrives too fast to spend time comparing columns to the previous picture. in these cases, the data is loaded directly to the satellite. a lazy “cleanup” process is then instituted to detect duplicates and remove them within the satellite. this process may run every 5 minutes or less, depending on the context and the impact to the resources.

finally, a lazy delete or roll-off / backup process is instituted which also helps keep the table short, and running smoothly.

keep in mind that not all data will need real-time, up-to-the second loads, 80% or more of the system will be consistently above 10 minute refreshes, usually only 5% or less need second by second or millisecond refreshes – this allows us to focus our architecture and design to answer specific needs in specific parts of the model. if you have a full system that requires 1 second or less refreshes, then you’ve got a need to purchase enough horsepower to run all the necessary processing to keep things in synch through parallelism.

3.5 real-time and system of record
there’s a new movement afoot, and the data vault is providing the mechanisms for building operational data warehouses (we’ve finally stated these as published facts here on the forum). any data that arrives in a real-time data stream must be treated as system of record within the data warehouse. why? because we don’t know which system (if any) upstream is actually storing the data itself as an operational system. normally sor is the operational system upstream, but in the case of real-time feeds – it flips to become a responsibility of the data vault.
real time data is fed directly in to the data vault. if a backup of the data is absolutely necessary, then make it incremental backup at a snapshot level while using point-in-time tables to query and retrieve it. the other option is to “copy it” to a rolling staging table that never truncates. backup the staging table using query logic, clear the table of the rows which have been backed up. the problem with this? the staging table is in constant flux, and if the data is arriving in millisecond timing rates – the staging table can get very large very quick. which means, that backup get’s further and further behind.

within subsecond loads in a real-time environment, it is best to funnel the transaction direct to the data vault – leave your backups within the framework of the data vault, but they should be at the disk block replication level. it’s the only way to keep up. this is a dw2.0(tm) specification for on-line, and near-line storage. it also has to do with measuring the temperature of the data sets.

4.0 tracking and tracability – system defined field updates
(load dates, load-end-dates, last seen dates, tracking satellites, record sources)

4.1 loading load dates: every record in a batch cycle must contain exactly the same load-date and time unless each record is tied by a load cycle id. the function is to “tie” the records together as a group, in case something goes wrong with that load, and isn’t discovered until later. this way, the loaded group can be backed out (if necessary) and re-applied – this maintains compliance of the load cycle.
in an rt environment, the load-dates require time stamp of the individual transaction as it’s loaded to the warehouse. there is no “mechanism” to recover a batch of information for a single period of time, unless the transactions are loaded to the data vault and a staging area at the same time. this would mean a regularly scheduled “clean-up” process to “decay/delete/roll off” old records in the staging area, and a back up process to go with it.

4.2 load-end dates: load end dates should be 1 second or 1 millisecond before the next active record. depending on the grain of the rdbms engine. this provides a temporal view against the data, and allows the data set to be “aged” appropriately, and queries to execute spatial-temporal analysis. (see my postings on spatio-temporal analysis, or the posts regarding temporal indexing). load end dates are not the same as an “end-date” item fed to the warehouse by a source system, please don’t confuse the two. load-end-dates are mechanised and computed by the loading paradigm.

4.3 last seen dates: last seen dates are another type of tracking mechanism, they are generated (again) by the load process, rt or batch. they can be defined two ways, either as the last time we saw the data for the satellite, or the last time we saw the hub key on the feed. context determines the meaning. if they are in a satellite with other attributes – they can be “updated” (in the currently active record) without breaking compliance. why? they are a system managed field, not user modifiable, therefore they are not auditable.
if they are tracking a hub, they can be in a hub table – it is preferrable however to track a history of hub key arrivals by putting this one attribute into it’s own satellite. so maybe, a satellite with: last_seen_hubkey, and last_seen_sat1..satn data (depending on the satellites you want to track).
this can also be a part of the “tracking satellite” defined below:

4.4 tracking satellite: the tracking satellite is a system manged,
system generated satellite. it contains fields that are generated during the load cycles, and can be updated (if no history is desired), or preferrably, inserted to track history. it tracks the history of arrival and updates to the other tables like hub, link, and satellites. it provides additional metadata about the processing cycles that the data set actually goes through. it must be attached to the hub or lnk key set in order to be effective.

4.5 record sources: i’ve defined record sources many times, the definition just keeps getting better. record sources are system generated/managed elements, they are metadata about where the data in that record came from. they contain the source system, source application, and sometimes even the source function (if that can be provided). the more granular or specific the record sources, the more we can learn about the operations in our source system, and if they are meeting the business requirements. alas, this data is not auditable either.
record sources may now be a code. in codifying record sources we can save space, and use a reference table to retrieve them. this further improves performance of the data vault by shrinking the row size. i would recommend an alpha-numeric of 4 characters, this allows many different combinations and high levels of customizations. place the code/description in a reference area of the data vault, i.e. in a reference table.

5.0 business rule implementations
(default values, acceptable transformations of data, tracability, compliance issues, aggregations, etc.)

5.1 default values

before applying default values, pick a set of values that will work, and have the end-users sign-off on an sla (service level agreement) – so that you have a record, and so that when the users see these values they can relate to what they see has been defaulted.

5.2 generalities and business rules.
there are hard and soft business rules. hard business rules are those that have a very low propensity to change over time. these rules are the “lines in the sand” that provide consistency of integration on the way in to the hubs themselves. these hard business rules are assumption based. that said, we are in fact applying a “base-color lense filter” on the way in to the data vault. why? because we have to start somewhere, and we have to drive the data together at very specific and defined points of interest.

these hard business rules are executed during the load cycle, but because they don’t change very often (again we’re talking hub integration here) they do not appear to impact the agility of i.t. very much. soft business rules on the other hand are all those with a much higher propensity to change. such as the interpretation of the data set. these will be discussed in the new standard we are defining for querying the data vault.

Tags: , , ,

7 Responses to “Data Vault Loading Specification v1.2”

  1. gaborg 2010/10/29 at 1:56 pm #

    Hi Dan,

    I have a question regarding satellite loading (section 2.4).

    Am I right that I’ll have to create an empty satellite record
    when there is a new HUB record but there are no connecting data in the satellite’s source?
    (based on Modeling Specification Section 8.1 (Avoiding Outer Joins))

    It seems that this is missing from the satellite loading steps.

    If I compare the current satellite to its most current source in the staging
    this “missing” row won’t show up as new or changed.

    I think that in this case I’ll have to do an additional step to insert all missing HUB_ID-s into the satellite without descriptive attributes, too.

    Is it correct or I missed something?



  2. dlinstedt 2010/11/01 at 6:56 pm #

    Hi Gabor,

    The standards section is a guideline. I try to publish best practice rules for people to follow. You are correct in one manner of speaking: If you want to avoid outer-joins, and you don’t want to introduce EXTRA tables, then yes, the only way is to “go back” and insert 1 empty Satellite row in each Satellite for every Hub key that doesn’t have any Sat rows. This only has to happen 1 time for each key.

    On the other hand, if you don’t mind adding a new table structure, you can read about Point-In-Time tables to help with this.

    You are correct, you have not missed anything. Otherwise, you are stuck with outer-joins, OR splitting the loading process across inserts and updates, which means EVERYTHING is caught perfectly.

    Dan L

  3. rob mol 2011/10/19 at 8:03 am #

    Hello Dan,
    We are wordking for a cliënt and discussing how to process the deletes from a batch with change data captures.
    Setting the load end date for a deleted record has the disadvantage that we do the same in the process of updating. So we cannot recognize from the record itself whether a record is deleted or an old version of an updated record.
    This is specially a burden when propagating the updates and deletes to the next layer (to the bDv and the datamarts).
    What is your advice?
    Greetz, Rob

  4. Mike 2019/02/16 at 3:22 pm #

    This is nice. Would be a lot more helpful if there were examples of certain cases though. Like 10 times more helpful. A non trivial difference.

  5. Jeff Cerasuolo 2019/02/22 at 5:59 am #

    Hi Dan,

    I have been a great fan of the Data Vault approach for years and my company have been introducing it to Tesco and The AA in the UK. I have a few blockers that companies tend to throw back at the DV, so wondering how you suggest to overcome these rejections. Also wanted to ask about the Basel, Switzerland, Accords as these are quite important in Europe, even though GDPR is the hot topic in Europe these days.


  1. Integration in the data vault : IN2BI - 2011/05/10

    […] Some acceptable minor changes and default values (article 2.1.3 and 2.1.4 of the Data Vault loading specification) […]

  2. Data Vault Modeling | DWH-Consult - 2016/09/30

    […] English: Data Vault Modeling specification (DanLinstedt) Data Vault Loading Specification v1.2 (DanLinstedt) […]

Leave a Reply