(please feel free to comment on them by replying to this post).
the new pieces are marked in red as usual.
operational data vaults carry slightly modified rule sets from these; a new specification for the operational data vault will be started shortly. what is an operational data vault? read the postings to find out.
section 1.0 – entity types
1.1 hub entity = unique list of business keys
1.2 link entity = list of relationships between business keys (composite keys), a link is also known as: transaction, granularity change, hierarchical relationship, recursive relationship, aggregate store.
1.3 satellites = descriptive information (data that changes over time).
1.4 stand-alone = tables like calendar/time or code/description. any table that is used in 60% or 80% of the model, where the business user has stated: i do not wish to store “history” on this section of data, and the keys are proliferated throughout the model. hubs, links, and satellites should never stand alone. a stand-alone table may also be an intermediate join table (very much like a materialized view) existing for performance reasons. if the table is a “reporting table” (denormalized) then it belongs in a “report collection” area (a different data mart/data distribution database). there is now such a thing as a stand-alone hub/satellite for the history of codes and descriptions; it is used where denormalizing the data into other satellites causes too much data explosion, so lookups on the way out are a good thing to do. stand-alone tables are also known as cross-reference or lookup tables. they may or may not contain history; if they contain history, they are to be modeled in their own hub/link/sat structures.
section 2.0 – common field elements
common field elements are system driven, and system managed – they do *not* fall under the scrutiny of an audit. they are generated fields on the way in to the target (stage, data vault, or star schema) and are necessary for assisting in the traceability of individual fields, but in and of themselves cannot be audited.
2.1 sequence id (required) – if surrogate keys are used, is the primary key of all tables
2.2 load date time stamp (required) – an attribute in the hub, and link – part of the primary key in a satellite. this is the date stamp of the date/time that the data was loaded into the database. this is stamped this way for consistency of information across the database. ** in a real time solution, or in a solution where the data is coming in from a cdc component (that stamps it with a time of change), it may be replaced with the extract date or date of change, as long as that date is mechanical and comes from a trusted process **
2.3 record source (required) – this is the source system that generated the information, it is mechanically stamped when the information is loaded to the database. used when there is no meta-data project in place to trace information back to the source. it provides traceability of every record at a granular level back to the source system. while optional, it is implemented by 98% of the customers today.
2.4 update user (optional) – this is there to track dba level modifications to the data. it is optional, and should be in another metrics tracking area.
2.5 update timestamp (optional). this is another dba tracking field. it also is optional and should be in another metrics tracking area.
2.6 last seen date (optional) – allows current tracking on hubs and links of the last time the key was seen on the source feed. this is a system generated, system defined date time stamp. since this is a data warehouse system generated field, controlled for a system view of the data, it is eligible for updating in place.
2.7 load end-dates (required) – this is the best practice today. represents the data warehouse system stamping of the life of the record in a satellite. since it is systematic, and it is maintained by the system for query purposes – it is not eligible for direct updating. load end-dates are now required in order to avert a historical data problem in a satellite that does not appear until load end dates are visible and in use.
2.8 extract date (optional) – this has proven to be beneficial if included in the model. there are times when knowing the extract date helps. however, only in real-time systems does the extract date actually become the load date. in batch oriented loads, the extract date is attached as an extra information field (metadata).
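as a rough illustration of the fields in section 2.0, the sketch below stamps the system-managed fields onto an incoming row on its way into the target. the function name `stamp_common_fields`, the column names (borrowed loosely from the suffixes in section 6.2), and the sample source name `crm` are assumptions made for this example, not part of the specification.

```python
from datetime import datetime, timezone

def stamp_common_fields(row, record_source, load_dts=None, extract_dts=None):
    """return a copy of `row` with the system-managed fields attached."""
    stamped = dict(row)
    # 2.2: one mechanical load date time stamp, normally shared by the whole load
    stamped["load_dts"] = load_dts or datetime.now(timezone.utc)
    # 2.3: record source is stamped mechanically on the way in
    stamped["rec_src"] = record_source
    # 2.8: in batch loads the extract date rides along as extra metadata
    if extract_dts is not None:
        stamped["extract_dts"] = extract_dts
    return stamped

# hypothetical batch: every row in the load gets the same load date time stamp
batch_dts = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
stamped_row = stamp_common_fields({"customer_num": "C-100"}, "crm", load_dts=batch_dts)
```

passing one `load_dts` for the whole batch, rather than calling the clock per row, keeps the stamp consistent across the database as 2.2 asks.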
section 3.0 hub rules
3.0 definition of a hub: a list of uniquely identified business keys that have a very low propensity to change.
3.1 a hub must have at least 1 business key.
3.2 a hub cannot contain a composite set of business keys. ** exception below
3.2.1 a hub should have at least one satellite; hubs without satellites usually indicate “bad source data”, or poorly defined source data, or business keys that are missing valuable metadata.
3.2.2 a hub key can be composite when: two of the same operational systems are using the same keys to mean different things and these keys collide when integrated back together again. please be aware: bad data causes breaks in these rules – these are guiding principles. exceptions to this rule should not happen (but do), also be aware, bad architecture in source systems causes breaks in these rules too.
3.3 a hub’s business key must stand alone in the environment – either be a system created key, or a true business key that is the single basis for “finding” information in the source system. a true business key is often referred to as a natural key.
3.4 a hub can contain a surrogate sequence key (if the database doesn’t work well with natural keys).
3.5 a hub’s load-date-time stamp or observation start date must be an attribute in the hub, and not a part of the hub’s primary key structure.
3.6 a hub’s primary key cannot contain a record source.
3.7 a hub may contain a last-seen-date if desired grain of tracking is needed.
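the hub rules above can be sketched as a table definition. this is a minimal sketch using python's built-in `sqlite3` module; the table `hub_customer`, its columns, and the loader `load_hub_key` are hypothetical names chosen for illustration, and sqlite stands in for whatever database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table hub_customer (
        customer_sqn  integer primary key,   -- 3.4: optional surrogate sequence key
        customer_num  text not null unique,  -- 3.1: the single business key
        load_dts      text not null,         -- 3.5: an attribute, not part of the key
        rec_src       text not null,         -- 3.6: kept out of the primary key
        last_seen_dts text                   -- 3.7: optional last seen date
    )
""")

def load_hub_key(conn, customer_num, load_dts, rec_src):
    """insert the business key only if the hub has not seen it before."""
    conn.execute(
        "insert or ignore into hub_customer (customer_num, load_dts, rec_src) "
        "values (?, ?, ?)",
        (customer_num, load_dts, rec_src),
    )

load_hub_key(conn, "C-100", "2024-01-01T06:00:00", "crm")
load_hub_key(conn, "C-100", "2024-01-02T06:00:00", "billing")  # skipped: key exists
hub_rows = conn.execute("select customer_num, rec_src from hub_customer").fetchall()
```

the unique constraint on the business key is what keeps the hub a unique list of keys; the second load of the same key leaves the original load date and record source untouched.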
section 4.0 link rules
4.0 definition of a link:
a ) a list of uniquely identified composite relationships between hub keys – must have 2 or more hubs or link keys combined.
b ) a hierarchical representation of a relationship or aggregation across a single hub’s key, migrated in exactly two times. any further hierarchy is broken down into two migrations; this way no limitation is placed upon the hierarchy, and the link is not playing a role-game, which is dangerous. also, a hierarchical link must contain at least one satellite to indicate effectivity of the relationship (start and end dating of the hierarchical relationship).
4.1 a link must contain at least two imported hub or link primary keys
4.2 a link can contain two keys imported from the same hub for a hierarchical relationship, or rolled up relationship.
4.3 a link’s load-date-time stamp or observation start date must be an attribute in the link, and not a part of the link’s primary key structure.
4.4 a link’s composite key must be unique (a unique business key).
4.5 a link may contain a surrogate sequence key (if the composite is too large, or the database doesn’t work well with natural keys).
4.6 a link may contain 2 or more hub keys.
4.7 a link’s granularity is determined by the number of imported hub or link parent keys.
4.8 a link is a transaction, or a hierarchy, or a relationship.
4.9 a link may have zero or more satellites attached. except a hierarchical link as denoted above.
4.10 a link must be at the lowest level of granularity for tracking purposes.
4.11 a link must represent at most, 1 instance of a relationship component at any given time.
4.12 a link may have a last seen date for tracking purposes if desired.
4.13 in a hierarchical link, the child key is the primary driver for the relationship. this is the only instance in which a role-playing (half or part of the relationship) key is utilized. the child key will determine which effectivity satellite record to end-date. this is a defined and repeatable rule/pattern, and for hierarchical relationships is necessary. however, this rule does not hold for any other type of link, because it _is_ a role-playing rule based on one side of the composite key.
4.14 a same-as link takes the same form as a hierarchical link, but provides different context for usage, in that it allows differently named business keys to be “merged together” to a single master key – ie: this key is really the same-as this other key.
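a minimal sketch of the link rules, again using python's built-in `sqlite3` module. it shows one ordinary link across two hubs and one hierarchical link that imports the same hub key exactly twice. all table and column names (`lnk_customer_product`, `hlnk_customer`, and so on) are assumptions for illustration; the effectivity satellite that a hierarchical link requires is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table hub_customer (customer_sqn integer primary key, customer_num text unique);
    create table hub_product  (product_sqn  integer primary key, product_num  text unique);

    -- 4.1 / 4.4 / 4.5: two imported hub keys, unique as a composite, optional surrogate
    create table lnk_customer_product (
        customer_product_sqn integer primary key,
        customer_sqn integer not null references hub_customer (customer_sqn),
        product_sqn  integer not null references hub_product (product_sqn),
        load_dts     text not null,   -- 4.3: attribute only, not part of the key
        rec_src      text not null,
        unique (customer_sqn, product_sqn)
    );

    -- 4.2: the same hub key migrated in exactly two times (parent and child)
    create table hlnk_customer (
        hier_customer_sqn   integer primary key,
        parent_customer_sqn integer not null references hub_customer (customer_sqn),
        child_customer_sqn  integer not null references hub_customer (customer_sqn),
        load_dts text not null,
        rec_src  text not null,
        unique (parent_customer_sqn, child_customer_sqn)
    );

    insert into hub_customer values (1, 'C-100'), (2, 'C-200');
    insert into hub_product values (10, 'P-1');
    insert into lnk_customer_product (customer_sqn, product_sqn, load_dts, rec_src)
        values (1, 10, '2024-01-01T06:00:00', 'orders');
    insert into hlnk_customer (parent_customer_sqn, child_customer_sqn, load_dts, rec_src)
        values (1, 2, '2024-01-01T06:00:00', 'crm');
""")
```

because the load date sits outside the unique composite, re-loading the same relationship on a later date is rejected by the constraint instead of creating a second link row.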
section 5.0 – satellite rules
5.0 definition of a satellite: any data with a propensity to change, any descriptive data about a business key – the data in the satellite must be separated by type (grouping) and rates of change (removal of redundancy).
5.1 a satellite must have at least one hub or link primary key imported.
5.2 a satellite cannot be attached to more than one hub – if it needs a composite key, then it must be attached to a link entity.
5.3 a satellite must have a load-date-time stamp (observation start date) as a part of its primary key.
5.4 a satellite may contain a sequence identifier or an ordering identifier as part of the primary key for uniqueness.
5.5 a satellite must contain a record source attribute for data traceability.
5.6 a satellite must contain at least one descriptive element about the hub or link to which it’s attached in order to be valid.
5.7 a satellite may contain system generated or aggregated attributes.
5.8 a satellite’s purpose is to store data over time.
5.9 a satellite may contain a code to a stand-alone code/description table; however, if the code is tracked for history purposes, the code must be linked through to the hub on a link table. foreign keys are what is being referenced here. fk’s to reference tables are allowed, and fk’s to a reference structure which is a single hub with satellite (code / description history) are allowed. indirect references to a date time calendar table, or geography, are acceptable. these fk’s are not to be represented within the data model. if the data architect wishes to represent these, then a link table is required – but there can be no links associated to a satellite structure, as this will break the architecture.
5.10 a satellite must have a load-end-date for efficient sql queries. this is considered best practice for 99% of the rdbms engines, as they do not yet handle time inherently on the query side.
5.11 a satellite may be split or merged at any time, as long as no historical value is lost, and no historical audit trail is lost.
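the satellite rules can be sketched the same way. in this sketch (python's built-in `sqlite3` module, with hypothetical table and column names) the load date time stamp is part of the primary key, so each change to the descriptive data becomes a new row under the same hub key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table hub_customer (customer_sqn integer primary key, customer_num text unique);

    create table sat_customer_name (
        customer_sqn  integer not null references hub_customer (customer_sqn),  -- 5.1
        load_dts      text not null,   -- 5.3: part of the primary key
        load_end_dts  text,            -- 5.10: null while the row is current
        rec_src       text not null,   -- 5.5
        customer_name text,            -- 5.6: the descriptive element
        primary key (customer_sqn, load_dts)
    );

    insert into hub_customer values (1, 'C-100');
""")

# two versions of the same customer's name, kept side by side over time (5.8)
conn.executemany(
    "insert into sat_customer_name values (?, ?, ?, ?, ?)",
    [
        (1, "2024-01-01T06:00:00", "2024-02-01T06:00:00", "crm", "acme ltd"),
        (1, "2024-02-01T06:00:00", None, "crm", "acme limited"),
    ],
)
history = conn.execute(
    "select load_dts, customer_name from sat_customer_name "
    "where customer_sqn = 1 order by load_dts"
).fetchall()
```

storing the timestamps as iso-8601 text is a choice made for the sketch so that ordering by `load_dts` works as plain string ordering in sqlite.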
section 6.0 – naming conventions
6.0 naming conventions are enforced in order to meet the needs of very large data models. without naming conventions, the models will get out of hand and become unmanageable. there are field naming conventions required for the fields in the data vault – there’s a different section for suggested naming conventions for generic data vault models. the wizards here will work with the required naming conventions (suggested will be picked up and used if available).
naming conventions are required to help handle large models. as long as the naming conventions for each component are labeled and followed, the model will be compliant. what the tables are named (prefix or suffix) won’t matter – as long as they follow a standard and documented naming convention.
the naming conventions below are suggestions; the elements below require a specific standard naming convention.
6.1 entity naming conventions
6.1.1 hubs – either prefix with hub_ or suffix with _hub or the letter “h”
6.1.2 links – either prefix with lnk_ or suffix with _lnk or the letter “l”
6.1.3 satellites – either prefix with sat_ or suffix with _sat or the letter “s”
6.1.4 hierarchical links – prefix or suffix with hlnk or hier or hl. please note: hierarchical link is a form of a link with specific rules (see above), it is not a true entity class of its own.
6.1.5 same-as links – prefix or suffix with slnk, or sal, or sa. please note: same-as link is a form of a link with specific rules (see above), it is not a true entity class of its own.
6.2 field naming conventions
6.2.1 record source – rec_src or record_source or prefix/suffix with rcsrc or rsrc
6.2.2 sequence id’s – seq_id or sequence_id or prefix or suffix with sqn
6.2.3 date time stamps – prefix or suffix dts
6.2.4 date stamps – prefix or suffix with dt
6.2.5 time stamps – prefix or suffix with tm
6.2.6 load date time stamps – prefix or suffix with lddts
6.2.7 user (dbo/trigger) watch fields – prefix or suffix with usr
6.2.8 occurrence number – prefix or suffix with ocnum
6.2.9 end date time stamps – prefix or suffix with ledts
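one way to keep a large model honest about its naming convention is a small check that classifies each table by its prefix or suffix. the sketch below covers only the multi-letter markers from section 6.1 and assumes names are split on underscores; the single-letter “h”, “l”, “s” variants are left out because they are too ambiguous for a simple check. the function name `classify_entity` and the marker table are hypothetical.

```python
# markers taken from section 6.1; the single-letter variants are deliberately omitted
ENTITY_MARKERS = {
    "hub": ("hub",),
    "link": ("lnk",),
    "satellite": ("sat",),
    "hierarchical link": ("hlnk", "hier", "hl"),
    "same-as link": ("slnk", "sal", "sa"),
}

def classify_entity(table_name):
    """return the entity type implied by the name's prefix or suffix, or None."""
    parts = table_name.lower().split("_")
    for entity_type, markers in ENTITY_MARKERS.items():
        if parts[0] in markers or parts[-1] in markers:
            return entity_type
    return None

# tables that carry no recognised marker come back as None
unnamed = [name for name in ("hub_customer", "customer_sat", "orders")
           if classify_entity(name) is None]
```

running a check like this over the catalog gives a quick list of tables that fall outside the documented convention.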
section 7.0 end dating styles
all styles may choose to use point in time satellites – can be used for end-date indicators, or load-date indicators – basically providing a snapshot of freshness when the information needs to be rolled together. this is a good technique when feeding the tables in a near-real-time (eai) fashion.
7.2 style 2: you may use a load_end_date or observation end date as an attribute in your satellites. the time between the load date and load end date is the time span indicating the life of the data.
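style 2 can be sketched in a few lines. the sketch assumes, beyond what the rule itself states, that a superseded row's load end date is set to the load date of the row that replaces it, and that the current row's end date stays empty. the function names and the sample rows are hypothetical, and timestamps are iso-8601 strings so that plain string comparison orders them correctly.

```python
def end_date_rows(rows):
    """return the rows for one key, each stamped with a load_end_dts."""
    ordered = sorted(rows, key=lambda r: r["load_dts"])
    result = []
    # pair every row with its successor; the newest row has no successor
    for current, following in zip(ordered, ordered[1:] + [None]):
        stamped = dict(current)
        stamped["load_end_dts"] = following["load_dts"] if following else None
        result.append(stamped)
    return result

def row_as_of(stamped_rows, as_of_dts):
    """return the row whose life span covers as_of_dts, or None."""
    for row in stamped_rows:
        end = row["load_end_dts"]
        if row["load_dts"] <= as_of_dts and (end is None or as_of_dts < end):
            return row
    return None

satellite_rows = [
    {"load_dts": "2024-02-01T06:00:00", "customer_name": "acme limited"},
    {"load_dts": "2024-01-01T06:00:00", "customer_name": "acme ltd"},
]
stamped = end_date_rows(satellite_rows)
mid_january = row_as_of(stamped, "2024-01-15T00:00:00")
```

with the end date stored on the row, a point-in-time lookup becomes a simple range test instead of a comparison against the next row's load date.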
section 8.0 avoiding outer joins
in all these styles the queries can get complex if outer-joins are required. to simplify, there are two styles of avoiding outer joins. the preference of most users is style 1.
8.1 style 1: insert an empty satellite record (null’s or default values for everything except the primary key, and record source) for every new hub key (if the satellite data is not available during that load window). this allows the queries to equi-join and avoid outer joins.
8.2 style 2: insert one empty satellite record with a pk surrogate key of zero. this requires some tricky logic in the query to join to because the keys no longer match, but provides a single “empty” satellite structure rather than replicating empty records for every hub key. this is not a preferred method.
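style 1 can be sketched as follows, using python's built-in `sqlite3` module. the loader adds an empty satellite row for every hub key that has no satellite row yet, after which a plain equi-join returns every hub key. the table names, the function `insert_empty_satellite_rows`, and the record source value `system` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table hub_customer (customer_sqn integer primary key, customer_num text unique);
    create table sat_customer_name (
        customer_sqn  integer not null,
        load_dts      text not null,
        rec_src       text not null,
        customer_name text,
        primary key (customer_sqn, load_dts)
    );
    insert into hub_customer values (1, 'C-100'), (2, 'C-200');
    -- only the first customer has satellite data in this load window
    insert into sat_customer_name values (1, '2024-01-01T06:00:00', 'crm', 'acme ltd');
""")

def insert_empty_satellite_rows(conn, load_dts, rec_src):
    """add a null-valued satellite row for each hub key that has none."""
    conn.execute(
        """
        insert into sat_customer_name (customer_sqn, load_dts, rec_src, customer_name)
        select h.customer_sqn, ?, ?, null
        from hub_customer h
        where not exists (
            select 1 from sat_customer_name s where s.customer_sqn = h.customer_sqn
        )
        """,
        (load_dts, rec_src),
    )

insert_empty_satellite_rows(conn, "2024-01-01T06:00:00", "system")

# an inner (equi) join now returns both hub keys, with no outer join needed
joined = conn.execute("""
    select h.customer_num, s.customer_name
    from hub_customer h
    join sat_customer_name s on s.customer_sqn = h.customer_sqn
    order by h.customer_num
""").fetchall()
```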
deprecated rules – do not use
style 1: no end-dates. the time between consecutively keyed records’ load date time stamps is the time span for the life of the row. there is a problem with the satellites that doesn’t show until style 2 is utilized.
7.3 style 3: occurrence number in the primary key of a satellite. always keeping the occurrence number of the current record equal to zero. older records are numbered accordingly (most recent=1, further back=2, etc.)
***this requires updates to the satellites after loading, so that older rows get re-ordered. this notion is not typically utilized, as it causes severe hemorrhaging at high volumes of data, or at near-zero latency of data. in fact, this rule is now phased out completely.
a link may not contain the same hub key more than once, unless used as a hierarchical definition. it may contain the same hub key twice if role based pk’s are setup (for instance, shipper_id is hub_customer_id and customer_id is also hub_customer_id)
*** this rule is wrong. role based keys cause problems with historical tracking or end-dating of satellites on the links. denormalizing the same key multiple times in a single link causes too many problems. please extract the links into multiple granularities.
** do not use going forward