Facebook Twitter LinkedIn YouTube RSS
magnify
Home Data Vault DV Standards DV Modeling Specification v1.0.9
formats

DV Modeling Specification v1.0.9

Published on 2010/05/13 by in DV Standards
Here are the rules for specification Version 1.0.9 of Data Vault Modeling.
(Please feel free to comment on them by replying to this post).
The new pieces are marked in RED as usual.

***
OPERATIONAL DATA VAULTS CARRY SLIGHTLY MODIFIED RULE SETS FROM THESE, THERE WILL BE A NEW SPECIFICATION STARTED SHORTLY FOR THE OPERATIONAL DATA VAULT. What is an Operational Data Vault? Read the postings to find out.
***

Section 1.0 – Entity types
1.1 Hub Entity = UNIQUE List of business Keys
1.2 Link Entity = List of relationships between business keys (composite keys), A link is also known as: Transaction, Granularity Change, Hierarchical Relationship, Recursive Relationship, Aggregate Store.
1.3 Satellites = Descriptive Information (data that changes over time).
1.4 Stand-Alone = Tables like Calendar/Time or Code/Description. Any table that is used in 60% or 80% of the model, where the business user has stated: I do not wish to store “history” on this section of data, and the keys are proliferated throughout the model. Hubs, Links, and Satellites should never stand-alone. A stand-alone table may also be an intermediate join table (very much like a materialized view) existing for performance reasons. IF the table is a “reporting table” (denormalized) then it belongs in a “report collection” area – (a different data mart/data distribution data base). There is now such a thing as stand-alone hub/satellites for history of codes and descriptions, but where denormalization of the data into other satellites causes too much data explosion, so lookups on the way out are a good thing to do.  Also known as: cross-reference, or Lookup tables, they may or may not contain HISTORY – and if they contain history, they are to be modeled in their own Hub/Link/Sat structures.

Section 2.0 – Common field elements
Common field elements are SYSTEM DRIVEN, and SYSTEM Managed – they do *NOT* fall under the scrutiny of an audit.  They are generated fields on the way IN to the target (stage, data vault, or star schema) and are necessary for assisting in the traceability of individual fields, but in and of themselves cannot be audited.
2.1 Sequence ID (Required) - if surrogate keys are used, is the primary key of all tables
2.2 Load Date Time Stamp (Required) – An attribute in the Hub, and Link – part of the primary key in a Satellite. This is the date stamp of the date/time that the data was loaded into the database. This is stamped this way for consistency of information across the database. ** In a REAL TIME solution, or in a solution where the data is coming in from a CDC component (that stamps it with a time of change), it may be replacd with the extract date or date of change. As long as that date is MECHANICAL from a trusted process **
2.3 Record Source (Required) – This is the source system that generated the information, it is mechanically stamped when the information is loaded to the database. Used when there is no meta-data project in place to trace information back to the source. It is provides tracability of every record at a granular level back to the source system. While optional, it is implemented by 98% of the customers today.
2.4 Update User (optional) – This is there to track DBA level modifications to the data. It is optional, and should be in another metrics tracking area.
2.5 Update TimeStamp (optional). This is another DBA tracking field. It also is optional and should be in another metrics tracking area.
2.6 Last Seen Date (optional) – Allows CURRENT tracking on Hubs and Links of the last time the key was seen on the source feed. This is a system generated, system defined date time stamp. Since this is a data warehouse system generated field, controlled for a system view of the data, it is eligible for updating in place.
2.7 Load End-Dates (Required) – This is the best practice today. Represents the data warehouse system stamping of the life of the record in a satellite. SINCE it is systematic, and it is maintained by the system for query purposes – it is NOT eligable for direct updating. Load End-Dates are now required in order to avert a historical data problem in a satellite that DOES NOT APPEAR until load end dates are visible and in use.
2.8 Extract Date (Optional) – This has proven to be beneficial if included in the module. There are times at which knowing what the extract date is, helps. However only in REAL-TIME systems does the extract date actually become the Load Date. In Batch oriented loads, the extract date is attached as an extra information field (metadata).

Section 3.0 Hub Rules
3.0 DEFINITION OF A HUB: A list of uniquely identified business keys that have a very low propensity to change.
3.1 A Hub must have at least 1 business key.
3.2 A Hub cannot contain a composite set of business keys. ** exception below
3.2.1 A Hub SHOULD support at least one satellite to be in existence, Hubs without Satellites usually indicate “bad source data”, or poorly defined source data, or business keys that are missing valuable metadata
3.2.2 A Hub Key CAN be composite when: two of the same operational systems are using the same keys to mean different things AND these keys collide when integrated back together again. Please be aware: BAD DATA CAUSES BREAKS IN THESE RULES – THESE ARE GUIDING PRINCIPLES. Exceptions to this rule should not happen (but do), also be aware, bad architecture in source systems causes breaks in these rules too.
3.3 A Hub’s business key must stand-alone in the environment – either be a system created key, or a true business key that is the single basis for “finding” information in the source system. A True business key is often referred to as a NATURAL KEY
3.4 A Hub can contain a surrogate sequence key (if the database doesn’t work well with natural keys).
3.5 A Hub’s load-date-time stamp or observation start date must be an attribute in the hub, and not a part of the hub’s primary key structure.
3.6 A Hub’s PRIMARY KEY cannot contain a record source.
3.7 A Hub may contain a Last-Seen-Date if desired grain of tracking is needed.

Section 4.0 Link Rules
4.0 DEFINITION OF A LINK:
a ) A list of uniquely identified composite relationships between hub keys – must have 2 or more hubs or link keys combined.
b ) A HIERARCHICAL representation of a relationship or aggregation across a single hub’s key, migrated in exactly two times. Any further hierarchy is broken down into two migrations, this way no limitation is placed upon the hierarchy, and the Link is NOT playing a role-game which is dangerous. Also, a Hierarchical Link must contain at least one Satellite to indicate effectivity of the relationship (start and end dating of the hierarchical relationship.
4.1 A Link must contain at least two imported Hub or link primary keys
4.2 A Link can contain two keys imported from the same hub for a hierarchical relationship, or rolled up relationship.
4.3 A Link’s load-date-time stamp or observation start date must be an attribute in the link, and not a part of the links’ primary key structure.
4.4 A Links composite key must be unique (A unique business key).
4.5 A Link may contain a surrogate sequence key (if the composite is too large, or the database doesn’t work well with natural keys).
4.6 A Link may contain 2 or more hub keys.
4.7 A Links’ granularity is determined by the number of imported Hub or Link parent keys.
4.8 A Link is a transaction, or a hierarchy, or a relationship.
4.9 A Link may have zero or more Satellites attached. Except a Hierarchical link as denoted above.
4.10 A Link must be at the lowest level of granularity for tracking purposes.
4.11 A Link must represent at most, 1 instance of a relationship component at any given time.
4.12 A Link may have a Last Seen Date for tracking purposes if desired.
4.13 In a Hierarchical Link, the CHILD key is the primary driver for the relationship. This is the only instance in which a role-playing (half or part of the relationship) key is utilized. The child key will determine which effectivity satellite record to end-date. This is a defined and repeatable rule/pattern, and for hierarchical relationships is necessary. However, this rule does NOT hold for any other type of link, because it _is_ a role-playing rule based on one side of the composite key.
4.14 A Same-AS link takes the same form as a Hierarchical Link, but provides different context for usage, in that it allows differently named business keys to be “merged together” to a single master key – ie: this key is really the same-as this other key.

Section 5.0 – Satellite Rules
5.0 DEFINITION OF A SATELLITE: Any data with a propensity to change, any descriptive data about a business key – the data in the satellite must be separated by type (grouping) and rates of change (removal of redundancy).
5.1 A Satellite MUST have at least one Hub or Link primary key imported.
5.2 A Satellite cannot be attached to more than one hub – if it needs a composite key, then it must be attached to a Link entity.
5.3 A Satellite MUST have a load-date-time stamp (observation start date) as a part of it’s primary key.
5.4 A Satellite may contain a sequence identifier or and ordering identifier as part of the primary key for uniqueness.
5.5 A Satellite must contain a record source attribute for data tracability.
5.6 A Satellite must contain at least one descriptive element about the Hub or Link to which it’s attached in order to be valid.
5.7 A Satellite may contain system generated or aggregated attributes.
5.8 A Satellite’s purpose is to store data over time.
5.9 A Satellite may contain a CODE to a stand-alone code/description table, however if the code is tracked for history purposes, the code must be LINKED through to the hub on a link table.  Foreign keys are what is being referenced here. FK’s to reference tables ARE ALLOWED, FK’s to a Reference structure which is a single hub with satellite (Code / Description history) is allowed. Indirect references to date time calendar table, or geography is acceptable. These FK’s are NOT to be represented within the data model, if the data architect wishes to represent these, then the requirement to use a LINK table is necessary – but there can be NO LINK’s associated to a SATELLITE structure, this will break the architecture.
5.10 A Satellite MUST-HAVE a Load-End-Date for efficient SQL queries. This is considered best practice for 99% of the RDBMS engines, as they do not yet handle time inherantly within the query sides.
5.11 A Satellite may be split or merged at any time, as long as NO HISTORICAL VALUE is lost, and NO HISTORICAL AUDIT TRAIL is lost.

Section 6.0 – Naming Conventions
6.0 Naming conventsions are enforced in order to meet the needs of very large data models. Without naming conventions, the models will get out of hand and become unmanagable. There are field naming conventions required for the fields in the Data Vault – there’s a different section for suggested naming conventions for Generic Data Vault models, the wizards here will work with the required naming conventions (suggested will be picked up and used if available).
NAMING CONVENTIONS ARE REQUIRED TO HELP HANDLE LARGE MODELS, as long as the naming conventions for each component are Labeled and followed, then the model will be compliant. What the tables are named (prefix or suffix) won’t matter – as long as they follow a standard and documented naming convention.

The naming conventions below are Suggestive, the elements below require a specific standard naming convention.
6.1 ENTITY NAMING CONVENTIONS
6.1.1 Hubs – Either Prefix with HUB_ or suffix with _HUB or the letter “H”
6.1.2 Links – Either prefix with LNK_ or suffix with _LNK or the letter “L”
6.1.3 Satellites – Either prefix with SAT_ or suffix with _SAT or the letter “S”
6.1.4 Hierarchical Links – Prefix or Suffix with HLNK or HIER or HL, please note: Hierarchical link is a FORM of a link with specific rules (see above), it is not a true entity class of its’ own.
6.1.5 Same-As Links – Prefix or suffix with SLNK, or SAL, or SA. please note: Same-As Link is a FORM of a link with specific rules (see above), it is not a true entity class of its’ own.

6.2 FIELD NAMING CONVENTIONS
6.2.1 Record Source – REC_SRC or RECORD_SOURCE or prefix/suffix with RCSRC or RSRC
6.2.2 Sequence ID’s – SEQ_ID or SEQUENCE_ID or Prefix or Suffix with SQN
6.2.3 Date Time Stamps – Prefix or Suffix DTS
6.2.4 Date Stamps – Prefix or suffix with DT
6.2.5 Time Stamps – Prefix or suffix with TM
6.2.6 Load Date Time Stamps – Prefix or Suffix with LDDTS
6.2.7 User (DBO/trigger) watch fields – prefix or suffix with USR
6.2.8 Occurrence Number – Prefix or Suffix with OCNUM
6.2.9 End date time stamps – prefix or suffix with LEDTS

Section 7.0 END DATING STYLES
All styles may choose to use Point In Time Satellites – can be used for end-date indicators, or load-date indicators – basically providing a snapshot of freshness when the information needs to be rolled together. This is a good technique when feeding the tables in a near-real-time (EAI) fashion.
7.2 Style 2: You may use a LOAD_END_DATE or observation end date as an attribute in your satellites. The time between the LOAD DATE and LOAD END DATE is the time span indicating the life of the data.

Section 8.0 Avoiding Outer Joins
In all these styles the queries can get complex if outer-joins are required. To simplify, there are two styles of avoiding outer joins. The preference of most users is style 1.
8.1 Style 1: Insert an empty satellite record (NULL’s or default values for everything except the primary key, and record source) for every new hub key (if the satellite data is not available during that load window). This allows the queries to equi-join and avoid outer joins.
8.2 Style 2: Insert ONE empty satellite record with a PK surrogate key of zero. This requires some tricky logic in the query to join to because the keys no longer match, but provides a single “empty” satellite structure rather than replicating empty records for every hub key. This is not a preferred method.

DEPRICATED RULES – DO NOT USE
END-DATING STYLES
Style 1: No end-dates, the time between consecutively keyed records’ Load Date Time Stamps is the time span for the life of the row, there is a problem with the satellites that doesn’t show until Style 2 is utilized.
7.3 Style 3: Occurrence Number in the primary key of a Satellite. Always keeping the occurrence Number of the current record equal to zero. Older records are numbered accordingly (most recent=1, further back=2 etc..)
***This requires updates to the satellites after loading, so that older rows get re-ordered. This Notion is not typically utilized, as it causes severe hemmoraging at volume levels of data, or near zero latency of data, in fact, in the next revision of standards, this rule may is now phased out completely.

A Link MAY NOT contain the same hub key more than once, unless used as a hierarchical definition. It may contain the same hub key twice if ROLE BASED PK’s are setup (for instance, shipper_id is hub_customer_id and customer_id is also hub_customer_id)
*** THIS RULE IS WRONG. Role based keys cause problems with historical tracking or end-dating of satellites on the links. Denormalizing the same key multiple times in a single link causes too many problems. Please extract the links into multiple granularity.
** DO NOT USE GOING FORWARD

 
 Share on Facebook Share on Twitter Share on Reddit Share on LinkedIn
11 Comments  comments 

11 Responses

  1. ovidiu_t

    Hi Dan,

    When modeling good architecture in source systems, things pretty much fall into place.
    When hitting bad architecture, exceptions to DV rules come along.

    Suppose we have the following definition in a source system:

    Order (
    OrderId (PK),
    OrderNo,
    ClientId,
    OrderDate
    )
    Client (
    ClientId (PK),
    ClientName
    )
    Product (
    ProductId (PK),
    ProductName
    )

    Everything is clear until the OrderLine is defined:
    OrderLine (
    OrderLineId (PK),
    OrderId,
    ProductId,
    Quantity,
    Price
    )
    There is no Unique Constraint on OrderId, ProductId. This means the system allows the same product to be placed more than once on an Order.
    Would this be an example of “bad architecture”?

    Modeling this into a Data Vault would result in something like:

    HUB_Client (SQN_Client)
    HUB_Product (SQN_Product)
    HUB_Order (SQN_Order)
    LNK_Order_Client (SQN_Order, SQN_Client)

    And now regarding OrderLine:
    1. Try to build the uniqueness of the link by adding the OrderLineId alongside the Order and Product

    LNK_Order_OrderLine (SQN_Order, SQN_Product, OrderLineId)

    Because we have here OrderLineId, in my opinion we break the guideline rule that a Link is a list of uniquely identified composite relationships between hub keys and must have 2 or more hubs or link keys combined.

    OR

    2.
    HUB_OrderLine (SQN_OrderLine)
    LNK_OrderLine_Order_Product (SQN_Order_Line, SQN_Order, SQN_Product)

    Here the order line is not a real standalone business entity and that breaks the rule regarding hub definition which states that in the hubs we need to have only business entities.

    I wonder what is the best approach. Are there other approaches you might consider?

    Thanks very much,
    Ovidiu

  2. dlinstedt

    Hi Ovidiu,

    This is generally reserved for coaching/training. These rules and concepts can be found in the Data Vault basics section, but they take quite a bit of explanation to understand. I call this a dependent child, and the line-item-number does belong in the Link table.

    Check out my coaching section for more information.

    Cheers,
    Dan L

  3. Mandy

    Hi Dan,

    How can we maintain history in Links? What if there are some relationships between the hubs that can change over period of time, can we use Load-End-Date to maintain history?

    Thank you in advance.

    Cheers,
    Mandy

  4. dlinstedt

    Hi Mandy,

    History directly in the Link violates the rules. No, we do not allow history to be maintained in the Link structures. I have a full lesson on this in the coaching area that explains why, what happens when you break the rules, and how to overcome this type of need. I also describe it in my technical modeling book (still in draft format inside the coaching area).

    The short answer to making this work is: use a Satellite called an Effectivity Satellite that allows you to set a status flag for a particular date combination.

    Never change the structure of the Link, if you do, you run the risk of re-engineering the model whenever a change comes down from the business.

    Hope this helps,
    Dan L
    PS: Check out coaching at: http://danLinstedt.com/my-coach

  5. Hi Dan, I have a question about the Satellite Rules:
    Is it required that every Hub record has at least 1 record in the corresponding sattelite or is it allowed that there are 0 records?

    I ask this question in relation with late arriving business keys / dimensional data:
    -One only knows the business key from a fact table, so the fact table to is the provider of the hub key
    -And then: do you insert a dummy record in the sattelite for the added hub key, or not?

    Regards,
    Rogier

  6. dlinstedt

    Hi Rogier,

    Thanks for registering, I appreciate it.

    I think it’s time for me to update the rules. No, it’s not required. In fact, Hubs are no longer required to have at least 1 Satellite either… That’s an optional guideline, a suggested guideline. And no, we do not require any dummy records for this reason.

    On the other hand, a “starter” dummy record in the Satellite generally makes the queries work with equi-joins. (especially if you already have other Sats on the Hub). If you have late-arriving data, then NO, I would not recommend using a dummy record at all. The dummy record is there as a “stop-gap” measure for joins, and for purposes of identifying keys that never seem to get any descriptive data, but when you think about it… they should appear in the queries anyhow with NULLS for the Sat info.

    Great question, thanks for the feedback. By the way, I will cover this in detail in the Coaching section as well. (more depth with examples of why/why not, and how to..)

    Cheers,
    Dan L
    Check out my coaching site: http://danLinstedt.com/my-coach

  7. Hi Dan, thanks for the quick reply!

    While we’re onto it, I have another question about Link rules:
    In the datavault definition on wikipedia there is a sentence (I know, it’s wikipedia, but still):

    “Links sometimes link to only one Hub, a construct called ‘peg-legged Link’ by Dan Linstedt. This occurs when one of the business keys associated by the Link is not a real business key”

    Doen’t this contradict the “Link must have 2 or more hubs or link keys combined” rule?

    Regards,
    Rogier

  8. dlinstedt

    Yes it does, and if the Wikipedia entry was corrected, it would continue to read: Peg-Legged Links are not allowed! I think there are 2 situations (only 2 Data Vaults) that I’m aware of where this impossibility exists, and the reason is: because the data was simply TOO awful to create a clear business key. The end-user could even change the surrogate/pk in the source! just horrible…

    Dan Linstedt

  9. Ok, that’s clear…

    But, how then would you model the following scenario:

    I have a Trade (Hub and Satelite) that records multiple Fees per Trade (Unique key: Trade / Fee number). The Fee number is meaningless: it starts with number one for each trade. There is no other potential real Business key that would make the Fee records unique.

    Does ik make sense to make a Hub based on Fee Number? (It is not a real business key)
    If I indeed would model Fee number as an Hub I could make a Fee link table (with links to Hub Trade and Hub Fee number)

    But isn’t it better to make a Link that only Links to the Trade Hub and add the Fee Number to the Link to make it Unique? But that would create the not allowed Peg-legged Link…

    Regards,
    Rogier

  10. dlinstedt

    Rogier,

    Thank-you kindly for your replies and questions, in order to fully answer your questions, please subscribe to the coaching area. This is typically the type of one-on-one service that I provide within the coaching area. it allows me to spend time, answering your questions with diagrams, downloads, best practices, and how-to’s. I appreciate the questions, they are very good questions – and require more time invested on my part to answer them accurately than I can give for free. If Centennium wishes to sign a corprate deal for access to my coaching site, I would be happy to work something out.

    Thank-you kindly,
    Dan Linstedt

  11. Ok Dan, thanks for your support so far.

    I’m going to look into that!

    Regards,
    Rogier Werschkull

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Current ye@r *