Data Vault Basics
Here, we offer the best of breed, hybrid data modeling solution for your Enterprise Integration needs. Join the growing community today, and interact with the Data Vault community. Learn how to meet the needs of the enterprise faster, cheaper, and more reliably.
The Data Vault architecture offers a unique solution to business problems and technical problems alike. It is focused squarely at the data integration efforts across the enterprise and is built from solid foundational concepts. A key to understanding the Data Vault is understanding the business. Once the business is mapped out and the practitioner has a firm grasp on how the business operates, then the process of building the Data Vault can commence.
The Data Vault has many benefits which are produced as a by-product of the basic engineering. Sticking to the Data Vault foundational rules and standards will help get any integration project off the ground quickly and easily. There are several areas of the Data Vault which we’d like to cover with you before diving into the community / forums. In case you are interested, you can also read about some of the ISSUES faced by those who undertake the Data Vault modeling.
It is very easy to convert both 3rd normal form and Star Schema to Data Vault model architecture, here we show how to convert from 3rd normal form. Inside the community we walk through the conversion steps to go from Star Schema to Data Vault model.
| Data Vault Benefits | |
![]() |
Manage and Enforce Compliance to Sarbanes-Oxley, HIPPA, and BASIL II in your Enterprise Data Warehouse |
![]() |
Spot business problems that were never visible previously |
![]() |
Rapidly Reduce business cycle time for implementing changes |
![]() |
Merge new business units into the organization rapidly |
![]() |
Rapid ROI and Delivery of information to new Star Schemas |
![]() |
Consolidate disparate data stores., ie: Master Data Management |
![]() |
Implement and Deploy SOA, fast. |
![]() |
Scale to hundreds of Terabytes or Petabytes |
![]() |
SEI CMM Level 5 compliant (Repeatable, consistent, redundant architecture) |
![]() |
Trace all data back to the source systems |
The following are the concepts which we cover on this site:
- Definition of the Data Vault
- Data Vault Benefits
- From Business Case to Data Vault
- Data Vault in 5 Easy Steps
- Data Vault Through Pictures
- SEI / CMM / Compliance
Definition: The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
![]() |
Extensive possibilities for data attribution. |
![]() |
All data relationships are key driven. |
![]() |
Relationships can be dropped and created on-the-fly. |
![]() |
Data Mining can discover new relationships between elements. |
![]() |
Artificial Intelligence can be utilized to rank the relevancy of the relationships to the user configured outcome. |
In Business Speak, it’s the ability to adapt quickly, model the business accurately, and scale with the business needs – converging IT and the business to meet the goals of the corporation. The Data Vault is a data integration architecture; a series of standards, and definitional elements or methods by way information is connected within an RDBMS data store in order to make sense of it.

The following business benefits are available as a result of building a Data Vault within an organization:
![]() |
Manage and Enforce Compliance to Sarbanes-Oxley, HIPPA, and BASIL II in your Enterprise Data Warehouse |
![]() |
Spot business problems that were never visible previously |
![]() |
Rapidly Reduce business cycle time for implementing changes |
![]() |
Merge new business units into the organization rapidly |
![]() |
Rapid ROI and Delivery of information to new Star Schemas |
![]() |
Consolidate disparate data stores. |
![]() |
Implement and Deploy SOA, fast. |
![]() |
Scale to hundreds of Terabytes |
![]() |
SEI CMM Level 5 compliant (Repeatable, consistent, redundant architecture) |
![]() |
Trace all data back to the source systems |
Data Vaults are extremely scalable, flexible architectures which allow the business to grow, and change without the agony and pain of “long IT lists of change impacts, and large cost outlays.” Typically when Businesses request a change to the data models (as a result of business changes), IT comes back with high costs, long implementation and test cycles, and long lists of impacts across the “enterprise warehouse”. With a Data Vault this is not the case, typically new functional areas of business are added quickly and easily, changes to existing architecture take less than 1/2 the traditional time, and usually have much less impact on the downstream systems.
The technical benefits to a Data Vault are listed below:
![]() |
Near-Real-Time Loads |
![]() |
Traditional Batch Loads |
![]() |
In-Database Data Mining |
![]() |
Terabytes to Petabytes of information |
![]() |
Incremental Build Out |
![]() |
Dynamic Model Adaptation |
![]() |
Business Rule Changes (with Ease) |
Since the Data Vault is based heavily on business process, it’s important to see how the business model represents the Data Vault, and how to make the transition from one to the other. Below is a series of descriptions and transitions that take you from one state of the Business Case model to the physical Data Vault Data Model that represents it. It also indicates how tightly tied the architecture is to the business itself. It also indicates how quickly the model can change when the business changes.
For this example, let’s examine a company with a marketing department that wants to build a sales campaign to sell slow moving products. The hope is that the customer will see the campaign, like the new low price of the product, and make a purchase from the company. When the analyst re-iterates what they heard in the user interviews, the analyst says to the business users, that he heard them say: “Marketing generates a sales invoice”. The business users quickly correct him (and the case model) and state that Finance generates the invoices. Of course, for this example hopefully the customer then provides feedback to the company about the marketing campaign, and company friendliness.
In the corrected business case, we change the model from the representation above, to the representation below. Not a large change, just a difference in departments. Hopefully there is communication between Finance and Marketing (which is not shown in this case example).
When we move on to discuss implementation of the business model, we begin to pair down the scope of the project – in order complete it in budget, on-time, and with limited resources. In this particular case, the Company decides that we can add Marketing, Finance, and the enterprise (company) later – thus making it out of scope. They tell us that we should focus on the implementation of campaigns, invoices, products, and customers. All of which are free-standing elements in business and have their own “tracking numbers”. In other words, it’s campaign MKT-1, Invoice Numbers, Customer Accounts, and Product Numbers that tie all this data together.
Upon further investigation, we discover that the business wants to track campaign effectiveness ratings, dates (length of time for the campaign), dates and amounts on invoices and line-items, products and their availability dates, descriptions, stock quantities and defect reasons; finally we discover they already have customer addresses, contacts, and other demographic details. The next model that we build is based on the concepts of the Data Vault and ties directly back to the business descriptions of granularity of data, and keyed information. For the purposes of scope and this example, we will limit the logical data model to the area outlined in RED.
Below is what our first cut logical Data Vault data model looks like. We take the business keys such as Invoice Number, and Product Number and build them into their own hubs. Then, we take the interaction between invoices, and products and build a link table called Link Invoice Line Item. Line items cannot “stand alone.” They depend on other key information to locate, and describe what they belong to. There is however, an error in this model (see HUB CUSTOMER ID embedded in the Link Invoice Line Item).
The error is a result of shifting grain into the wrong location. The correct grain is: invoices are tied to customers (usually 1 to 1 correspondence), with a customer capable of having many invoices. It is very rare (although does occur) that a business can actually have different customers per line-item, all build on the same invoice (this is the grain represented above). The proper line-item link table would not contain a customer ID. The corrected model would have an additional link between Hub Invoice and Hub Customer, to represent the interaction between customers and invoices.
Is the Data Vault Simple or Complex? Is it Easy to implement? Yes, it’s both simple, and easy to implement. It’s based on a set of redundant structures, and auditable principles. By utilizing the Data Vault standards your project will automatically gain the benefits of auditability, scalability, and flexibility. The following set of web-pages will guide you through the process of building a Data Vault in 5 easy steps.
Step 1: Establish the Business Keys, Hubs
Step 2: Establish the relationships between the Business Keys, Links
Step 3: Establish description around the Business Keys, Satellites
Step 4: Add Standalone components like Calendars and code/descriptions for decoding in Data Marts
Step 5: Tune for query optimization, add performance tables such as Bridge tables and Point-In-Time structures
Build your data marts, and your ETL loading processes, and away you go. Building a Data Vault gets easier as you go, eventually replacing the “band-aid” methods commonly used in Enterprise Integration Architectures. The model is built in such a manner that it can easily be extended when desired. The worlds smallest Data Vault consists of 1 Hub with 1 Satellite, say customer. The flexibility is built into the link-table concepts.
For those of you wishing to view examples of Data Vaults, we’ve provided some samples of different models. These are for viewing purposes, and are generic models which can be customized to meet your needs. The DDL for the Northwind Data Vault is available inside the forums. This is a standard data model, and full size images are available on request.
We start with a simplified 3NF data model, often times the source data model represents the business as it stands today – at least it represents the business processes which collect the information. Of course, we wish to correct some of the errors occurring in the business process as well.
Northwind 3NF Data Model:
The following is the first step in the series of 5 identified above. Identifying the business keys, and placing them in the standard Hub Structures. This can be challenging if the Business Keys are “smart-keys”, or composite keys made up of several identifiable relationships. Although, normalization to the nTH degree is discouraged, it is good to identify and document the metadata for these composite keys.
Northwind Data Vault Model, HUBS Identified

Step 2, identifying the Links or relationships between the business keys. This process can be a little tricky sometimes – particularly if the data set says the business keys are “weak”, and can’t stand on their own, or are non-identifying (non-unique). In these cases, we can end up with one-legged link tables, which is something we work through to correct during the modeling process. Again, the Link tables represent relationships, transactions, hierarchies, and define the grain of the data on the intersection.
Please note the importance of grain defined between order, employee, shipper and customer. The grain of this relationship (link table) is defined by the source system, however it is subtle and easy to miss.
Northwind Data Vault Model, Hubs and Links
The final image below shows a complete Data Vault, with all the hubs, links and satellites available. While the white-papers have defined Point-In-Time tables, and Bridge Tables, they are not required by the architecture. These alternate tables are utilized strictly for query performance reasons. There are certain MPP databases which can execute without “query assist tables” such as point-in-time and bridge tables. That said, many of the Data Vaults in place today are not implemented on MPP systems, and thus require these tables to be present in order to reduce the number of joins.
COMPLETE: Northwind Data Vault Model, Hubs, Links and Satellites
The Data Vault comes with rules, or standards which make the design repeatable, and redundant. Part of SEI/CMM (Software Engineering Institute, Capability Maturity Model) is to have organizations reach a level of business processes which are: documented, repeatable, redundant, fault-tolerant, and eventually: automated. It also defines risk analysis, KPA’s (key process areas) and KPI’s (key process indicators – metrics) for the organizations by which they can measure the improvement and accuracy of these business processes. These are equivalents to ISO900x, PMBOK, Six-Sigma practices in other disciplines.
Compliance has been around for years in the private sector, and when the Government put a mandate out that all government contractors will become SEI/CMM Level 5 compliant, compliance took on a whole new meaning. Data traceability came bubbling to the top, accountability of business users to find and fix their own business problems also rose to the top.
Compliance itself has many meanings, but in the case of Sarbanes-Oxley and Basil II/III accords, and other compliance initiatives – it strikes hard on data traceability, and business accountability. The IT staff must be enabled through the use of their models (Data Vault or Not) to track data as it stood, when it was created. In the case of the Data Vault, we’ve made it easier. The Data Vault provides a series of standard fields which track data changes by date, and where they came from – and the Data Vault is always modeled to hold the lowest possible grain of data from the source systems. That is to say: data is integrated, but not modified when it is stored in the Data Vault. The Satellites are split by type of data and rate of change which allow reduced storage requirements, and increased traceability.
The business rules which are usually implemented “on-the-way-in” to the data warehouse, are moved, shifted, to be implemented “on-the-way from the warehouse to the data marts”. Thus, allowing a single active data warehouse (Data Vault model) to be built as a statement of fact, and each of the data marts can represent their own “version of the truth” based on a single sourcing point.
When following the standards in loading information into the Data Vault, the IT staff automatically inherits “data compliance and traceability”. The business can point at the data mart and claim that it’s “wrong” today, and “right” tomorrow, but IT can always show where the data came from, when it came in, and what it looked like before alteration/transformation, aggregation and cleansing – thus meeting the compliance initiatives becomes easier.
SEI/CMM processes are defined as a part of the standards and have been built into the architecture, making the loading process, querying process and discovery process repeatable, redundant, and fault-tolerant.















Hi, Dan
When you talk about surrogate keys (for Hubs), does it have the same meaning as in Dimensional Modeling, i.e. a meaningless unique integer sequence number that acts as the primary key of a table (or Hub)?
Yes, all surrogates in the Data Vault follow standard definitions. They ALL are meaningless unique sequences. Now, as to the “integer” part – not necessarily. They may be number(20) in length – but that’s physical semantics. By the way, surrogates should NEVER EVER be let out of the database engine. The surrogates in the Vault should STAY in the vault, they are good for two things and two things only: 1) unique identification of rows, 2) joins. They should never be shown to business users, they should never be propogated out from the DV.
I’ll write another blog post on this one.
Cheers,
Dan L
I had another question, but as I was typing it, I started to answer myself. I’ll post it here anyway. I’ll pose the question in the form of an example. Suppose we have a HUB_Employee. I was tempted to put information that remains static (for example DOB and Gender) inside the hub instead of inside a Satelite of HUB_Employee. The problem with doing this is that you assume that the source for employee information will always be correct. However, if the gender in the source arrives as male (incorrectly) and is later changed to female (correctly), this change still needs to be captured. To do this requires a satelite.
Gender should never be part of a Hub structure, Gender is not a static element, precisely for the reason you mentioned. Using DOB and Gender for Hub Employee is not necessarily a good way to go. Employees should have a hub based on “badge number” or “employee number” – something the company assigns.
On the other hand, person… well – people are very difficult to identify uniquely, short of a DNA sequence, and even that isn’t 100% fool-proof.
Cheers,
Dan L
Dan, how would you go about modelling super-types and sub-types. For example, the adventure works 2008 R2 sample data base has a person table, with employee, sales person, etc. sub-types.
Marius
Hi, Dan
You mention that the DV contains 100% of the data 100% of the time. Does this include audit info in the source, like data changed, rowid’s, row guids, etc.?
Marius
Marius,
It includes any data that the source system has or generates. Anything an auditor would need to reconstruct what the source looked like at a specific point in time – thus the EDW passes the audit.
Hope this helps,
Dan L
This is covered in the forums. http://www.dataVaultInstitute.com – also, this is covered in class. The explanation is quite lengthy, so this would be a question best left to consulting or training.
Thanks,
Dan L
Hi Dan,
Are there any DVs that are being queried directly by end user tools (reporting tools), or do all the querying happen from a DM on top of the DV? If it is being queried directly, do you know roughly what percentate are being queried in this manner?
Marius
Hi Marius,
Yes. There are many that are built this way. There are three points to consider:
1) Direct query can only happen if RAW data is delivered to the business, and this is OK (signed off by business)
2) Direct query happens through Views, often called “virtual data marts” – and the joins underneath are pre-done. The views “look” like Dimensions and Facts to the query tool.
3) Direct query through views (virtual marts) only works as long as performance is fine. Once performance drops below acceptable levels, the views MUST be instantiated (physically built as tables).
Percentage wise, I couldn’t tell you – that information has not been disclosed to me by the customers.
Hope this helps,
Dan L
Hi Dan,
What do you mean by “and the joins underneath are pre-done” in point 2?
Many thanks
Marius
Hi Marius,
They are built-in to the views. In other words, the VIEW SQL joins the necessary hubs + sats = dimensions, links+sats = facts so the “BI query tool” doesn’t care about what’s underneath.
Cheers,
Dan L
Understood. Thanks
Hello Dan,
I have a question about the Business Keys you use in the hubs. For example in the Hub Customers you use the CST_NAME as the business key. Maybe I am wrong but I think the Business Key of this table in the Nortwind database is the CustomerID. This is the same for some other hubs.
Cheers.
Hi Ronald,
Yes – to make a point, I chose Customer Name (CST_NAME) in this example, that Natural World keys should be elected whenever possible. However, for this example it is an academic choice.
In reality – you are correct. CustomerID is the business key of the Northwind database (as are the other ID’s). The reasons are, that a) Microsoft had database developers build the model, not data architects, b) there was no real-world application on top of the database, c) there is no manner of garaunteeing that the same CustomerID nor even the same Customer Name will remain consistent throughout the releases of the Northwind database, d) the data model already changed from Northwind to Adventure Works which is an extension but has the same problems….
Again, it was an academic choice for example only. I highly recommend that you make a decision for business keys based on the business processes in the source systems, as well as what the business users see, use, and have access to. If it’s ID fields then so be it. Hopefully there will be some natural world key that makes more sense.
Hope this helps clear a few things up, thanks for the comment.
Dan L
Hello Dan,
I am trying to create a little datavault of some tables from a source system. I am wondering if it is possible to load the data from the staging area (1:1 copy of the source) directly into the datavault. I think it is easier to use an interfacing datavault as a staging area for the datavault which will be truncated and filled every day. By doing this, it is much easier to process the updates/inserts/deletes and also important, to use the surrogakeys which are created in the first step.
Example:
There is a select statement from a source system system which contains a sale, a product and a customer. This transaction requires two hubs in our datavault, a link and a satelite.
Hub_Product
Hub_Customer
Link_Customer_Product
Sat_Customer_Product
The select statement containt the following row:
ProductCode = AC1202
CustomerCode = CUS3033
Quantity = 12
Price = $ 12
When we make ETL, the productcode goes to the hub. A sequence number will be generated for the business key. Same for the customer. (Only if the business key is new).
Now my question. How can we proceed with filling the link and the satelite because we need the sequence numbers which are generated in the hubs. Can this be done in one step? Also when inserting the business keys into the hubs, there must be a check if it exists or is deleted. When doing this in 1 step, the dataflow taks for example in SSIS wil be very, very complex and not easy to understand.
Greets. Ronald
Hi Dan,
I am a student who study in Master Degree and doing my thesis as \a Data Vault to a Star Schema transformation algorithm\. if i am not wrong i see that a hub table could be chosen to be a fact table in some case. But i am still trying to build my algorithm. in the last, if it is possible, could you give me any kind of feedback for my algorithm. if you send or give me email , i will send you my documents by email to you.
kind regards,
Sainaa
Hi Ronald,
I think I understand what you want to do, but I’m not convinced there is value in bypassing the true staging area. I will add a blog entry on this topic, as a reply is a little too lengthy. However, please keep in mind that I offer much of this explanation with PowerPoints, Videos, Documents, and how-to coaching to answer these questions inside my coaching area. I offer one-on-one direct web-conference with you to walk through these questions as well.
Look for the new blog post today, and I hope I will have enough information there to answer your question.
Cheers,
Dan Linstedt
Hi Sainnaa,
I have an algorithm already in place. In fact, I already have a tool: RapidACE (http://rapidACE.com). The tool and the web site is currently under construction, I’m not sure yet what I will do with it. I will tell you this:
1. Going from DV to star, the general rules are: HUBS+HUB SATS roughly equal a DIMENSION. LINKS+LINK SATS roughly equal a FACT, and maybe a degenerate fact as well.
2. The ISSUE is as follows: the DV stores RAW data, the star schema should store transformed/cleansed data. So, the question to you (for your masters degree) is to figure out how to leverage a business rules engine/metadata repository and generate business rules in to the data loading routines.
This is not an easy issue to tackle, and it’s not as simple as drag and drop ETL GUI, because most business users do not use ETL engines. Now, I will be able to help you with your masters degree – but, you have to sign up for one-on-one coaching. I have a discounted rate (flat fee) for students, and am currently helping an individual get his doctorate degree.
Use the CONTACT form if you are interested in the coaching section for a discounted fee.
Thanks,
Dan Linstedt
Thank you for your advice even though i cannot register for coaching section because of my thesis deadline (beginning of the September). I am developing the transformation algorithm and a tool for user choices such as filtering, selection, aggregation, fact and dimension ….. if you send me any kind of advice to my email (oss...@yahoo.com), i will appreciate very much.
Greets, Sainaa
You can try an open-source tool called Quipo, check the web-site: http://www.datawarehousemanagement.org/ From a company called Qosqo out of the Netherlands. They are producing Data Vault models, Star Schemas, and ETL.
Regarding advice, send me your top 10 questions about the Data Vault Model, I’ll do my best to answer them on the blog. Use the contact form, or my email to get them to me.
Hope this helps,
Dan Linstedt
Hello Dan,
I have some questions on how you can trace data back to its source. Often in the environments that I work there are times where you have to merge 2 or more sources into one. For example I had a customer that had 2 sources of members. We had to do a merge based on a number of fields then we created a third member table that was the merged customers. It seems like all this would have to be done before the Data Vault. Is that correct? If so by the time we get to the Data Vault we already lost the traceability of the data since when merging 2 or more records we combined fields from the different sources. In the data vault everything would appear as coming from one source the merged table.
How would you handle this scenario? Do you data vault all the source tables as well as the merged table?
Thanks,
Cristian.
Hi Christian,
This is actually a topic I cover at length in my new technical modeling book: http://learnDataVault.com/purchase-book But I’ll try to give you a short answer here. #1) no, we never merge data, change data, or combine data up-stream of the Data Vault. This destroys auditability. #2) In order to keep the auditability in tact, we use the integration by business key (when available) according to the Hub, these two sets of incomplete information should both be linked to the same business key, therefore #3) we construct two different Satellites, one for each source system.
When you go to build the mart, is when you would merge the two Satellites together to form the correct picture.
Hope this helps,
Dan Linstedt
[...] week I attended a session at the office about Data Vault. This is a subject which isn’t discussed very often within the Oracle BI community. [...]
Hi Dan,
The example you use shows a single source system, where as a more realistic scenario would include several.
How would you model a scenario where you have 3 different Invoice systems ? Would they all be combined into one Hub_Invoice ? How do you account for the different (likely incompatible) attributes in this scenario ?
Thanks,
Leigh.
Hi Leigh,
Generally the multiple source systems *must* be integrated by Business Keys. When the keys are at the same grain, and are defined by the business to mean the same thing, then they are integrated in to the same hub. The Satellites hanging off the Hub are where the multiple attributes for different source systems are defined. I have a couple of chapters that discuss this very topic in my new e-book: Super Charge Your Data Warehouse, you can purchase the book at: http://LearnDataVault.com/purchase-book
Hope this helps,
Dan Linstedt
[...] Data Vault Basics (Dan Linstedt的blog上的文章) [...]
Hi Dan,
I was wondering how slowly changing dimensions are treated in DV.
Thanks
Maurizio
Hi Dan,
Just a small remark. It seems to me that you’re not talking about “business cases” but “uses cases”, a business case is a completely different beast.
And a mispelling, it’s not Basil II but Basel II.
Regards,
Edwin Lima
[...] good practice. Choosing whether to go with a Dimensional Model, Corporate Information Factory, Data Vault, OLAP Cubes, or some combination or variation of these methodologies, is a good place to start. [...]
[...] Dan Linstedt, créateur de la méthode de modélisation par Data Vault (par voûtes de données) donnera une formation de 3 jours assortie d’un examen de certification. Elle se tiendra à Montréal (Canada) dans les locaux d’AgileDSS du 25 au 27 janvier 2012. Pour plus de détails sur la formation, rendez-vous sur http://blog.agiledss.com/frPublié dans: janvier 13th par Omar Djaiz [...]
[...] to each other). Wikipedia got a good introduction of the topic and Dan Linsted provides a good introduction as well, but there are also books on the [...]