primary key options for data vault 2.0
this entry is a candid, technical look at the three alternative primary key options in a data vault 2.0 model. there are pros and cons to each selection. i hope you enjoy this factual entry.
(c) copyright 2018 dan linstedt all rights reserved, no reprints allowed without express written permission from dan linstedt
there are three main alternatives for selecting primary key values in a data vault 2.0 model:
- sequence numbers
- hash keys
- business keys
sequence numbers have been around since the beginning of machines. they are system-generated, unique numeric values that are incremental (sequential) in nature. sequence numbers have the following issues:
- upper limit (the size of the numeric field for non-decimal values)
- introduce a process issue during loads, because any child entity must look up its corresponding parent record to inherit the parent's value
- hold no business meaning
the most critical of the issues above is the negative performance impact of lookup or join processes, particularly in heterogeneous environments, or in environments where data is legally not allowed to "live" on, or be replicated to, other environments (geographically split, or a mixed on-premise and in-cloud landscape). this process issue is exacerbated during high speed iot or real-time feeds. consider what happens in an iot or real-time feed when data flows quickly into billions of child records, and each record must then wait on a sequence "lookup" (one record at a time); the real-time stream may back up.
lookups also cause “pre-caching” problems under volume loads. for example, suppose the parent table is invoice and the child table is order. if the invoice table has 500 million records, and the order table has 5 billion records, and each order has at least one matching parent row (most likely more) – then each record that flows into order must “lookup” at least one invoice. this lookup process will happen 5 billion times, once for each child record.
it doesn't matter if the technology is an etl engine, real-time process engine, or sql data management enabled engine. this process must happen to avoid any potential orphan records. if referential integrity is shut off, the load processes for both tables can run in parallel. however, to populate the "parent sequence", it must still be searched / looked up on a row by row basis. adding parallelism and partitioning will help with the performance, but eventually it will hit an upper limit bottleneck.
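the row-by-row dependency described above can be sketched in a few lines. this is a hypothetical illustration (the invoice / order tables and the dict standing in for the parent table are assumptions, not any specific engine's behavior): the parent must be loaded first to mint its sequence, and every child row performs one lookup against it.

```python
# hypothetical sketch: sequence-based parent lookup during a child load.
# the parent "invoice" table maps each business key to a system-generated
# sequence; every incoming "order" row must look up its parent's sequence
# before it can be inserted.
import itertools

invoice_seq = itertools.count(1)   # system sequence generator
invoice_table = {}                 # business_key -> assigned sequence

def load_invoice(business_key):
    # parent must be loaded first so the sequence exists
    invoice_table[business_key] = next(invoice_seq)

def load_order(order_rows):
    loaded = []
    for business_key, order_data in order_rows:    # one row at a time
        parent_seq = invoice_table[business_key]   # the per-row lookup
        loaded.append((parent_seq, order_data))
    return loaded

load_invoice("INV-1001")
load_invoice("INV-1002")
orders = load_order([("INV-1001", "widget"),
                     ("INV-1002", "gadget"),
                     ("INV-1001", "gear")])
```

note how `load_order` cannot start until `load_invoice` has finished; that ordering constraint is exactly what kills parallelism at scale.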
in an mpp environment (mpp storage), the data will be redistributed to allow the join to occur, and it is not just the sequence that has to be shipped– it’s the sequence plus the entire business key that it is tied to. in an mpp engine with non-mpp storage (like snowflake db), the data doesn’t have to be shipped but the lookup process still must happen.
this act of a single-strung, one record at a time lookup, can tremendously (and negatively) impact load performance. in large scale solutions (think of 1000 “tables” or data sets each with 1 billion records or more), this performance problem is dramatically increased (load times are dramatically increased).
what if there is more than one child table? what if the data model design has parent->child->child->child tables, or relationships that are multiple levels deep? then the problem escalates, and the length of the load cycles grows exponentially.
to be fair, let’s now address some of the positive notions of utilizing sequence numbers. sequence numbers have the following positive impacts once established:
- small byte size (generally at most number(38), i.e. 38 "9"s, roughly 10^38)
- process benefit: joins across tables can leverage small byte size comparisons
- process benefit: joins can leverage numeric comparisons (faster than character or binary comparisons)
- always unique for each new record inserted
- some engines can further partition (group) in ascending order the numerical sequences and leverage sub-partition (micro-partition) pruning by leveraging range selection during the join process (in parallel).
what is a hash key? a hash key is a business key (possibly composed of multiple fields) run through a computational function called a hash, then assigned as the primary key of the table. hash functions are deterministic: given the same input x, every single time, the hash function will produce the same output y. definitions of hash functions, what they are and how they work, can be found on wikipedia.
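a minimal sketch of that determinism, assuming a simple upper-case / trim normalization rule and a delimiter (both are illustrative choices, not a standard; md5 is used here only because its output sizes are discussed later in this article, and it is noted below as deprecated):

```python
# sketch of a deterministic hash key: the same business key always yields
# the same hash, so it can be computed anywhere, with no lookup required.
import hashlib

def hash_key(*business_key_parts):
    # normalize each part, then concatenate with a delimiter before hashing
    normalized = ";".join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

k1 = hash_key("CUST-1001")
k2 = hash_key(" cust-1001 ")   # same key after normalization
```

because the function is deterministic, `k1` and `k2` are identical, which is what lets independent load processes agree on a key without coordinating.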
hash key benefits to any data model:
- 100% parallel independent load processes (as long as referential integrity is shut off) even if these load processes are split on multiple platforms or multiple locations
- lazy joins – that is, the ability to join across multiple platforms utilizing technology like drill (or something similar), even without referential integrity. note that lazy joins on sequences can't be accomplished across heterogeneous platform environments; sequences aren't even supported in some nosql engines.
- single field primary key attribute (same benefit here as the sequence numbering solution)
- deterministic – it can even be pre-computed on the source systems or at the edge for iot devices / edge computing.
- can represent unstructured and multi-structured data sets – based on specific input hash keys can be calculated again and again (in parallel). in other words, a hash key can be constructed as a business key for audio, images, video and documents. this is something sequences cannot do in a deterministic fashion.
- if there is a desire to build a smart hash function then meaning can be assigned to bits of the hash (similar to teradata – and what it computes for the underlying storage and data access).
hash keys are important to data vault 2.0 because of the efforts to connect heterogeneous data environments such as hadoop and oracle. hash keys are also important because they remove dependencies when “loading” the data vault 2.0 structures. a hash key can be computed value by value. the “parent” key can also be computed and can be repeated for as many parent keys as there exist values for. there is no lookup dependency, no need to pre-cache, use the temp area, or anything else to calculate each parent value during load processing.
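the independence described above can be sketched as follows. this is a simplified illustration (the hub / satellite row shapes and the sha-256 choice are assumptions for the example): the satellite load derives its parent hub key directly from the arriving business key, so neither load waits on the other.

```python
# sketch: hub and satellite loads computing the same parent key
# independently -- no lookup against the hub is ever needed.
import hashlib

def hash_key(business_key):
    return hashlib.sha256(business_key.strip().upper().encode("utf-8")).hexdigest()

def load_hub_row(business_key):
    return {"hub_hash_key": hash_key(business_key),
            "business_key": business_key}

def load_satellite_row(business_key, payload):
    # the parent key is computed, not looked up, so this load can run
    # in parallel with (or even before) the hub load
    return {"hub_hash_key": hash_key(business_key), "payload": payload}

hub = load_hub_row("CUST-1001")
sat = load_satellite_row("CUST-1001", {"name": "joe johnson"})
```

both rows carry the same `hub_hash_key` even though neither process consulted the other; that is the dependency removal the text describes.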
big data system loads are nearly impossible to scale properly with sequence numbering dependencies in place. sequences (whether they are in dv1 or dimensional models or any other data model) force the parent to be loaded, then the child structures. these dependencies on “parent first – then lookup parent value” cause a sequential row by row operation during the load cycles, thereby inhibiting the scale out possibilities that parallelism offers.
this type of dependency not only slows the loading process down, but also kills any potential for parallelism – even with referential integrity shut off. furthermore, it places a dependency into the loading stream in heterogeneous environments. for instance, when loading satellite data into hadoop (perhaps a json document), the loading stream requires a look up for the sequence number from the hub that may exist in a relational database. this dependency alone defeats the entire purpose of having a system like hadoop in the first place.
hash keys do have their issues:
- length: the storage needed for the resulting hash value is greater than for sequences
- possible collisions (the probability of collision depends on the hashing function chosen)
the first issue leads to slower sql joins and slower queries, because it takes longer to "match" or compare longer fields than it does to compare numerics. hashes in oracle and sqlserver are typically stored in fixed binary form (yes, this works as a primary key). hive and other hadoop based technologies, as well as some other relational engines, must store hashes as fixed-length character strings. for example, an md5 hash result is binary(16), which becomes a char(32) fixed-length hexadecimal encoded string.
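the binary(16) versus char(32) sizes quoted above can be verified directly (the input key here is an arbitrary example value):

```python
# md5 produces 16 raw bytes; hex-encoding doubles that to 32 characters.
import hashlib

digest = hashlib.md5(b"CUST-1001").digest()     # raw binary form -> binary(16)
hexstr = hashlib.md5(b"CUST-1001").hexdigest()  # hex string form -> char(32)
```

engines that lack a fixed binary type pay the doubled storage and comparison cost of the char(32) form on every join.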
the flip side of using a hash is its unlimited scalability in parallel loading. all data can be loaded completely in parallel, all the time, across multiple platforms (even those that are geographically split, or split on-premise and in-cloud). hash keys (or business keys) are part of the success of data vault 2.0 in a big data and nosql world. hashing is optional in dv2. a variety of hashing algorithms are available for use, including:
- md5 (deprecated circa 2018)
- sha 0, 1, 2, 3 – sha1 (deprecated circa 2018)
- perfect hashes
- and more…
the hash is based on the business keys that arrive in the staging areas. all lookup dependencies are hence removed, and the entire system can load in parallel across heterogeneous environments. the data set in the model can now be spread across mpp environments by selecting the hash value as the distribution key, which yields a mostly random, mostly even distribution across the mpp nodes.
"when testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test." (https://en.wikipedia.org/wiki/hash_function) note: there are those who claim not to see the average random, even distribution; that simply means they do not understand the mathematics of the distribution of hash values or how to apply it.
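the chi-squared check mentioned in that quote can be sketched with a few lines of stdlib python (the bucket count, key format, and sha-256 choice are assumptions for the illustration):

```python
# bucket a set of hash keys and compute the chi-squared statistic against
# a uniform expectation; a near-uniform hash keeps the statistic close to
# the number of degrees of freedom (buckets - 1).
import hashlib

def chi_squared_of_buckets(keys, n_buckets=16):
    counts = [0] * n_buckets
    for k in keys:
        h = int.from_bytes(hashlib.sha256(k.encode("utf-8")).digest()[:8], "big")
        counts[h % n_buckets] += 1
    expected = len(keys) / n_buckets
    return sum((c - expected) ** 2 / expected for c in counts)

keys = [f"CUST-{i:07d}" for i in range(10_000)]
stat = chi_squared_of_buckets(keys)
# with 16 buckets (15 degrees of freedom), a uniform distribution keeps
# the statistic near 15; a large value would indicate skewed bucketing
```

the same idea, applied to the mpp distribution key, is how one would demonstrate the "mostly random, mostly even" spread across nodes.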
luckily, hash functions are already designed, and their designers have taken this bit of distribution mathematics into account. the hashing function (if hashing is to be utilized) can be chosen at the discretion of the design team. circa 2018, teams have typically chosen sha-256.
one of the items discussed is that the longer the hashing output (in bits), the less probable a potential collision. this is something to take into consideration, especially if the data sets are large (big data; for example, 1 billion records on input per load cycle per table).
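the relationship between output length and collision probability can be made concrete with the standard birthday-bound approximation (this is general probability math, not a dv2-specific formula; the 1-billion-record figure is the example volume from the text):

```python
# birthday-bound approximation: the probability of at least one collision
# among n uniformly random hash values in a space of 2**bits is roughly
# n * (n - 1) / 2 / 2**bits (valid while the result is much less than 1).
def collision_probability(n, bits):
    return (n * (n - 1) / 2) / float(2 ** bits)

p_128 = collision_probability(10**9, 128)  # md5-sized output: ~1.5e-21
p_256 = collision_probability(10**9, 256)  # sha-256-sized output: far smaller
```

even at a billion keys, a 128-bit space leaves collision odds negligible, and each doubling of the bit length shrinks them astronomically further; this is why the article still insists on a designed collision strategy rather than assuming zero risk.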
if a hash key is chosen for implementation, then a hash collision strategy must also be designed. this is the responsibility of the team. there are several options available for addressing hash collisions. one of the recommended strategies is reverse hash.
this is just for the data vault 2.0 model which acts as the enterprise warehouse. it is still possible (and even advisable) to utilize or leverage sequence numbers in persisted information marts (data marts) downstream to engage fastest possible joins within a homogeneous environment.
the largest benefit isn’t from the modeling side of the house; it’s from the loading and querying perspectives. for loading, it releases the dependencies and allows loads to hadoop and other nosql environments in parallel with loads to rdbms systems. for querying, it allows “late-join” or run-time binding of data across jdbc and odbc connectivity between hadoop, nosql, and rdbms engines on demand. it is not suggested that it will be fast, but rather that it can be easily accomplished.
deeper analysis of this subject is covered in data vault 2.0 boot camp training courses and in data vault 2.0 published materials. it is beyond the scope of this article to dive deeper in to this subject.
business keys have been around for a long time, for as long as there has been data in operational applications. business keys should be smart or intelligent keys and should be mapped to business concepts. that said, most business keys today are source system surrogate id's, and they exhibit the same problems as the sequences described above.
a smart or intelligent key is generally defined as a sum of components where digits or pieces of a single field contain meaning to the business. at lockheed martin, for example, a part number consisted of several pieces (it was a super-key of sorts). the part-key included the make, model, revision, year, and so on of the part. like a vin (vehicle identification number) found on automobiles today.
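a smart key of the kind described above can be sketched as a parse function (the four-part make / model / revision / year layout and the sample value are hypothetical, loosely modeled on the lockheed martin example, not an actual part-numbering standard):

```python
# hypothetical smart part number: MAKE-MODEL-REVISION-YEAR packed into a
# single delimited field, so each component carries business meaning.
def parse_part_key(part_key):
    make, model, revision, year = part_key.split("-")
    return {"make": make, "model": model,
            "revision": revision, "year": int(year)}

part = parse_part_key("LM-F35-B-2018")
```

unlike a surrogate sequence, a human (or a downstream rule) can read meaning straight out of the key without joining to the detail records.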
the benefits of a smart or intelligent key stretch far beyond the simple surrogate or sequence business key. these business keys usually exhibit the following positive behavior at the business level:
- they hold the same value for the life of the data set
- they do not change when the data is transferred between and across business oltp applications
- they are not editable by business (most of the time) in the source system application
- they can be considered master data keys
- they cross business processes and provide ultimate data traceability
- largest benefit: can allow parallel loading (like hashes), and also work as keys for geographically distributed data sets – without needing re-computation or lookups.
they do have the following downfalls:
- length: smart business keys can generally be longer than 40 characters
- meaning over time: the base definition can change every 5 to 15 years or so (just look at how the vin has evolved over the last 100 years)
- sometimes source applications can change the business keys, which wreaks havoc on any of the analytics that need to be done.
- they can be multi-field / multi-attribute
- they can be “not unique or specific enough” to uniquely identify data.
if given the choice between surrogate sequences, hashes, or natural business keys, natural business keys would be the preference. the original definition (even today) states that a hub is a unique list of business keys. the preference is for natural business keys that have meaning to the business.
one of the functions of a properly built raw data vault 2.0 model is to provide traceability across the lines of business. to do this, the business keys must be stored in the hub structures according to a set of design standards.
most of the business keys in the source system today are surrogate sequence numbers defined by the source application. the world is full of these “dumb” machine generated numeric values. examples include: customer number, account number, invoice number, order number, and the list goes on.
source system sequence business keys
source system sequence-driven business keys make up 98% of the source data that any data warehouse or analytics system receives. even transaction ids, e-mail ids, and the keys in some unstructured data sets, such as document ids, are surrogates. the theory is that these sequences should never change and should represent the same data once established and assigned.
that said, the largest problem that exists in the operational systems is one the analytics solution is always asked to solve. that is, how to integrate (or master) the data set, to combine it across business processes and make sense of the data that has been assigned multiple sequence business keys throughout the business lifecycle.
an example of this may be customer account. customer account in sap may mean the same thing as customer account in oracle financials or some other crm or erp solution. generally, when the data is passed from sap to oracle financials, the receiving oltp application assigns a new "business key" or surrogate sequence id. it is still the same customer account, and the same representative data set, but it now has a new key.
the issue becomes: how do you put the records back together again? this is a master data management (mdm) question. with an mdm solution in place (including good governance and good people) it can be solved, and it can be approximated with deep learning and neural networks. even statistical analysis of "similar attributes" can detect, within a margin of error, the multiple records that "should" be the same but carry different keys.
this business problem perpetuates into the data warehouse and analytics solution typically because no master data management solution has been implemented up-stream of the data warehouse. therefore, to put together what appears to be “one version of the customer record” and not double or triple count, algorithms are applied to bridge the keys together.
in the data vault landscape, we call this a hierarchical or same-as link: hierarchical if it represents a multi-level hierarchy, and same-as if it is a single-level (parent-to-child) re-map of terms.
placing these sequence numbers as business keys in hubs has the following issues:
- they are meaningless – a human cannot determine what the key stands for (contextually) without examining the details for a moment in time
- they can change – often they do, even with something as “simple” as a source system upgrade – this results in a serious loss of traceability to the historical artifacts. without an “old-key” to “new-key” map, there is no definitive traceability.
- they can collide. even though conceptually across the business there is one element called “customer account”, the same id sequence may be assigned in different instances for different customer accounts. in this case they should never be combined. an example of this would be two different implementations of sap; one in japan and one in canada. each assigns customer id #1, however, in japan’s system, #1 represents “joe johnson” whereas in canada’s system, #1 represents “margarite smith”. the last thing you want in analytics is to “combine” these two records for reporting just because they have the same surrogate id.
an additional question arises if the choice is made to utilize data vault sequence numbers for hubs and the source system business keys are surrogates. the question is, why “re-key” or “re-number” the original business key? why not just use the original business key? (which by the way is how the original hub is defined).
to stop the collision (as put forward in the example above), whether a surrogate sequence, a hash key, or the source business key is chosen for the hub structure, another element must be added. this secondary element ensures the uniqueness of the surrogate business key. one of the best practices here is to assign geography codes: for example, jap for any customer account id's that originate from japan's sap instance, and can for any customer account id's that originate from canada's sap instance.
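the geography-code practice can be sketched in a couple of lines (the `|` delimiter and the jap / can codes follow the example in the text; the helper name is hypothetical):

```python
# disambiguate colliding source surrogates by prefixing a geography code,
# so customer #1 from each sap instance stays a distinct hub key.
def hub_business_key(geo_code, source_id):
    return f"{geo_code}|{source_id}"

jp = hub_business_key("JAP", 1)   # japan's customer #1
ca = hub_business_key("CAN", 1)   # canada's customer #1
```

joe johnson and margarite smith now land on different hub keys even though both source systems assigned id #1.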
multi-part source business keys
using a geographic code, as mentioned above, brings up another issue. if the hub is created based solely on source system business key (and not surrogate sequence or hash key), then with the choice above (to add a geography code split) the model must be designed and built with a multi-part business key.
the issue with a multi-part business key is join performance. multiple mathematical tests and quantitative results show, time and again, that multi-field join criteria are slower than single-field join criteria. the slowdown only becomes visible in large volume or big data solutions. at that point, a hash key or surrogate sequence in the data vault may be faster than a multi-field join, because it reduces the join back to a single field value.
another alternative is to concatenate the multi-field values together thus forming somewhat of an intelligent key, either with or without delimiters. this would depend on how the business wishes to define a set standard for concatenating the multi-field values (i.e., the rules needed – just like the rules needed to define a smart key).
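one reason a set standard (and a delimiter) matters when concatenating can be shown directly; this sketch assumes a `|` delimiter and a trim / upper-case normalization rule, both illustrative choices the business would have to standardize:

```python
# why a delimiter matters: without one, ("AB","C") and ("A","BC") collapse
# to the same concatenated value and would collide as a single hub key.
def concat_key(parts, delimiter="|"):
    return delimiter.join(str(p).strip().upper() for p in parts)

ambiguous_a = "".join(["AB", "C"])   # "ABC"
ambiguous_b = "".join(["A", "BC"])   # "ABC" -- collision without a delimiter
safe_a = concat_key(["AB", "C"])     # "AB|C"
safe_b = concat_key(["A", "BC"])     # "A|BC"
```

the delimited form keeps the two distinct multi-part keys distinct, and the same concatenated value can then be hashed if a single short field is wanted.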
the last thing to watch when choosing a multi-part business key is the length of the combined or concatenated field. if the concatenated field is longer than a hash result or surrogate sequence id, the join will execute more slowly than a join on the shorter field. as a reminder, these performance differences usually only show up in large data sets (500 million to 1 billion records or more). hardware has advanced, and will continue to advance, to the point where small data sets perform well regardless; a small data set simply does not show enough of a difference to inform the choice of "primary key" for the hubs.
the suggestion ultimately is to re-key the source data solutions, add a smart or intelligent key “up-front” that can carry the data across instances, across business processes, across upgrades, through master data, across hybrid environments, and never change. doing this would centralize and ease the pain and cost of “master data” and would lead to easier use of a virtualization engine. it may not require complex analytics, neural nets, or machine learning algorithms to tie the data sets back together later.
in fact, according to one estimate, fixing these re-keying issues in the warehouse costs the business 7x the money compared to addressing them in the source applications. fixing the problem in the data warehouse is one form of technical debt. (quote and metrics paraphrased from nols ebersohn.)
if the source system cannot be re-keyed, and cannot add an "intelligent" or "smart" (contextual) key, the recommendation is to implement master data management upstream. if mdm cannot be implemented, the next recommendation is to leverage the source system business keys, unless there are composite business keys, in which case a hash is the base level default recommendation.