in this entry i will explore the use of implementation of data vault on teradata. however, this entry is applicable to everyone in the data vault community, as it covers definitions and descriptions of primary keys, indexing, surrogate keys, natural keys, and performance and tuning. please note, if you notice that i have quoted or stated something incorrectly, please by all means, leave a correcting comment below.
to be specific, the use of surrogate keys versus natural keys on teradata. the last thing i would want to do is mis-state the teradata environment. this is in response to an excellent question i received via e-mail.
first things first
before i get going, this material is reflective of the kind of training, and in-depth experience i bring to the table in both my on-line courses at http://datavaultalliance.com (we go beyond just data vault) and my on-site classes. this material is the result of actually working hands on with the technologies, and attending classes taught by teradata certified instructors.
i would be happy to bring this level of expertise to your environment, and assist your project in successful data warehouse and business intelligence deployments. please don’t hesitate to contact me directly with requests. i have years of experience in building high quality, high volume, real-time systems on a multitude of platforms. i can customize a course to meet your needs.
please comment if you like this post or find it of value. i would love to hear your feedback on these topics.
according to teradata engineering (ie: todd walter), the teradata engine executes everything in parallel, thus leading to 100% parallel – 100% of the time. at least these were the statements he made during a teradata influencers meeting that i have attended, and these were also the statements made and demonstrations shown during teradata masters training for consultants – which i was lucky enough to be invited to attend.
furthermore, they also made the following statements:
- teradata primarily relies on hash joins
- teradata uses a “primary key” concept – which is the identification of one or more columns for which, the hash result is calculated
- teradata spreads data across mpp environment by calculation of a portion of its hash result
- teradata also is enabled to use a feature called a “join index” which is a many to many table, which can then contain a primary key that enables co-location of different tables’ primary indexed fields
before i get any further…
joe celko wrote a fantastic article about your choices, when to use and apply natural keys, vs the use and selection / generation of surrogate keys. check it out here.
and, if you want to really know how to generate surrogate keys on teradata, then this post written by marcio moura is worth while reading.
yes, there is always the “maybe not so accurate” post, but commonly acceptable information regarding how to define a surrogate key: here on wikipedia.
tom kite (an oracle guru) weighs in with his thoughts on using surrogate keys, a good short read here.
finally, last but certainly not least, a short entry (but really cool one) regarding the use of surrogate keys and slowly changing dimensions is offered by raju bodapati, here. what i find particularly interesting about this post is the following statement near the end: “one good way to address this could be to implement a surrogate key with natural key for the table changing over the course of time.”
this quote reflects the very changing nature of natural keys / business keys, which is a well known fact, albeit they should not change and when they do, they cost the business time and money.
back to basics, and the standards of data vault modeling
the foundations / standards, as i’ve defined them, attempt to solve many of the issues with converting non-temporal data in to a temporal data storage format, while also solving: query problems, ease of load, accountability, usability, and so on. one of the very early (non-published definitions of 1997) of the data vault modeling standards made the statements that surrogate keys were not necessary. however, since that time, i’ve modified the standards – which now dictate that surrogate keys are necessary (or are at least defined to be part of the fundamental standard for the physical data model).
the surrogate keys are not necessary for the architecture of the data vault to “function properly”. however, without surrogate keys, the statement is made and therefore must be accepted, that the natural keys / business keys are non-volatile. there are not many keys which can adhere to this kind of rigor.
now here are some requirements for a “data warehouse” (aka: according to non-volatile, time-variant, etc… definition).
- business keys may or may not be natural keys
- natural keys and business keys may or may not be unique
- natural keys and business keys may change – however, as pointed out by some of the above articles, changes to these “user visible/user accessed/user relied upon keys” may be disastrous from a temporal aspect and an auditing aspect, which both are required by data warehousing.
- surrogate keys used and applied within a data warehouse (any modeling style) are generally there to overcome a) the join issues on many relational database engines, and b) to help bridge the gap over to the temporal world by allowing creation of a versioned record.
let’s suppose the natural key or business key is consistent (stays the same) over time (the temporal aspect), but something else in the record changes. if you are not using surrogate or sequence keys, then it becomes near impossible to create a “new version” of the record and uniquely identify it in the warehouse.
however, looking at it from a data vault perspective, that problem is overcome by the inclusion of a load date field, and possibly a sub-sequence field in the satellite primary key ** note: primary key is not the same thing as primary index in teradata **
in my courses that i teach, i will not only explain the why but provide you with hands on examples of how to do this and demonstrate some of the differences in applied techniques. please contact me today, or sign up for on-line training here.
so, does the data vault model require surrogate keys?
the answer is not necessarily, until you begin to examine the next set of requirements from the standards:
- performance, performance, performance –
- and cross-platform usability (in other words, heterogeneous support for multitude of rdbms, appliance, and newer technologies such as hadoop + hive, or netezza + oracle, or paraccel and sqlserver, etc..)
when you try to use multiple servers with multiple underlying technologies, then linking the data sets together across these environments makes most sense with the implementation of a physical many to many table (aka join index, aka link table,regardless of it being teradata or not).
surrogate keys often take less space, therefore you can store more data in smaller sizes of disk blocks, and smaller numbers of i/o are needed to access the data. data access in a data warehouse costs time, and money – especially in big data environments, so to counter-act this and other ill effects of natural world keys or business keys, the data vault standard says: for your physical data vault model, surrogates are required.
in other words, given today’s changing infrastructure, and looking to store data vault data on multiple heterogeneous platforms, surrogate keys are necessary for transportability, and join capabilities. especially to handle big data in a hadoop environment.
now on the note of keys, identities, and definitions – the data vault model draws it’s definitions and this particular standard directly from dr codd:
a quote from dr. codd: “..database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them …”(dr. codd in acm tods, pp 409-410) and codd, e. (1979), extending the database relational model to capture more meaning. acm transactions on database systems, 4(4). pp. 397-434.
and the follow on from joe celko’s post regarding definition of keys: (to which data vault also subscribes):
“this means that a surrogate ought to act like an index; created by the user, managed by the system and never seen by a user. that means never used in queries, dri or anything else that a user does. ”
this is why my data vault methodology and standards specifically state:
“surrogate keys used in, generated for, and applied in the physical data vault should never leave the data vault, never be shown to the user.”
on to the question at hand…
should i use natural keys or surrogate keys in teradata? which is better for implementing a data vault on teradatata?
the answer, sadly, is it depends. if you choose to use surrogate keys in teradata, then there are the matters of how to implement it, do you leave holes or not? is it done by the loader or an etl tool or the database? the answers to these questions can either limit, or enhance load performance, and at the same time, limit or enhance query performance. however, if you choose to use surrogate keys, then most likely you will need a join-index declaration.
there are several performance documents regarding tuning teradata, one of which is here, and brings to light the following tidbit of interesting information:
” keeping up join index might help, but you cannot multiload to a table which is a part of the join index – loading with ‘tpump’ or pure ‘sql’ is ok but does not perform as well.dropping and re-creating a join index with a big table takes time and space.”
nothing but the best is offered during my world-class training, including hands-on sessions that help you see, use and work with the results. to find out more register for an on-line class here, or contact me directly with your questions.
so is there a way around this?
yes. it’s called the link table. rather than implementing a join-index as teradata has defined it, you can implement a separate link table structure, and generate the primary index on the same field as the primary index of the largest of the two tables. this will ensure co-location of the data set on the nodes so the join does not need to cross amps/nodes to execute. however, this optimizes the join for one way only and thus other queries may need other link tables or link table copies organized differently.
it might not be as fast as maintaining an internal join index, but it gives you the flexibility of adding custom attributes to the link table, along with the ability to co-locate a satellite.
furthermore, the link table is cross-database compatible. the architecture and physical implementation will work on any underlying infrastructure (hadoop + hive, cloudera, netezza, paraccel, etc.. included!) the join index is teradata only.
surrogate keys are an important and integral part of the physical nature of data warehousing, data vault model or not! natural world keys can and should be used – but sparingly, and only if the natural world key is guaranteed to stay consistently single valued 99% of the time (which may be a pipe dream). i believe, after research, that joins are in fact faster (regardless of relational database platform) when using surrogate keys as opposed to natural world or business keys.
a key part of the nature of portability and heterogeneous environments is the link table, instead of the teradata driven join index. the join index may be query optimized within teradata, but it will slow down and inhibit the use of teradata loaders. the link table “appears” as a standard table to teradata, and is simply hashed the same way as one of the parent tables (preferably the largest parent table) to achieve maximum co-location. you can also use any and all teradata loaders for high speed loading by using the link table, but yes – it may or may not take full advantage of the optimizations that a join index offers.
the link table is a standard structure defined within the data vault modeling context. it is a highly powerful and flexible structure when used / applied appropriately. you can already learn more about the link table by purchasing my book: “super charge your edw” – available on amazon, it contains all you need to know about the standards and definitions for the data vault model.
you can also learn more about the implementation of the data vault through signing up for my on-line courses today.
- sequences and identity columns are different – as far as definitions go. identity is an applied function to a column.
- primary keys and primary indexes are not the same thing – as far as teradata is concerned
- teradata uses hashing to “as evenly as possible” distribute the data sets across nodes
- hashing on a surrogate key is the same as hashing on a natural key (or business key)
- surrogate numeric based keys take up less space, therefore generally require less i/o – meaning they are faster at joins, regardless of the underlying technology.
- surrogate numeric based keys are meaningless without being tied to a natural world key, or business key
- natural keys and business keys may be different, or may be the same (just ask my good friend kent graziano about this one)
- natural keys/business keys may change, requiring a multi-key resolution for master data purposes; surrogate keys are good for this when placed in a hierarchical link structure.
so, is this problem exclusive to teradata? i think not. would it be nice if we could use natural world keys/business keys as our primary key (notice the wording here)? yes, it would be wonderful, but given the fact that they change, are sometimes even missing, it is a utopian pipe dream to think that we can apply it in 100% of our edw use cases – hence it is un-enforcable as a common standard.
my deep knowledge and experience in the implementation world allows me to bring this best-of-breed information directly to you. it is also one of the driving factors why i maintain, evolve, setup and own the data vault world wide standards; my experience allows me to continually evolve these standards to suit the needs of the industry – while maintaining their integrity.
having a firm ground of data vault standards to build from ensures your future project success, which is something i care deeply about. i want to personally make sure that i am providing working knowledge derived from application and implementation in the field. contact me today to chat about the possibilities.
i hope to hear from you in the comments below.