Sequence Numbers – for EDW & BI projects

you know of sequence numbers, you’ve heard of them – they’re everywhere!  they started life in the source systems, and have been with us ever since rdbms vendors have declared that they are the “fastest performing component” for join capacities within the database.  in this post i’ll discuss sequence numbers, what they are – why we use them, what they are good for, and why they should never ever leave the database – this includes source systems!

sequence numbers are just that – numerical identifiers that automatically increment for new rows inserted in to a database table.  they are supposed to be *meaningless* keys!!  they are supposed to not represent anything except a row inside a table somewhere.   however, all too often – these sequence numbers get exposed to business users – and the minute they are exposed is the minute they become “business keys.”  why?  because business users latch-on to anything remotely unique about a record, and begin labeling their records by these “keys”.

why is this a problem?  because shortly thereafter someone has a need that isn’t met by the application, or isn’t met by the database – and at that point they try to “alter” the business key (sequence number) to add their own special meaning, because they don’t want the system “overwriting” their data.  but that’s off topic and for another day.

sequence numbers in the staging areas should stay in the staging areas.  they should be used once, and thrown away.  these sequences (if generated for staging) should never leave the staging area – ever.  they should be set to “cycle” and “restart”, and staging databases should be truncated and re-loaded for every batch or mini batch cycle.

sequence numbers in the data vault should stay in the data vault.  they should only be used as “join keys”, and should never ever be allowed out of the data vault to the business users, nor downstream to the data marts.  operators of the data vault should be able to take all the data out of the data vault, and completely re-build the images with data sets, and assign new sequence numbers.  there should be “no coupling” of the sequence numbers to any business processes, nor any outside influence.  this allows the data vault to continually evolve, and not impact anything outside the scope of the data vault database itself.  it also allows you the freedom to move your data vault from one physical database to another.

by the way, it’s not just the data vault that benefits, it’s *any* data warehouse that follows these principles – if the sequence numbers stay in the database, then that de-couples them from the “data” and de-couples them from having any impact in “moving, restoring, backing up” the data set – or changing database engines and having to re-calculate sequences.

sequence numbers in the data marts, should stay in the data marts – (but this is easier said than done, because business users often can access the whole star including the sequence numbers).

in a perfect world…

database engines would number every row automatically – no matter what, and provide “automatic joins” in a data warehousing option, and we wouldn’t have to worry about sequence numbers ever again.  but, we don’t live in a perfect world now do we?

yes, yes, i’m aware of rowid, and other database row identification mechanisms – what i’m talking about is a fast join key embedded (as an option when we build a table in the data warehouse) that allows us to automatically number specific sets of data, and define a logical join (today is a foreign key) in the database.

hope this helps untangle sequence numbers,  remember: the moral here is: once exposed outside the database tables, they will be forever exposed – and they will have a downstream impact whenever something in the database has to change or alter the sequences.

ever had a bad experience with exposed sequences?  i’d love to hear about it…

dan linstedt

Tags: , , , , ,

2 Responses to “Sequence Numbers – for EDW & BI projects”

  1. marius 2010/06/03 at 6:00 am #

    Hi, Dan

    It sounds like you have a similar view on sequence numbers as dimensional modeling. Kimball, however, has a concept of “smart keys” (keys that has some sort of embedded meaning in them). Examples are date surrogate keys in a format of yyyymmdd. This means that the key is no longer meaningless. Kimball cautions to stay away from smart keys, but does allow it for date keys. Reasons being that it makes it easier to partition fact tables on time.

    What is your view on smart keys.

    Kind regards

  2. dlinstedt 2010/06/03 at 6:06 am #

    Hi Marius,

    Smart-Keys have been around way before Kimball ever came on the scene. They were put to use in the Legacy systems, the old-mainframe days – and can still be found in many of the source systems that feed data that were written in the 60’s and 70’s, and even in the early 80’s. Smart keys are helpful to some degree, however they can be dangerous if “lost” or can’t be recomputed based on the data set, or when computed from the data aren’t unique enough to serve the appropriate needs.

    EXTREME CAUTION needs to be taken if smart keys are created in the EDW as Primary Keys (replacing the sequence number surrogates). The challenge is (as stated above): 1) how to compute them from the data set – consistently, 2) how to compute them as 100% unique.

    Regarding Date Keys – yes, you can do this, and yes, it’s numeric, and yes it works. I’ve also converted the DATE to a JULIAN date, and used that representation as well. This is all in keeping with setting up a SEPARATE time dimension.

    Hope this helps,
    Dan L

Leave a Reply