you know of sequence numbers, you’ve heard of them – they’re everywhere! they started life in the source systems, and have been with us ever since rdbms vendors have declared that they are the “fastest performing component” for join capacities within the database. in this post i’ll discuss sequence numbers, what they are – why we use them, what they are good for, and why they should never ever leave the database – this includes source systems!
sequence numbers are just that – numerical identifiers that automatically increment for new rows inserted in to a database table. they are supposed to be *meaningless* keys!! they are supposed to not represent anything except a row inside a table somewhere. however, all too often – these sequence numbers get exposed to business users – and the minute they are exposed is the minute they become “business keys.” why? because business users latch-on to anything remotely unique about a record, and begin labeling their records by these “keys”.
why is this a problem? because shortly thereafter someone has a need that isn’t met by the application, or isn’t met by the database – and at that point they try to “alter” the business key (sequence number) to add their own special meaning, because they don’t want the system “overwriting” their data. but that’s off topic and for another day.
sequence numbers in the staging areas should stay in the staging areas. they should be used once, and thrown away. these sequences (if generated for staging) should never leave the staging area – ever. they should be set to “cycle” and “restart”, and staging databases should be truncated and re-loaded for every batch or mini batch cycle.
sequence numbers in the data vault should stay in the data vault. they should only be used as “join keys”, and should never ever be allowed out of the data vault to the business users, nor downstream to the data marts. operators of the data vault should be able to take all the data out of the data vault, and completely re-build the images with data sets, and assign new sequence numbers. there should be “no coupling” of the sequence numbers to any business processes, nor any outside influence. this allows the data vault to continually evolve, and not impact anything outside the scope of the data vault database itself. it also allows you the freedom to move your data vault from one physical database to another.
by the way, it’s not just the data vault that benefits, it’s *any* data warehouse that follows these principles – if the sequence numbers stay in the database, then that de-couples them from the “data” and de-couples them from having any impact in “moving, restoring, backing up” the data set – or changing database engines and having to re-calculate sequences.
sequence numbers in the data marts, should stay in the data marts – (but this is easier said than done, because business users often can access the whole star including the sequence numbers).
in a perfect world…
database engines would number every row automatically – no matter what, and provide “automatic joins” in a data warehousing option, and we wouldn’t have to worry about sequence numbers ever again. but, we don’t live in a perfect world now do we?
yes, yes, i’m aware of rowid, and other database row identification mechanisms – what i’m talking about is a fast join key embedded (as an option when we build a table in the data warehouse) that allows us to automatically number specific sets of data, and define a logical join (today is a foreign key) in the database.
hope this helps untangle sequence numbers, remember: the moral here is: once exposed outside the database tables, they will be forever exposed – and they will have a downstream impact whenever something in the database has to change or alter the sequences.
ever had a bad experience with exposed sequences? i’d love to hear about it…