this post is all about how to build the proper data vault model to handle clob, blob, lob, bin, binary, and image fields – basically, fields that are binary in nature. in this _very_ short post i cover where to put these fields in the data vault model and why. in my new one-on-one coaching sessions i will teach you how to use these fields in etl/elt.
see my posting about my new one-on-one coaching; it is open to a limited number of signups for a 3-month program, and generally has a waiting list for the start of the next program. contact me now at: firstname.lastname@example.org for more information.
these fields are thorny, problematic, and hard to deal with in a data warehousing sense, and many times people make mistakes in the data vault model by not placing these fields in the right structures. remember what i teach: it all begins and ends with the data model architecture. make a mistake in the architecture, and it will cost you dearly!
so, where do we put these fields?
these types of fields belong in their own separate satellite – that’s right, all by themselves. you can also add a column to the satellite called a hash key (if you want). then, you run the binary data through a hashing function to compute the key. the hash key can help you with distinct clauses (removing duplicates from the staging table), and with comparisons against the “current” row in the satellite.
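as a quick sketch of the comparison idea (python’s standard hashlib module; the function name and the payload bytes are just illustrative, not part of any specific tool):

```python
import hashlib

def binary_hash(payload: bytes) -> str:
    """compute a deterministic hash of a binary payload (e.g. an image blob)."""
    return hashlib.sha256(payload).hexdigest()

# incoming staging row vs. the current satellite row:
current_row_hash = binary_hash(b"\x89PNG old image bytes")
incoming_hash = binary_hash(b"\x89PNG new image bytes")

# only insert a new satellite row when the hash differs --
# no need to compare the raw blobs byte-by-byte in sql
if incoming_hash != current_row_hash:
    print("change detected: insert new satellite row")
```

the same hash values can drive a distinct in staging: group rows by hash, keep one per hash, and you’ve removed duplicate blobs without ever comparing the binary content itself.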
why separate the fields?
the point is: separate, divide and conquer. it’s the only way to deal with these particular binary fields. they are long, ugly, and unparsable by standard sql actions. field separation is necessary so that standard structured content isn’t mixed with “unstructured” content. we have no way of knowing how long the fields are, no easy way of comparing the fields against themselves or other rows, and manipulating these fields typically requires special handling routines which end up being slow to execute.
you asked for it… so here it is: suppose i have a staging table (the supporting columns shown here are illustrative – only the badge, the binary field, and the key matter):

create table stg_my_person_record (
    person_badge    varchar(20)   not null,  -- business key
    person_name     varchar(100),            -- illustrative attribute
    person_image    blob,                    -- the binary field
    load_date       timestamp     not null,
    record_source   varchar(50)   not null,
    primary key (person_badge)
);
ultimately, what you need in the data vault model should look like this (again, the supporting columns are illustrative; the structures and primary keys are the point):

create table hub_person (
    person_sqn      bigint        not null,  -- sequence / surrogate key
    person_badge    varchar(20)   not null,  -- business key
    load_date       timestamp     not null,
    record_source   varchar(50)   not null,
    primary key (person_sqn)
);

create table sat_person_info (
    person_sqn             bigint        not null,
    person_info_load_date  timestamp     not null,
    person_name            varchar(100),
    record_source          varchar(50)   not null,
    primary key (person_sqn, person_info_load_date)
);

create table sat_person_image (
    person_sqn            bigint        not null,
    person_img_load_date  timestamp     not null,
    person_image          blob,                 -- the binary field, isolated
    person_image_hash     char(64),             -- optional hash key for comparisons
    record_source         varchar(50)   not null,
    primary key (person_sqn, person_img_load_date)
);
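to show the image satellite in action, here is a minimal sketch of a loader that only inserts a new row when the blob’s hash has changed (python with the standard sqlite3 and hashlib modules; the simplified table and function are hypothetical, not part of any data vault tooling):

```python
import hashlib
import sqlite3

# simplified, in-memory version of the sat_person_image structure
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table sat_person_image (
        person_sqn           integer  not null,
        person_img_load_date text     not null,
        person_image         blob,
        person_image_hash    text,
        primary key (person_sqn, person_img_load_date)
    )
""")

def load_image(person_sqn: int, load_date: str, image: bytes) -> bool:
    """insert a new satellite row only when the image actually changed."""
    new_hash = hashlib.sha256(image).hexdigest()
    row = conn.execute(
        "select person_image_hash from sat_person_image "
        "where person_sqn = ? order by person_img_load_date desc limit 1",
        (person_sqn,),
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged -- skip the expensive blob insert
    conn.execute(
        "insert into sat_person_image values (?, ?, ?, ?)",
        (person_sqn, load_date, image, new_hash),
    )
    return True

load_image(1, "2024-01-01", b"image bytes v1")   # inserted: first version
load_image(1, "2024-01-02", b"image bytes v1")   # skipped: hash unchanged
load_image(1, "2024-01-03", b"image bytes v2")   # inserted: new version
```

notice that the etl never compares blobs directly – all change detection rides on the short hash column, which is exactly why the hash key earns its place in the satellite.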
what this will do is allow you to construct a hash key (make the length of the hash 32, 64, 128, or 256, depending on how unique you need it to be and how much “tolerance for misses” you can accept). the smaller the hash key, the larger the tolerance (the more changes you might miss).
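to make the size/tolerance tradeoff concrete, here is a sketch using python’s standard hashlib; mapping the four lengths onto these particular algorithms is my own illustration, not prescribed by the post:

```python
import hashlib

payload = b"example binary payload"

# common hash functions and the hex-string length each produces:
h32  = hashlib.md5(payload).hexdigest()           # 32 hex chars (128 bits)
h64  = hashlib.sha256(payload).hexdigest()        # 64 hex chars (256 bits)
h128 = hashlib.sha512(payload).hexdigest()        # 128 hex chars (512 bits)
h256 = hashlib.shake_256(payload).hexdigest(128)  # 256 hex chars (1024 bits)

for h in (h32, h64, h128, h256):
    print(len(h), h[:16], "...")

# the shorter the hash, the higher the chance two different blobs
# collide -- i.e. a real change in the binary field goes undetected
```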
by separating the image and the rest of the binary fields into their own satellite, you can effectively model the data vault and produce the proper architecture and handling mechanisms needed to run the etl efficiently.
don’t forget to contact me about reserving your spot in my one-on-one coaching session opening up soon!
hope this helps,