Q: What’s the best way to Test a Data Vault?

a: follow your nose…  (just kidding).  in this post i’ll see if i can begin to describe the procedure of testing data vaults.   i have a full methodology with recommendations, example test plans, rules and standards to follow for making this happen, and i offer this methodology during on-site training

elements of testing

there are many different elements to creating successful testing procedures.  in this post i will outline just a few thoughts and ideas that would get you going.  if you have other ideas, or questions about this process – please feel free to add them as comments on this entry.  i’d love to hear what you’re thinking!  of course, i can’t do justice here with just a single entry alone, so there will be a lot of information yet to come.   however if you want to make the most of your time, and you want to expidite the testing process then i highly recommend that you sign up for my one-on-one coaching section.   elements of testing include:

  • test plan (generally a word doc), refers to test cases, and outlines each one.
  • test cases (specific data sets, instructions on what to do, expected results, and tolerance levels – what would pass, what would fail)
  • testing tools – generally a document defining each tool used to test and what it’s purpose in the testing is.

for data used in testing i would recommend the following:

  • real, raw data sets
  • generated test data sets

using raw data

data vault data is generally raw data sets.  so in the case of the data vault reconciling to the source system is a recommended test.  this can be reconciling to the flat-files that arrive, or reconciling to the source databases.  sometimes there is no “system” to reconcile to because the data arrives on a web-service.  in this particular case, i would instruct you to store the data in a staging area – just for testing purposes later!

using generated test data

it is absolutely vital that you test the processes with predetermined generated test data.  reconciliation is one thing, but testing “domain alignment” and “normalization”, along with “missing keys, nulled attributes” is another.  generating test data that is “sparse”, incomplete, and in some cases fully complete and fully correct allows testing of the relationships, the zero-key generation, the start and end-dates of specific records (ie: the delta processing detection), elimination of duplicates, etc…  there are many many reasons why you should test with generated test data, but keep this in mind: each row of generated test data must mean something, must accomplish a specific goal, and must be documented 1 for 1 in a specific test case.

when you use generated test data, you can “foresee” what the outcome should be, and adjust the expectations to match.  you can dictate in the test case what the expected result should be, therefore – a consistent test case can test new releases (with the same data set over and over again).  it will determine pass/fail states for your baseline data vault data warehouse.

also, with another set of generated “random” test data, you can actually test volume, joins, and performance of the database.  these are functional tests – to make sure the partitioning is right, the performance is acceptable, the indexing works, etc..  these volume tests are really good for testing load cycle performance as well.  you can project what you need to do in order to make the system run faster.

i use a tool called “rowgen” from a company called cosort (iri, inc) to make my tests accurate, fast, and easy. http://www.cosort.com/products/rowgen  there are hundreds of test data generation systems, but what i really like about this product are the following points:

  1.  it’s written native c – it’s fast!!!  with assembler level components!  (many others i’ve seen are way to slow to generate “heavy load”.  i once used rowgen to generate 986mb of data inside 4 hours on a dell poweredge 2650 with 2 cpu’s and no hyperthreading, 4 gb ram – running windows server 2008 32 bit.
  2. it has a set of scripting commands, that allow me run lookups, pre-generate files, generate associations across files, etc…
  3. it can take advantage of multi-core cpu power…  it’s multi-threaded
  4. it’s a command line interface – letting me script the test cases and data generation.
  5. i can setup rowgen to “use” some columns of real-data from real-source files.
  6. it recognizes patterns, and uses a statistical algorithm to allow generation spread of each column for sparsity.
  7. it can run on a mainframe if necessary, or linux, or unix, or windows…
  8. it’s cheap!!!  compared to some other systems, it used to be a $1500 entry point (check with them for current pricing).

there are many many more features and reasons why i like rowgen, but if you do talk to them, just tell them i sent you.  *** note: i am not a reseller, i do not receive any fees for talking about the product, i do not receive any monetary compensation here, and i do not endorse any product i haven’t used or don’t like.  *** in this case, i like the product, i’ve used the product, and i endorse the product.

balancing to real-data

there are some tests that should balance the raw data set to the source system, these balancing routines must be tested very carefully to ensure they don’t produce false-positives, or false-negatives.  once the balancing routines are in place, then testing can be fairly easy.  some suggestions: run: sum(x), count(y), average(z), min(a), max(b), count(distinct c), count( null(d)) etc…  across the satellites for a specific load date, tie together these images against a single load date in the staging area, if they don’t match – you’ve got trouble.  remember: staging areas may contain duplicate rows, where data vaults do not.  so, before balancing against a staging area, remove the duplicates!!!

then: develop similar balancing routines that match the staging (duplicates and all) to the flat files (if the source is a flat file), or the source database (if the source is a database).  compare the outputs.

these types of routines should be coded so they can be run once a week, or daily (depending on load frequency and performance).  they should send emails when things fall out of balance.

what about testing to see if all the data has arrived?

now hear this!!!  source feeds tend to a) grow on a regular basis b) fluctuate with some level of consistency (within a given high/low average)  – these types of things need to be accounted for on “auto-testing” routines.  in other words, if a feed “grows” by an average of 100 rows a day, then you should  set a threshold that says, if i receive anything less than 75 new rows a day, send an email, fail the load… etc…  something may be wrong.  on the contrary you should set another rule: any time i receive more than 125 rows a day, see above….

thresholding for automated “sanity checks” is yet another testing procedure that should be setup.

now, regarding using real data as test data – once a test data set has been selected – it must be backed up!!  it must be used over and over again to test for pass/fail.  it also must be documented in a test case that this is real data that is not to be shared.

remember: testing with an empty data vault, followed by a single day’s load, followed again by a day-2 load is a must.  a day-2 load will allow you test delta’s, and duplicate removal from the staging area.

interested in more information? want test case examples?  contact me for training.

dan l

Tags: , ,

2 Responses to “Q: What’s the best way to Test a Data Vault?”

  1. Binnie 2020/02/12 at 7:44 pm #

    No capital letters and lack of punctuation makes it difficult for the reader. Perhaps you could have had someone QA this article?

  2. Dan Linstedt 2020/04/08 at 7:56 am #

    Hi Binnie, Thanks for the feedback. The website went through a transition process (automated script editing). It changed everything to lower case. Not much I can do about it. Thanks, Dan

Leave a Reply