really now, what is trash data? data gone bad! you know as i do that data is an asset. what most people take for granted (or believe) is that only “good data has value.” i’m here to tell you that belief couldn’t be further from the truth. by now, i’m sure you’ve heard the saying: if it isn’t broken, don’t fix it. right? well, how do you know what’s broken if there is no bad data? i guess your stuff doesn’t stink, right? well, read on to find out why bad data really can harbor a gold mine of information.
headlines seen in traditional bi shops
- don’t provide “bad” data to business users, they’ll end up not trusting the data warehouse
- “bad data” should never be brought into the warehouse, it has no value
- “bad data” should be changed into good data before presenting it to business users
- “bad data” = “bad decisions” in a business intelligence solution
- storing “bad data” only contributes to the “big data” problems
- “bad data” can’t be loaded properly, ever…
too often, i hear these headlines (and many more) shouted by bi / edw team members, and even the business users chime in. they all seem to chant the same thing: “just put a band-aid on it.” they are not that direct, but what they say amounts to: “just fix the bad data, make it good before giving it to me…” what? as we will find out, this is a very bad practice to continue in business intelligence projects.
just what is “bad data” anyway?
bad data is a symptom: it shows up as the result of a different problem. bad data is often defined as data that does not meet the business users’ perception of today’s reality. in other words, data that “breaks business rules”, “doesn’t align with today’s wishes of the business user”, or of course, data that is purely “wrong”. but really, what makes the data “wrong” in the business users’ eyes?
the real question is: what causes bad data? there are only two fundamental causes of bad data:
- perception (it and business users): i.e. the data does not align with the business users’ perceptions
- process errors (errors encoded along the way, process failures, missing edit checks): i.e. a process failed somewhere and either produced faulty data or allowed faulty data to be entered.
really, there is no other explanation for “bad data”, so why all the fuss? what’s all the hubbub?
in reality, data in and of itself is neither bad nor good, it just is… it exists, therefore we must manage it, categorize it, allocate it, file it, change it, and delete it (when appropriate). data are just a set of facts.
ok, so given that statement, the biggest problem with bad data is that it gets a bad rap when it falls out of alignment with what the business believes to be true. yes, you heard me….
it’s all in the perception baby!
now, here’s the thing: if the business requirement says: “today every account balance must have one and only one customer attached to it.”
what happens when we load a data warehouse from an auditability perspective? well, when we load “recent” history (past 3 months), then most likely this statement/this requirement holds true. everything is hunky-dory (going well). however, when we begin to load data that is 5 years old, or even 10 years old, or perhaps we load external data delivered from a system outside our company… what do we find?
- account balances without customer numbers
- customers without account balances
- null account balances without customers
- accounts tied to two or more customers
- single customers tied to two or more accounts
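to make the five patterns above concrete, here is a minimal sketch of how they might be detected in raw rows. the row layout (account_id, customer_id, balance), the sample values, and the function name classify are all assumptions made for illustration; none of them come from a real source system.

```python
from collections import defaultdict

# hypothetical raw rows: (account_id, customer_id, balance); None marks a missing value
rows = [
    ("A1", "C1", 100.0),
    ("A2", None, 250.0),   # account balance without a customer number
    ("A3", None, None),    # null account balance without a customer
    ("A4", "C2", 75.0),
    ("A4", "C3", 75.0),    # account A4 tied to two customers
    ("A5", "C1", 20.0),    # customer C1 tied to two accounts (A1 and A5)
    (None, "C4", None),    # customer without an account balance
]

def classify(rows):
    """group rows by the rule-breaking pattern they show, without changing any row."""
    customers_per_account = defaultdict(set)
    accounts_per_customer = defaultdict(set)
    for account, customer, _ in rows:
        if account is not None and customer is not None:
            customers_per_account[account].add(customer)
            accounts_per_customer[customer].add(account)

    findings = defaultdict(list)
    for row in rows:
        account, customer, balance = row
        if customer is None and balance is not None:
            findings["balance_without_customer"].append(row)
        if customer is None and balance is None:
            findings["null_balance_without_customer"].append(row)
        if customer is not None and balance is None:
            findings["customer_without_balance"].append(row)
        if account is not None and len(customers_per_account[account]) > 1:
            findings["account_with_multiple_customers"].append(row)
        if customer is not None and len(accounts_per_customer[customer]) > 1:
            findings["customer_with_multiple_accounts"].append(row)
    return dict(findings)

findings = classify(rows)
for pattern, matched in findings.items():
    print(pattern, len(matched))
```

note that nothing is corrected or thrown away here: each row is only labeled, so the patterns stay available for analysis.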
you see, none of these errors is very important by itself. it’s the results, i say; yes, it’s the results of the pattern analysis that make all the difference in the world.
going gold hunting….
where do we find the results? what difference do the results make?
- you need to understand why these errors are happening
- you need to establish what percentage of the total data these errors affect
- you need to establish how often these errors occur over time
- you need to find out if these errors are still occurring
- you need to know where these errors are coming from
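two of the questions above (what percentage, and how often over time) come down to simple counting once the rule-breaking rows have been kept and flagged. the sketch below assumes a hypothetical audit list of (year, broke_rule) pairs; the years, the flags, and the function names are made up for illustration.

```python
from collections import Counter

# hypothetical audit records: (year the row was created, whether it broke the rule)
audit = [
    (2005, True), (2005, True), (2005, False), (2005, False),
    (2010, True), (2010, False), (2010, False), (2010, False),
    (2015, False), (2015, False),
]

def overall_error_rate(audit):
    """share of all rows that broke the rule."""
    return sum(1 for _, broke_rule in audit if broke_rule) / len(audit)

def error_rate_by_year(audit):
    """share of rule-breaking rows per year, ordered by year."""
    totals = Counter()
    errors = Counter()
    for year, broke_rule in audit:
        totals[year] += 1
        if broke_rule:
            errors[year] += 1
    return {year: errors[year] / totals[year] for year in sorted(totals)}

print(overall_error_rate(audit))
print(error_rate_by_year(audit))
```

a rate that falls to zero in the most recent year suggests the errors are no longer occurring; a rate that stays flat suggests the source process is still producing them. why they happen and where they come from still has to be answered by people, not by a counter.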
as sure as i stand here and speak of this, some of you are cringing in your seats…. never been through a real audit, hey? well, no matter… the point is this: as a business manager / director / executive, you should care about the impact that these errors and their rates of occurrence have on your business process. why? because it is costing you money every time a business rule / business perception is broken! i guarantee it!
as a technologist, or an edw practitioner, you have a responsibility to explain when and how often the data doesn’t meet the business requirements. as a business person you have a responsibility to find out why it’s happening, and what is causing it.
the gold is in finding and fixing the business problem: closing the gap, and bringing the business process (whatever it may be) into compliance / alignment with your expectations.
once this is done, the problem data is no longer generated and disappears from new loads, and you now have the start of a full total quality management (tqm) process that involves the data warehouse.
gold, gold, everywhere…
so you see, if you “change / alter / modify / cleanse” the data before you put it in your data warehouse, you and the business users lose out on the opportunity to find and fix problems that exist in the source systems. you are putting a band-aid on the cut, but you are not stopping the bleeding.
to have the real gold, you must have a raw enterprise data warehouse that is integrated by business key… that’s just what the data vault gives you (but don’t forget, you need to use the methodology in conjunction with the model in order to achieve the full effect).
so, if you still believe there’s no value in bad data, then you’ve missed the whole point of business process alignment, and i hate to say this but: go back to school, preferably the school of hard knocks, where a business is attempting real and true alignment of its operational systems.
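as a rough illustration of “load it raw, don’t cleanse it first”, the sketch below stores a record exactly as delivered and records which rules it breaks alongside it. this is not the data vault model itself; the function load_raw, the field names, the rule name, and the source name legacy_billing are all hypothetical.

```python
from datetime import datetime, timezone

def load_raw(record, source, rules):
    """keep the record as delivered and list the rules it breaks, changing nothing."""
    return {
        "raw": dict(record),                      # untouched copy of the source row
        "record_source": source,                  # where the row came from
        "load_date": datetime.now(timezone.utc),  # when it was loaded
        "broken_rules": [name for name, check in rules.items() if not check(record)],
    }

# hypothetical rule: every account balance must have a customer attached
rules = {
    "balance_has_customer": lambda r: r["balance"] is None or r["customer_id"] is not None,
}

loaded = load_raw(
    {"account_id": "A2", "customer_id": None, "balance": 250.0},
    "legacy_billing",
    rules,
)
print(loaded["raw"], loaded["broken_rules"])
```

because the raw values survive the load, the broken rule can be traced back to its source system instead of being hidden by a fix made on the way in.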
thoughts? comments are always welcome.
ps: you can find out more about the data vault model & methodology at: http://datavaultalliance.com