In the coming posts I will be discussing data as an asset, and turning data into information. But in order to get there, we have to step back and first understand this notion of data lake, data swamp, NoSQL, big data, and unstructured data. We need to take a serious look at how all of these systems play together, or whether they can play together at all. The problem with today’s data gathering is just that: we create data junkyards and data swamps by simply dumping more data into a common data store, without regard to its impact on the business, on IT, and the costs that it bears.
What’s the core of the problem?
Well, let’s start from the start. Step 1) Acquire all data from wherever, whenever, simply load it to a centrally managed store, and call it a data lake. This is supposed to be good how? Sorry, but this does not qualify as a data lake. It qualifies as a data junkyard, or a data swamp, or worse yet, data sewage.
That’s right, data sewage. Just a bunch of crap data sitting around stinking up the place. More data does not equal more knowledge, nor does it equal better information, nor does it equal good decisions.
Data just “sitting there” or “captured and stagnant” is actually the sludge that clogs up the information pipelines to the business.
Making changes, thinking in business value…
How do I make business value out of all this data? That’s right, it’s data, not information, at least not until you process it in some fashion. In fact, it’s all just electronic bits that make up the cluttered sewage. That brings us to business lesson #1:
- One must first identify data before it can be utilized, stratified, organized, clustered, arranged, categorized and so on. Rule number 1 for making data useful is to identify it. In other words: tag it.
Wait, tag it? What on earth do you mean?
I mean exactly that: scan the data set for relevant terms, and tag the occurring terms or clustered terms you might be interested in. In other words, you need to know what question you are asking before you can scan all the files in NoSQL / big data in order to properly tag them. But what if you tag them with the wrong tags? Or you ask the wrong question? Well then, you need to scan the files again, and this is exactly what happens in NoSQL / NewSQL platforms all the time, especially with unstructured and multi-structured data sets.
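To make the tagging idea concrete, here is a minimal sketch in Python. It assumes the "files" are just strings held in memory, and the function name `tag_documents` and the sample documents are made up for illustration. Real platforms do this at a very different scale, but the shape of the problem is the same: the tags depend on the question, so a new question means a new full scan.

```python
import re
from collections import defaultdict


def tag_documents(documents, terms):
    """Scan every document and return {term: set of ids of documents containing it}."""
    # One whole-word, case-insensitive pattern per term we are interested in.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term, pattern in patterns.items():
            if pattern.search(text):
                index[term].add(doc_id)
    return dict(index)


# Hypothetical unstructured "files".
documents = {
    "doc1": "Customer 4471 returned an invoice for a damaged shipment.",
    "doc2": "Shipment delayed; the customer asked for a refund.",
    "doc3": "Quarterly invoice totals for the finance team.",
}

# First question: which files mention customers or invoices?
tags = tag_documents(documents, ["customer", "invoice"])

# Different question, so every file has to be scanned again.
retagged = tag_documents(documents, ["shipment", "refund"])
```

Notice that nothing from the first pass helps with the second one. That is the cost of asking the wrong question first.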
Hint: with structured data sets, the column names are the tags or metadata, and if named properly, they identify what the column values are supposed to represent. That means that column names like “details” and “information” are no good, and should be thrown out (hence: data modeling principles 101, people!).
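As a rough illustration of treating column names as tags, the sketch below flags names that say nothing about what the values represent. Only "details" and "information" come from the hint above. The other entries in `VAGUE_NAMES`, the function name, and the sample columns are assumptions added for the example.

```python
# Names that carry no meaning as tags. "details" and "information" are the
# examples from the text; the rest are assumed additions for illustration.
VAGUE_NAMES = {"details", "information", "info", "data", "misc", "value"}


def vague_columns(column_names):
    """Return the column names that do not identify what their values represent."""
    return [name for name in column_names if name.strip().lower() in VAGUE_NAMES]


# Hypothetical table definition.
columns = ["customer_number", "details", "invoice_date", "information"]
flagged = vague_columns(columns)
```

A check like this is no substitute for data modeling, but it shows how little a name like "details" gives you to work with.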
Wait, that’s what web search engines have been doing for years: collecting tags. You mean to say we need to begin utilizing web search technology on top of big data and NoSQL to make use of it? Yes. It goes back to this fundamental principle:
You can’t measure what you don’t identify.
Isn’t there a better way?
Yes. When thinking about business value, the first tag we should always consider is the business natural key. Hopefully this is a real-world, understandable natural key that uniquely identifies a set of data, otherwise known as a record. But wait, isn’t that what we were doing in relational technology all these years? Yes, it’s the hope, dream, and goal of all good data modelers, but unfortunately these information engineering techniques have been lost to the ages. They don’t seem to be taught in schools anymore. The vast majority of source systems (ERP, CRM, financials, etc.) are coded to use meaningless surrogate numbers as their business keys.
Using a surrogate number as a business key is only valid or helpful if the following criteria are met:
- The number is truly unique across the entire business.
- The number (once assigned) never changes, no matter which system the data is moved to.
- The number is never re-used, even if the data set it belonged to is deleted.
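The criteria above can be turned into a simple audit. This is only a sketch under stated assumptions: it presumes you can pair each surrogate number with some real-world natural key, and that deleted and historical records are included in the input. The tuple layout, the function name `check_surrogate_keys`, and the sample data are all hypothetical.

```python
from collections import defaultdict


def check_surrogate_keys(observations):
    """Audit (system, surrogate_key, natural_key) tuples, history included.

    Returns two dicts:
      ambiguous: surrogate keys that point at more than one real-world entity
                 (not unique across the business, or re-used after a delete)
      unstable:  entities that carry more than one surrogate key
                 (the number changed when the data moved between systems)
    """
    entities_per_key = defaultdict(set)
    keys_per_entity = defaultdict(set)
    for _system, surrogate_key, natural_key in observations:
        entities_per_key[surrogate_key].add(natural_key)
        keys_per_entity[natural_key].add(surrogate_key)
    ambiguous = {k: e for k, e in entities_per_key.items() if len(e) > 1}
    unstable = {e: k for e, k in keys_per_entity.items() if len(k) > 1}
    return ambiguous, unstable


# Hypothetical records gathered from two source systems.
observations = [
    ("crm", 1001, "ACME-CORP"),
    ("erp", 1001, "GLOBEX"),     # same number, different customer
    ("erp", 2002, "ACME-CORP"),  # same customer, new number after a move
    ("crm", 3003, "INITECH"),
]

ambiguous, unstable = check_surrogate_keys(observations)
```

With only the numbers and no natural key to compare against, none of these checks is possible. A number that was re-used after a delete looks the same as a non-unique number here, because both show up as one key pointing at two entities.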
OK, at the end of the day, what does the sequence number really mean? Or the better question: how do you search on just a sequence number? It is useless if you don’t know the number to begin with.
There will be more blog entries on turning data into information, coming soon. In conclusion, I’d like to say: the importance of the business key and the remaining “tags” that help in identifying and classifying unstructured and multi-structured information should not be overlooked. It is the critical first step in turning your data swamp into a stratified data lake.
Thoughts? Comments? Questions?