welcome back to those of you following my series of posts – thank you. for those who haven’t read my posts yet (yes, that’s you): why not? (just kidding.) this is the third in a series of posts on big data, data lakes, and turning data into an asset for your organization. in this entry i will discuss business keys and data vault, with a focus on machine generated files (which have no good natural business key – or so you say).
machine generated log files have no keys….
well – i beg to differ.
the keys aren’t necessarily business driven; they are machine driven. but somewhere along the line, a business user or developer had to decide what to write to the log, how the log would be structured, and what might be “unique” about the log entries.
ok, so where’s the beef?
the beef, as it were, is in the details. each log file creates structured content. some columns (like url, referrer, etc.) are what we call multi-structured, meaning each “entry” can have multiple levels of defined detail. in other words, they can be described by discrete or finite mathematics (in fact, they are: urls are defined by an rfc – a standard)… ohhh, there i go again with that stupid standards word… get over it, people – you need to understand the world is full of standards, and it’s one of the only ways we can provide stepping stones for future innovation and optimization.
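to make that “multi-structured” point concrete, here is a minimal sketch of how a single url column decomposes into several levels of defined detail (the url itself is hypothetical, used only for illustration):

```python
from urllib.parse import urlparse, parse_qs

# a single "column" from a log record – hypothetical url for illustration
url = "https://shop.example.com/catalog/item?id=42&ref=email"

u = urlparse(url)
print(u.scheme)           # https
print(u.netloc)           # shop.example.com
print(u.path)             # /catalog/item
print(parse_qs(u.query))  # {'id': ['42'], 'ref': ['email']}
```

one flat string, four (or more) discrete, standards-defined pieces – which is exactly why this data is structured, not unstructured.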
sorry, off on my soap box again. let’s get back to the topic at hand:
how do i key a log file record?
well – good question. ultimately, you would need to hash the entire row, using sha-1 or md5 for example, to come up with a unique key. but this isn’t good enough once we introduce two web-log capturing servers. if they capture the same exact url at the same exact time, and provide the same exact response, they will generate a key collision.
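a minimal sketch of that collision (the log line and the two-server setup are hypothetical; any hash of identical input behaves the same way):

```python
import hashlib

def row_hash(log_line: str) -> str:
    """hash the entire raw row to produce a candidate unique key."""
    return hashlib.sha1(log_line.encode("utf-8")).hexdigest()

# two capture servers log the exact same request at the exact same time
line_from_server_a = '198.51.100.7 [10/oct/2023:13:55:36] "GET /index.html" 200 2326'
line_from_server_b = '198.51.100.7 [10/oct/2023:13:55:36] "GET /index.html" 200 2326'

# identical input yields an identical sha-1 digest: a key collision
print(row_hash(line_from_server_a) == row_hash(line_from_server_b))  # True
```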
but wait a minute… isn’t that what we truly want? hmmm – no duplicates when we store these logs in hadoop, for example, does sound tempting, but no: at the end of the day we want to record the fact that different servers each captured their own log record.
ohhh – i thought the server ip was part of the log record? it is… but if you have a set of servers in a cluster, all attached to the same external ip and logging to different shared directories, then you can easily end up right back in this situation of duplicate hashes.
ok – enough on how to “solve” log file recording problems – i’ve got more, but it gets too technical.
let’s get back to business value here – the nature of our discussion.
you must accept these fundamental tenets to make sense of “machine generated” data, or to treat it as an asset on the books (no, i never worked at google; i actually built log file warehouses back in 1997, before it became a fad or a cool thing to do):
- a log file’s “business keys” are, in fact, technical keys – and being technical keys, they are machine generated
- a log file’s “technical keys” are / must be multi-part (composite) keys, made up of different pieces of the record
- to be truly unique, the complete row must be hashed – though that depends on the problem you are trying to solve
the questions the business needs to ask:
- when is a single log file entry important or valuable to the business? if this is ever the case, then each row must be stored with a unique machine generated hash key
- when is the value of the aggregate more important? if this is the case, then the business needs to construct additional business keys – for instance, based on session id, or on cookie + ip + browser + date/time, etc.
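a hedged sketch of such a constructed aggregate key, assuming python and the cookie + ip + browser + date grain mentioned above (all field values are hypothetical):

```python
import hashlib

def aggregate_business_key(cookie: str, ip: str, browser: str, date: str) -> str:
    # join the grain-defining parts with a delimiter, then hash to a
    # fixed-width key that is stable for joins and tracking over time
    composite = "|".join([cookie, ip, browser, date])
    return hashlib.md5(composite.encode("utf-8")).hexdigest()

key = aggregate_business_key("abc123", "198.51.100.7", "firefox", "2023-10-10")
print(key)  # one stable 32-character key per visitor/day grain
```

the same four inputs always produce the same key, so every row in the aggregate lands under one identifier the business can track.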
90% of the time, businesses extract value from web logs or machine generated data through aggregate examination. they change the grain to aggregate the interesting data together, and it is at that point that they must assign a machine generated business key that makes sense to the business for tracking.
when this is not the case, unique machine generated business keys must be assigned to the machine generated data (the entire row) for unique identification within the set. sometimes it is necessary to add a computed column (such as server name) to enrich the machine generated data so that the machine generated business key can remain unique.
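a sketch of that enrichment, assuming a server name is available as a computed column (the server names and log line are hypothetical):

```python
import hashlib

def unique_row_key(server_name: str, log_line: str) -> str:
    # prefix the computed column (server name) so that identical rows
    # captured by different servers in the cluster hash to different keys
    return hashlib.sha1(f"{server_name}|{log_line}".encode("utf-8")).hexdigest()

line = '198.51.100.7 [10/oct/2023:13:55:36] "GET /index.html" 200 2326'
print(unique_row_key("web-01", line) != unique_row_key("web-02", line))  # True
```

the collision from earlier disappears: the same line captured by two servers now yields two distinct keys, while a re-load of the same server’s line still hashes to the same key.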
at the end of the day, attaching business value to machine generated data requires machine generated business keys, because the flow rates and volumes make it prohibitive for any human to attach a natural-world smart key.
when these situations arise, the higher the level of aggregation, the more value the aggregate results tend to have for the business – and at the end of the day, it will be up to a human to attach a meaningful natural-world business key to the high-level aggregate results (in other words, picking the gold out of the sludge at the bottom of the lake, then tagging it with where you found it, how you found it, and so on).
hope this helps,
ps: there are no impossibilities, only lack of foresight and vision