this post covers my opinion on what people today call a data lake. i will discuss what i think needs to be done to clarify the terms, and why, if left unmanaged, a data lake quickly turns into a data junkyard or polluted data lake. on the flip side of the coin, there is still value today in leveraging your existing relational database for your data vault data warehouse.
how are marketing and industry defining data lake?
well, definitions abound. let’s take a look at some of them before i give you a sense of what i call a data lake:
“that is the concept behind booz allen hamilton’s “data lake,” a groundbreaking invention that scales to an organization’s growing data, and makes it easily accessible.” http://www.boozallen.com/media/file/ta_datalake.pdf
umm, really? do i really need to believe that this concept is a) groundbreaking, and b) invented by booz allen hamilton? i think not. sorry, bah, but it isn’t your concept, you didn’t invent the term, and it certainly is not groundbreaking conceptually. it has been around for a long, long time; we just used to call it a “staging area.”
as if that weren’t enough to swallow, let’s take a look at cap gemini:
the challenge for it is that as the value of information has increased so the motivation of the business to accept slowly evolving single solutions has decreased. the business data lake looks to solve this challenge by using new big data technologies to remove the cost constraints of data storage and movement and build on the business culture of local solutions. it does this within a single environment – the business data lake.
the business data lake is not simply a technology move. it is about changing the culture of it to better match the business culture. the historical battle between business unit independence and the centralizing ambitions of it and corporate management has proven to be an unwinnable war.
the business data lake addresses the challenge by building a single culture and concentrating on the areas that deliver true value. http://www.gopivotal.com/sites/default/files/the_principles_of_the_business_data_lake_2013-12-02_v07_web.pdf
oh my goodness, now cap gemini is starting yet another divide between i.t. and business!! they are promoting a concept of business data lake. well, i beg to differ yet again. this is not a data lake (business or otherwise). all they do here is throw a bunch of marketing terms at a wall, and expect something to “stick”.
exactly what is a business data lake? they don’t define it. furthermore, the challenge they describe, and the conclusion they draw from it, is actually solvable by a proper business intelligence system. it has nothing to do with a real data lake.
we have been trying to align business with i.t. for decades, and it has nothing to do with “data lakes.” delivering true value? that’s the job of a really good business intelligence system!!
then, i ran into this: (both of the following definitions actually begin to make some sense)
a data lake is an information system consisting of the following 2 characteristics
1) a parallel system able to store big data
2) a system able to perform computations on the data without moving the data http://datascience101.wordpress.com/2014/03/12/what-is-a-data-lake/
as quoted in forbes: (cross-linked here) https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/
the difference between a data lake and a data warehouse is that in a data warehouse, the data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed.
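the distinction in that last quote is often described as schema-on-write versus schema-on-read. here is a minimal python sketch of the idea, not anyone's actual implementation; the column names, the in-memory list standing in for the lake, and all function names are hypothetical and exist only for illustration:

```python
import json

# hypothetical warehouse schema: the columns are fixed before any data arrives
WAREHOUSE_COLUMNS = ("customer_id", "amount")

def load_into_warehouse(record: dict) -> tuple:
    """Schema-on-write: reject anything that does not fit the predefined columns."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    return tuple(record[c] for c in WAREHOUSE_COLUMNS)

def land_in_lake(raw_line: str, lake: list) -> None:
    """Land the raw text untouched; no structure is enforced on arrival."""
    lake.append(raw_line)

def read_from_lake(lake: list, field: str) -> list:
    """Schema-on-read: structure is applied only when a question is asked."""
    values = []
    for raw_line in lake:
        try:
            doc = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # not JSON, so it cannot answer this particular question
        if isinstance(doc, dict) and field in doc:
            values.append(doc[field])
    return values
```

the warehouse loader decides at the point of entry what a record may look like, while the lake accepts anything and defers that decision to whoever reads it later.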
what is my opinion?
a data lake is closer to a combination of the above two definitions. in my opinion, a data lake is a place where raw data lands with no predetermined structure enforced on it. however, this is where the definition shifts between data lake and data junkyard.
a data junkyard is a “raw data dumping ground”: a bunch of haphazard files loaded to an mpp store (think hdfs), with metadata on top that dictates where each file lives, but with no business value, no technical value, and no one who understands what they truly have. new data simply shows up and is “archived,” for lack of a better word.
there is a fine gray line between data junkyard and data lake. to turn a data junkyard into a data lake, two things must happen:
1) the data itself (possibly semi-structured, unstructured, or structured) needs to be identified and categorized, because if you don’t identify what you have, then how in the world are you going to make business value out of it?
2) questions need to be asked of the data, through either data mining or queries. some form of question must be aimed and fired at the data, to try to apply the data in a business context downstream. that may happen several hops later, which would then qualify a “data lake” as a data staging area. in other words, it’s acceptable to assign the data processing value as well as downstream business value.
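the two conditions above can be sketched as a tiny catalog check. this is only an illustration under my own assumptions: the `LandedDataset` class, its fields, and the helper functions are hypothetical names, not part of any real catalog tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LandedDataset:
    """Hypothetical catalog entry for one raw data set that has landed."""
    path: str                                   # where the file lives (all a junkyard knows)
    category: Optional[str] = None              # condition 1: identified and categorized
    questions_asked: List[str] = field(default_factory=list)  # condition 2: actually used

def identify(ds: LandedDataset, category: str) -> None:
    """Record what the data set actually is."""
    ds.category = category

def ask(ds: LandedDataset, question: str) -> None:
    """Record a query or mining question aimed at the data set."""
    ds.questions_asked.append(question)

def classify(ds: LandedDataset) -> str:
    """Both conditions must hold before the data counts as part of a lake."""
    if ds.category and ds.questions_asked:
        return "data lake"
    return "data junkyard"
```

a data set that has only been identified, but never queried or mined, still classifies as junkyard here, which matches the requirement that both things must happen.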
data slew, data swamp? conclusions?
but with no identification and no utilization, the data is sitting in a data junkyard (or data slew, or data swamp); all of these names apply here.
so in the end, the data vault model, methodology, and architecture all dictate the use of big data, and today i advocate the use of a data lake (within the bounds of the definition offered above) as a staging area for data sets that arrive too quickly, in too much volume, or with too much structural variation for traditional relational systems to handle. i also advocate the use of the data lake for data mining activities and deep analysis, for questions that cannot easily be expressed as ad-hoc sql queries.
in the end, no matter what you call it (data lake, data junkyard, data slew, data swamp, data staging area), they are all the same thing: a place for data to land while you decide what to do with it downstream. it just so happens that i.t. loves acronyms and new bandwagons, so here we go again: another ride on what was once known as the merry-go-round.