this is an introductory look at data warehousing as an industry and it’s application or use with the hadoop base platform. i cover only some of the fundamental challenges, benefits, and issues we face when we are told to move existing data warehouses on to a hadoop base platform.
the buzzword: hadoop and big data
the current “buzz” is all about big data, and whenever anyone mentions big data, immediately a hadoop based storage system comes to mind or in to the discussion. it seems as if businesses “all of the sudden” have severe memory loss regarding relational database engines, and the hundreds of millions of dollars in sunk investment in to their infrastructure.
some immediately begin to ask the questions:
- can i get my data warehouse on hadoop?
- why do i need a data warehouse if i have hadoop?
- can’t hadoop just “replace” my relational database engines?
and of course, they ask probably a hundred more. the smart businesses will realize that hadoop is a complimentary platform to your existing infrastructure. it is not a replacement for all of your existing systems. by the way, hadoop is a management algorithm for large, scalable data sets. it is not even a full data management engine!
hadoop is a free, java-based programming framework that supports the processing of large data sets in a distributed computing environment. it is part of the apache project sponsored by the apache software foundation. http://searchcloudcomputing.techtarget.com/definition/hadoop
so to really bring the focus home, comparing hadoop with rdbms (relational database management systems) is like comparing apples to oranges.
in another post i will dive deeper in to a basic overview of hadoop and the solutions “built in, around, and on-top” of the processing engine (things like hive, hivedb, pig, cloudera, mapr, hbase, and so on). for now, we will go back to the topic at hand.
can i get my data warehouse on hadoop?
yes, but it’s not very easily done, and it may not be the right platform for all edw processing! in other words, there are some things that can and probably should be left on a relational database system (like teradata, oracle, sybase, sqlserver, mysql, db2 udb etc…).
some of the key points of the hadoop based technology include: (some of these points are paraphrased from the following document: http://infolab.stanford.edu/~ragho/hive-icde2010.pdf)
- the pure-store is a write-once, read-many solution (meaning no updates and no alterations of existing “structures” / files)
- it is a full file-store, without any referential integrity (yes, this is what you give up to get performance – all that management overhead out the window in order to achieve fast performance)
- partitioning is wonderful, but the “columns” that the data is partitioned by, do not exist in the data set anymore – they are put in as part of the directory tree structure. (i’ll explain why this is interesting in a bit)
- partitioning is truly file splitting in to separate physical directories (sometimes separate physical machines)
- hadoop is a load and go file management system – meaning you copy raw files in to the hadoop platform, there is no such thing as “etl” to getting files in to hadoop. they are copied in, and then the transformation rules must be written in code.
- right, el-t is truly equal to “file-copy (el), followed by insert into new file from old file(s) combined with transformation rules” – yes, to transform old data, you must make/create new files.
some benefits of this approach include:
- rapid loading (basic file copy, how much easier/faster can it get)?
- rapid transformation (because there is no referential integrity, it is easy to create new “aggregations” of data)
- distributed computing with mpp / shared nothing algorithms
- easy easy easy to create new “aggregates” and create a feeling of “self-service-business-intelligence” for users that can “write the code” to get what they want
- automated compression (in some hadoop implementations – and some commercial systems) file compression is built in, reducing the amount of storage necessary to house the data
- no delta processing necessary! in fact, because it is a file system, “load and go” is about the only thing that can be implemented. this can be seen as both a benefit and a drawback. if you want “delta sets”, you have to copy the new file in, and run algorithms for deltas that produce new delta files.
- schema-less (to a degree) storage. i say to a degree, because you still have to define the columns, and in some cases the base data types for the file in order for the code to “map” to the elements. on the other hand, there are some types of files that just work natively (already mapped), like xml. and still other documents (like text files) that simply work based on internal searching for tags. schema-less is both a benefit and a drawback.
- no sql – well, you don’t have the limitations of sql – however, you don’t have the benefits of sql access either.
- no “normalization” to speak of – it is not necessary to normalize data sets, in fact, it’s discouraged – to avoid “joins” across node sets.
ok, so fundamentally hadoop is a managed file store with capabilities for fully distributed file processing rules. sounds almost like ab-initio (one of the long-time etl tools that have been in the market place – they did the same thing with their co-operating system, and etl scripting code).
some drawbacks of this approach:
- no referential integrity
- lack of proper partitioning by the architect/engineer or even end-user who simply “loads” a file, can easily cause havoc (a single hot node with all the activity). architecture and file layout become critical but with lack of governance or management, this is lost on the business. this is something that teradata does really really well (manage the partitioning for you according to the data model architecture (primary index) selection.
- well: lack of proper partitioning leads to serious loss of performance
- constant “re-partitioning” as data sets are dumped/loaded in *may* be required, which is why you see a lot written about hadoop “addressing the machine generated data sets” so well. machine generated data sets can have a single consistent partition, where “data warehousing data sets” may vary over time – causing the need to re-partition. which in the case of hadoop means: you guessed it! moving all the files across the network to put them on the differently distributed machines!
- lack of “data modeling architecture” – lets face it, “users” are simply lazy, and if they don’t have to do data modeling to make a system work, well then, chances are pretty low that they will do it. this is a drawback from a governance and management side. with hadoop, you can easily end up with hundreds of files, duplicated all across the system, with no real idea as to what columns are there, who’s using what, and how it’s being applied / loaded / used. remember: i’m talking about files that are “not machine generated”. in a data warehousing sense, these files are source system, operational files, xml external files, and so on. and if we switch the edw to self-service, well then – where’s the governance on the business users who create new files from existing files that are already stored?
- ok, lack of governance (governance and management are keys to successful use and application of hadoop). look – if you don’t have it in the referential integrity layers, then you must have it in the human management layers. a “dba for hadoop systems” is just as important as having a dba role for relational systems.
- little to no joins – well, just as this is a positive in some cases, it is also a negative. this leads to replicated data in multiple file sets. the more data is replicated, the more it can cause /create disparate bi answers!!
please don’t mis-understand me. i am not saying that hadoop is a bad solution, quite the contrary. my point here is as follows. if you are going to use hadoop or a hadoop based solution then:
- setup a dba role for the engine
- setup governance procedures for “loading data in to the system” and for accessing data / creating new data sets in the system (process driven / architecture driven management)
- link hadoop platform with your rdbms engine – using the power of each respectively.
- attempt to make the hadoop platform “seamless” to the business, so they don’t know/don’t care where the data is coming from (relational or hadoop)
- augment your existing warehouse with hadoop solutions
- use hadoop for document stores, and xml processing
- use a tool like pentaho kettle, and it’s bi reporting platform for gui access to hadoop
- control who accesses the hadoop backend – ensure they are trained properly to write mapreduce code
- use a versioning system (for the file definitions, and the code that accesses them).
why do i need a data warehouse if i have hadoop?
data warehouses are not “on hadoop or on relational systems” strictly speaking. in fact, this shouldn’t even be a question (but i’ve heard it from customers). data warehousing is a concept of storing historical data, then analyzing it for trends and analytics. hadoop is a platform for storing distributed files. so, you do need a data warehouse if you want business intelligence or analytics. whether you put the entire system on hadoop platform, or you put the entire system on relational – it won’t matter.
in fact, i recommend you leverage a mix of both platforms (hadoop and rdbms) for a seamless business intelligence environment.
hadoop in and of itself is not a means to an end – it is not there to “replace” the data warehouse. even with a full document only solution stored on hadoop, you still have all the other issues (temporal access, delta processing, column management, aggregation, transformation) etc.. that must be executed in order to make the data meaningful.
can’t hadoop just replace my relational database engine?
no, it can’t. not in it’s raw form. in order to “replace” relational database engines, there is one more piece of the puzzle that is necessary: metadata management, and structured access. this is where projects like hive, hbase, hivedb, pig, and so on come in. including some commercial vendors like cloudera, and mapr for instance.
they offer (among other benefits) the additional metadata management layer, and in some cases, sql – like access (similar based sql statements that are then translated to hadoop execution code (java, perl, python, ruby, etc..)
hadoop (in its base form) again, is just a distributed file management platform.
now, there are some technologies (that i just mentioned) which allow you to move off relational databases and on to the “hadoop system”, but there are benefits and drawbacks to those as well. the largest reason (i’ve read about) for businesses to move is “to get away from the referential integrity restrictions” which leads to massive performance gains. but again, they experience some of the drawbacks that are listed above.
conclusions and thoughts
in other entries i will dive in to some of the different meta management layers (like hive and hbase) and discuss the pros and cons of specific data modeling components there. my thoughts are as follows: hadoop is an interesting platform, it allows us to easily construct a “persistent staging area” without moving data in to a relational system first. in other words, simply “copy” your “staging file” in to hadoop, partition it by load date, and voila – you have a managed set of persistent staging data.
yes, at that point you have to write code to get your data “out”, but the gui technology and etl vendors are working furiously to connect to hadoop solutions. in fact, pentaho kettle (etl engine) does this already, and there are jdbc connectors that allow you semi-sql access to hadoop file stores (particularly through hive or hbase which is another story).
hadoop is a key-value store at the end of the day, with a single row key, with no referential integrity.
i don’t recommend you throw out your relational data warehouse, just as i don’t recommend you “simply decide to move everything to hadoop” – it’s too much to bite off at once. i would suggest integrating a hadoop store and making the technology seamless. however, with the addition of hadoop technology i also recommend you add governance and management so that you do not end up with a “data junkyard” in a hadoop storage system.
thoughts? omments? ideas? i’d love to hear from you.
ps: get free training and video lessons at http://datavaultalliance.com/