i am a newbie to hadoop – so to all those die-hard hadoopians, please forgive my ignorance, and correct me where i’m wrong. this post is a first look (opinion only) at data vault models on hadoop environments.
hadoop way over-simplified:
for those of you unfamiliar with hadoop, you can think of it in this very oversimplified manner:
a set of code running in parallel on a grid of machines, with all the embedded rules for distributed computing handled for you under the covers. within hadoop is hdfs (the hadoop distributed file system), which takes documents, xml, files, fields (essentially, data) and stores them across the nodes. also in the hadoop family is hbase, which sits on top of hdfs: a “semi” or quasi-structured environment that stores data in essentially key=value pairs.
in other words: hadoop (the conglomeration) is a large distributed data management / storage / retrieval system with a key=value pair based architecture. you can think of it as an open-source, columnar-like data management system.
however, the key=value pairing goes deeper than that. if you want to know more about hadoop, there are tons and tons of resources available that discuss this subject in detail.
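to give a feel for what “deeper than key=value” can mean, here is a minimal sketch in plain java (it does not use the real hbase api). it assumes the usual hbase-style layout of row key, then column family and qualifier, then value, and it leaves out timestamps/versions entirely. the class name `KeyValueTableSketch` and the sample customer data are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// hypothetical, simplified stand-in for an hbase-style table (NOT the hbase API):
// row key -> "family:qualifier" -> value, with keys kept in sorted order.
class KeyValueTableSketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    // store one cell; the row is created on first use
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    // fetch one cell, or null if the row or column does not exist
    String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    // rows may have different numbers of columns (the "semi-structured" part)
    int columnCount(String rowKey) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? 0 : row.size();
    }

    public static void main(String[] args) {
        KeyValueTableSketch customers = new KeyValueTableSketch();
        customers.put("CUST-001", "details", "name", "acme ltd");
        customers.put("CUST-001", "details", "city", "denver");
        customers.put("CUST-002", "details", "name", "globex");

        System.out.println(customers.get("CUST-001", "details", "city"));
        System.out.println(customers.columnCount("CUST-002"));
    }
}
```

the point of the sketch is only that each “value” is itself a small map of columns, so two rows in the same table do not need the same shape.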
ok, so where does this leave data vault modeling?
before i go on and discuss the dv modeling aspects, let me say one more thing: as far as i can tell, today there is no “common data access layer” for hadoop like there is with sql and relational database systems. to work with, in, and around hadoop, you must write java code, or invest in a tool that writes java code for you, or invest in a tool that interfaces with hadoop (in essence writing run-time java code).
so: the data vault model. well, the dv model in a “physical” sense is great on relational systems: it gives the designer all the goodness of flexibility, scalability, traceability, and so on that they would want in a data integration project. the data vault model also brings loads of goodness to the table when relational database management systems are used in an mpp and distributed data fashion. a dv model can be, and is, split by both horizontal and vertical partitioning mechanisms, which, when combined with highly tuned parallel engines and fast infrastructure, can allow rdbms systems to operate at maximum performance capacity.
but what does this have to do with hadoop?
well, i’ve spent the last few months reading a bit of literature, trying to understand hadoop, and i’m getting to the point now where i’m nearly ready to install an instance and play with a dv model, per se, on that instance. so, more on that later….
if you want to use hadoop – then by all means, go ahead and do so! if you want to use dv on hadoop then there are a few things you should realize before getting going:
1) your dv model should remain “mostly” logical
2) you should look into generating the hadoop data access code! on the physical side, you will need a mapper class, a reducer class, and (optionally) a combiner class, each written around the logical structure of the model.
3) you can translate the data vault components (hub, link, satellite, and the applied derivatives) using java inheritance: possibly writing base routines that apply to all hubs, all satellites, and so on, then overriding each with the different structural mappings.
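as a rough illustration of point 3, here is one way the inheritance idea might look in plain java. this is a sketch under my own assumptions, not a tested design: the class names (`VaultComponent`, `Hub`, `Satellite`, `CustomerHub`, `CustomerSatellite`), the field names (`customer_number`, `load_date`, `record_source`) and the `|` separator are all hypothetical, and the business key is used directly as the key with no surrogate or hash key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// hypothetical base routine shared by every data vault component
abstract class VaultComponent {
    abstract String keyFor(Map<String, String> record);
    abstract String valueFor(Map<String, String> record);

    // common logic: turn one source record into one key=value pair
    final Map.Entry<String, String> toKeyValue(Map<String, String> record) {
        return new SimpleEntry<>(keyFor(record), valueFor(record));
    }
}

// base routine for all hubs: key = business key, value = load metadata
abstract class Hub extends VaultComponent {
    abstract String businessKeyField();

    @Override
    String keyFor(Map<String, String> record) {
        return record.get(businessKeyField());
    }

    @Override
    String valueFor(Map<String, String> record) {
        return record.get("load_date") + "|" + record.get("record_source");
    }
}

// base routine for all satellites: key = business key + load date,
// value = the descriptive attributes
abstract class Satellite extends VaultComponent {
    abstract String businessKeyField();
    abstract String[] attributeFields();

    @Override
    String keyFor(Map<String, String> record) {
        return record.get(businessKeyField()) + "|" + record.get("load_date");
    }

    @Override
    String valueFor(Map<String, String> record) {
        List<String> values = new ArrayList<>();
        for (String field : attributeFields()) {
            values.add(record.get(field));
        }
        return String.join("|", values);
    }
}

// concrete components only override the structural mapping
class CustomerHub extends Hub {
    @Override
    String businessKeyField() { return "customer_number"; }
}

class CustomerSatellite extends Satellite {
    @Override
    String businessKeyField() { return "customer_number"; }

    @Override
    String[] attributeFields() { return new String[] {"name", "city"}; }
}

class VaultInheritanceDemo {
    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("customer_number", "C1");
        record.put("load_date", "2012-01-01");
        record.put("record_source", "crm");
        record.put("name", "acme");
        record.put("city", "denver");

        System.out.println(new CustomerHub().toKeyValue(record));
        System.out.println(new CustomerSatellite().toKeyValue(record));
    }
}
```

the appeal of this shape is that a new hub or satellite is only a few lines of mapping, which is also what would make the classes easy to generate from the logical model.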
hadoop under the covers will then map the routines to the key=value pairings (at least, this is what i understand so far), and as i get further into implementation, i might change my mind!
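to make the mapper / combiner / reducer idea a little more concrete, here is a small simulation in plain java of how a hub load might flow through those three phases. it deliberately does not use the hadoop classes, so it runs anywhere; a real job would extend hadoop's own mapper and reducer base classes instead. everything in it is an assumption for illustration: the `HubLoadSketch` name, the `customer_number,load_date,name` line layout, and the rule “keep the earliest load date per business key”.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// hypothetical simulation of map -> combine -> reduce for loading a hub;
// plain java only, no hadoop dependency
class HubLoadSketch {

    // map phase: one source line "customer_number,load_date,name"
    // becomes one (business key, load date) pair
    static Map.Entry<String, String> map(String line) {
        String[] fields = line.split(",");
        return new SimpleEntry<>(fields[0], fields[1]);
    }

    // used as both combiner and reducer: keep the earliest load date per key.
    // iso dates (yyyy-mm-dd) sort correctly as plain strings.
    static Map<String, String> earliestPerKey(List<Map.Entry<String, String>> pairs) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            result.merge(pair.getKey(), pair.getValue(),
                         (a, b) -> a.compareTo(b) <= 0 ? a : b);
        }
        return result;
    }

    // each inner list stands in for one input split handled by one node
    static Map<String, String> run(List<List<String>> splits) {
        List<Map.Entry<String, String>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, String>> mapped = new ArrayList<>();
            for (String line : split) {
                mapped.add(map(line));
            }
            // combiner: local pre-aggregation before data leaves the node
            shuffled.addAll(earliestPerKey(mapped).entrySet());
        }
        // reducer: final aggregation across all nodes
        return earliestPerKey(shuffled);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("C1,2012-01-03,acme", "C2,2012-01-01,globex", "C1,2012-01-02,acme"),
            Arrays.asList("C1,2012-01-05,acme", "C3,2012-01-04,initech"));
        System.out.println(run(splits));
    }
}
```

the same routine can serve as combiner and reducer here only because “take the minimum” gives the same answer no matter how the pairs are grouped; a rule without that property would need separate logic for the two phases.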
anyhow, please tell me what you think of this entry – if you want more, if you want some examples, etc… leave a comment below!
ps: you can find out more about data vault at: http://datavaultalliance.com/training