As you know, Sanjay and I have been hard at work on this topic for the last two years. I am preparing to release some of the findings soon; in fact, in this blog entry I will share some thoughts and the difficult questions we are proposing to the marketplace. I encourage you to offer your feedback on the questions we pose by replying to this entry. Note: this is the result of discussions and hands-on work that both Sanjay and I have done over the past two to three years. These are my opinions.
Defining our #NoSQL platform
There are so many #NoSQL platforms, and now NewSQL platforms, in the market that I thought it wise to discuss the platforms Sanjay and I have actually been working with, and why. We have several platforms we've been testing. The first is pure Apache Hadoop and HDFS; let it be known that when I say "Hadoop" in this post, that is the context in which I am discussing the tests. When I say "Hadoop platform" I am referring to the additional components such as Sqoop, Flume, and the rest of the applications that sit on top of Hadoop. When I say "the Hadoop ecosystem," I am referring to all of the above plus the hardware we are running on.
Our second platform is a stripped-down version of Cloudera; Sanjay removed many of the additional tools that we did not wish to test with.
Our third platform is Hadoop + HBase, specifically for HBase testing purposes.
Our fourth platform is Hadoop + Hive (internal and external tables).
Our fifth platform is now Hadoop + Stinger.
What are we trying to accomplish?
We are trying to put the entire enterprise data warehouse (including Data Vault 2.0 modeling, analytics, real-time feeds, batch feeds, cleansing, business rules, managed self-service BI, quality, metrics, control/security, access, etc.) on the Hadoop ecosystem. Our primary focus is to answer the following questions:
- What pitfalls are there when compared to a standard / traditional relational database for the EDW ecosystem?
- What positives / good things are there when compared to a traditional relational database for the EDW ecosystem?
- What best practices should be standardized in this environment?
- What happens to data modeling in the Hadoop ecosystem that we need to be aware of?
We have about 50 more questions, but I won't bore you with the list. There are other, far more pertinent questions, posted further down in this blog, that I respectfully ask you to reply to.
A bit of history around the Data Vault 2.0 System of BI
For those who don't know, DV2 is a system of business intelligence that includes agility, CMMI, Six Sigma, KPAs/KPIs, PMP, best-practice data modeling, best-practice data integration, separation of hard and soft business rules, and more, all for the purpose of building an enterprise data warehouse in an incremental fashion with rapid build-out.
OK, back to the point…
There are pitfalls to these Hadoop-based ecosystems that vary with the physical setup and implementation choices. There are over 50 different significant pitfalls (some are mitigated by other technology, some are not). Below are a few; again, keep in mind these depend on the physical layout, storage, and implementation chosen.
- HDFS does not allow update-in-place of "files" (this can actually be seen as a benefit).
- HDFS does not understand indexing; i.e., every access is a full file scan (see the sketch after this list).
- Hadoop has a big learning curve for non-programmers.
- HDFS can implement many different "physical storage formats," which can either hurt or help performance.
- Hadoop on MPP still exhibits the data co-location and cross-node "join" problems.
- Hadoop itself doesn't understand joins without writing map and reduce code.
- Hadoop's schema-on-read makes it extremely difficult to optimize performance for ad-hoc queries.
- The general population doesn't understand schema-on-read vs. schema-on-write.
- Data modeling doesn't/can't/won't apply to all data absorbed by Hadoop (think Word docs, images, movie files, sound files).
- Every Hadoop-based "solution" is different from the last, meaning consistent design doesn't carry across "Hadoop platforms."
- There is no governance over Hadoop; anyone with a login has access to everything in HDFS (the Hadoop file store).
- There is no metadata lineage / data lineage in Hadoop: no way to know where data came from, who is accessing it, when, why, or even what they are doing with it.
- Hadoop is not transaction-driven; it's a file store, so a load is a "batch load" of a file.
- Hadoop keeps history of everything all the time; managing that history can be a nightmare without additional tooling on top.
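To make the "full file scan" pitfall concrete, here is a minimal Python sketch (local files standing in for HDFS; the file name and layout are hypothetical, purely for illustration): without an index, answering even a single-key question means streaming the entire file.

```python
import csv

# Hypothetical raw extract sitting in the landing zone / staging area.
# Assumed layout: business_key, name, loaded_at
RAW_FILE = "customer_extract_2014_06.csv"

def lookup_business_key(path, business_key):
    """Full scan: every row is read, even if the key shows up in the first line."""
    matches = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row and row[0] == business_key:
                matches.append(row)
    return matches

print(lookup_business_key(RAW_FILE, "CUST-10042"))
```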
Therefore, I will make the following (theoretical) statement: everything in HDFS / pure Hadoop is just a data dump, a data junkyard. It is not, nor will it ever be, a data lake; not without mitigating the issues above, which in turn requires additional solutions to manage. At the bottom of this post we'll get into what this means for a Data Vault model (nothing else at this time); there is more to come…
Benefits / positives over a traditional RDBMS EDW
Now, let's take an opinionated look at what I've found to be some of the benefits. These are nothing new; most of these benefits have been described as the reason for Hadoop's existence. This is not rocket science, but it does take quite a bit of digging and learning to figure it all out.
- Schema-on-read means ingestion of large files is super fast (it doesn't enforce the overhead of aligning data to columns, tables, and rows while loading); i.e., a load is a file copy command.
- MPP arbitrarily splits the data through hashing and bucketing, allowing massive scalability on commodity hardware (see the sketch after this list).
- More knobs, bells, whistles, and switches mean more tweaking and tuning can be done at the core level of the system, resulting in more performance (that's the goal, anyhow).
- Open architecture and open source mean more innovation (in this case) and more tooling to choose from (both a benefit and a drawback).
- No need to ever run a delta check; all history is kept for every file that is ever loaded. However, if you really need to know what has changed, you have to code the logic to find out.
- No restructuring of the data set until or unless you need to access the data and turn it into information (structure on demand is kind of like schema-on-read).
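As a small illustration of the hashing and bucketing point above, here is a sketch (bucket count and keys are assumptions, not anyone's actual configuration): the same business key always hashes to the same bucket, which is what makes data distribution, and co-located joins, possible at scale.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket / node count, purely for illustration

def bucket_for(business_key: str) -> int:
    """Deterministic hash bucketing: the same key always lands in the same bucket."""
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

for key in ["CUST-10042", "CUST-10043", "PROD-88F1"]:
    print(key, "-> bucket", bucket_for(key))
```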
As for best practices, these are still being worked out. One of the first to emerge is to use Hadoop and HDFS as a persistent staging area, then source the data and its history only when the business needs it. At that time, load it into a relational store for further analysis and ad-hoc query execution (along with indexing, joining, governance, and so on).
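Here is a rough sketch of that pattern (SQLite standing in for the relational store; file, table, and column names are hypothetical): the raw file stays untouched in the staging area, and only the data the business asked for is landed in a store where indexes and ad-hoc queries behave the way people expect.

```python
import csv
import sqlite3

RAW_FILE = "customer_extract_2014_06.csv"  # raw file left untouched in the staging area

# Stand-in relational store; in practice this would be the EDW platform of choice.
conn = sqlite3.connect("edw_sandbox.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stg_customer (
        business_key TEXT,
        name         TEXT,
        loaded_at    TEXT
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS ix_stg_customer_bk ON stg_customer (business_key)")

# Source the raw history only when the business asks for it.
with open(RAW_FILE, newline="") as f:
    rows = [tuple(r[:3]) for r in csv.reader(f) if len(r) >= 3]
conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", rows)
conn.commit()

# Ad-hoc queries, indexing, and joins now work against the relational copy.
print(conn.execute(
    "SELECT COUNT(*) FROM stg_customer WHERE business_key = ?", ("CUST-10042",)
).fetchone())
```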
What about data modeling in Hadoop?
This is a question on everyone's mind. The fact is, Hadoop shifts the paradigm to schema-on-read. This notion of schema-on-read is very similar to the COBOL copybook REDEFINES clause; the only difference is that REDEFINES is fixed, based on a data flag. Schema-on-read means that the schema or structure is defined at query / request time and then applied to the underlying data set. If and only if the data matches the schema is it returned; any data that does not match the schema is completely ignored by the processing engine. This is something that has to be learned, and unfortunately it's not an easy thing for people to grasp without getting hands-on experience.
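A minimal sketch of that behaviour (the raw lines and column names are made up for illustration): the structure is declared at request time, applied to whatever is in the file, and anything that doesn't fit simply never comes back.

```python
# Hypothetical raw lines as they might sit in a landing file: no schema was enforced on load.
raw_lines = [
    "CUST-10042|Acme Corp|2014-06-01",
    "CUST-10043|Widget Inc|2014-06-01",
    "## comment line that never matched any layout ##",
    "CUST-10044|Gadget LLC",  # missing a column
]

def apply_schema(lines, column_names, sep="|"):
    """Schema on read: the structure is declared at query time and applied to the data.
    Rows that do not fit the declared schema are silently ignored, which is exactly the
    behaviour people have to learn to expect."""
    for line in lines:
        parts = line.split(sep)
        if len(parts) == len(column_names):
            yield dict(zip(column_names, parts))

for record in apply_schema(raw_lines, ["business_key", "name", "load_date"]):
    print(record)
```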
That said, this raises the next set of questions around Data Vault modeling in Hadoop using schema-on-read. The following questions now come to light:
- What is the value of a hub? What is a hub in Hadoop?
- What is the value of a link? What is a link in Hadoop?
- What is the value of a satellite? What is a satellite in Hadoop?
To summarize: do I even need a Data Vault model in Hadoop, and if so, why?
Well, now we've come to the point of this entry…
Let's start with the hub structure…
The hub is a list of unique business keys defined to be "soft-integrated," i.e., keys of the same semantic meaning and same grain can be stored in the same hub structure. Now, isn't one of the purposes of an enterprise data warehouse to assist with master data? In principle, at least. If this is one of the goals, then it would make sense to construct a hub as a central point of business key integration.
However, the physical hub structure holds no weight; it doesn't add value to the physical storage components that I can find. Which means: you should still model hub entities (do the integration work, do the hard work of understanding the business), at least at the logical level.
Back to the physical store for a second… If you have a bunch of data sitting in a file, how do you "identify" it? Well, for one, you have to scan it at least once. Second, you *should* be looking for keys, keywords, key phrases, something that will either uniquely identify the file or, at a minimum, identify the file as part of a group (i.e., enter tagging). OK, but what if you have more than one key in the file? What if you have a group of keys identifying multiple things in the file?
Well, it goes without saying that every time you want to dig this information out, you must scan the whole file, unless you have something that is "kind of like" an index and can quickly identify what's in the file and what's not. Or you absorb the data into a storage format that allows indexing, and go from there.
So, is there a need for physical hubs after all?
I reserve judgment on this one until further tests have been run. But so far, the tests are indicating there is no need for physical hub structures, but there is a need for logical design of hubs so that we understand the data sets sitting in the data dump. I.e., a hub is still a unique list of business keys, but it is a logical entity rather than a physical structure.
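One way to picture that logical hub, sketched in Python with hypothetical source files and a simple hash of the business key (an illustration only, not a prescribed implementation): the unique list of keys is derived by scanning the raw files rather than maintained as a separate physical table.

```python
import csv
import hashlib

RAW_FILES = ["customer_extract_2014_06.csv", "crm_customers_full.csv"]  # hypothetical sources

def hash_key(business_key: str) -> str:
    """Deterministic hash of the business key, usable as a tag across files and platforms."""
    return hashlib.md5(business_key.upper().strip().encode("utf-8")).hexdigest()

# The "logical hub": a unique list of business keys of the same semantic meaning and grain,
# derived by scanning the raw files instead of being stored as a physical hub table.
hub_customer = {}
for path in RAW_FILES:
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:
                hub_customer.setdefault(hash_key(row[0]), row[0])

for hk, bk in sorted(hub_customer.items()):
    print(hk, bk)
```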
What about links?
OK, next part… the link tables. Links, love them or hate them, are association tables, just like factless facts. (I can't stand that term; it is a nonsensical term that truly has no meaning.) I prefer the third-normal-form definition: a many-to-many relationship table. A link has been extended in the Hadoop context to be a fact table, inclusive of bridge table definitions and point-in-time table definitions. It functions as a join table in many ways. Don't get me wrong, it's not a pure fact table; it doesn't have surrogate keys pointing to dimensions. No, no, that won't do. It has a set of keys (either from the hub, or natural keys), along with aggregations, computed fields, and begin / end dates (temporal dates) for when the relationship is active.
The link table is necessary in a physical sense for several reasons:
- To join across MPP and create on-node joins before moving data across the network; this is vital to the scalability of data sets and to alleviating big-data joins.
- To provide faster access to a structured / pre-assembled set of answers.
- To avoid file scans of the root data set until necessary.
- To possibly map the results to a relational database, perhaps through hash or natural keys.
But are they true link tables as defined in the Data Vault modeling components? No.
They are link-like tables, but in reality they are third-normal-form many-to-many relationship tables that appear as fact tables and link tables at the same time. They are, for all intents and purposes, join tables with post-business-rule data sets.
These are truly built on the fly / on demand, unless 80% of the "queries" use the shared data sets, in which case Hadoop caches versions of these things as they are built. If anything, they are indirection / pointer tables, join tables, and aggregate tables. There is no "formal name" for these things yet; they are not factless facts, nor are they fact tables. They certainly are not conformed, as the raw data underneath that they point to is also not conformed.
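To give a rough feel for what such a link-like structure might hold, here is a small sketch (the order lines, keys, and fields are invented for illustration): a many-to-many relationship keyed by hashed business keys, carrying an aggregate plus begin / end dates for when the relationship is active.

```python
import hashlib
from collections import defaultdict

def hash_key(business_key: str) -> str:
    return hashlib.md5(business_key.upper().strip().encode("utf-8")).hexdigest()

# Hypothetical post-business-rule order lines: (customer key, product key, amount, order date)
order_lines = [
    ("CUST-10042", "PROD-88F1", 120.00, "2014-06-01"),
    ("CUST-10042", "PROD-88F1",  80.00, "2014-06-15"),
    ("CUST-10043", "PROD-77A2",  42.50, "2014-06-03"),
]

# The link-like table: a many-to-many relationship between hashed business keys,
# with an aggregate and the begin / end dates of the relationship.
link_customer_product = defaultdict(lambda: {"total": 0.0, "begin": None, "end": None})
for cust, prod, amount, order_date in order_lines:
    rel = link_customer_product[(hash_key(cust), hash_key(prod))]
    rel["total"] += amount
    rel["begin"] = min(filter(None, [rel["begin"], order_date]))
    rel["end"] = max(filter(None, [rel["end"], order_date]))

for keys, rel in link_customer_product.items():
    print(keys, rel)
```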
But that begs the question: should these tables be conformed? If the answer is yes, then they are business mart tables, complete and total; flat-wide report collection tables / structures, or simply put, aggregate collections of munged data, the result of business-rule-applied data. Remember, if the answer is yes, then you are using Hadoop processing to pick up all the data you "want," keys and all, run it through some set of processes, and then store the full result, with no relation to the raw data anymore. We are right back to auditability problems… food for thought.
Interesting how the game changes? Well, the truth is, the game is changing and will continue to change as we explore best practices for utilizing a Hadoop platform.
OK, what about satellites?
Well, we are in the home stretch of this blog entry. Satellites are easy (comparatively). If you accept the premise that hubs are necessary logically and that links are necessary (both logically and physically), then that makes the rest of the files (in whatever format they are in) satellites. In other words, you are still after "replicating only the data as needed to access the full source on demand." Otherwise, you would go with "copying/munging/changing" and writing a new file (a report collection, as defined above).
The only thing left to do is to ask: how do we access satellites / files efficiently through pointers and "joins"? Do we use natural keys? Do we use hash keys? Do we use links? Do we pick up the data and move what we want to a relational DB / structured world so ad-hoc works properly? Do we wait for the Hadoop ecosystem to "catch up" to the relational space in its functionality?
These are all good questions, and ones you will have to answer on-site depending on your needs. Today, Sanjay and I leave the files in place but tag them with hash keys where possible; we then implement Hive internal structures to provide "indexing" and "joining" kinds of things on the data sets that can be structured. Sanjay and I are still using the full Data Vault contexts of hubs, links, and satellites to work through all of these issues.
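As one possible way to picture that tagging idea (the manifest file, names, and layout below are hypothetical and not our actual implementation), the raw files stay in place and only a small set of hash-key tags is maintained alongside them, so downstream structures can join on the hash key instead of re-scanning every file.

```python
import hashlib
import json

def hash_key(business_key: str) -> str:
    return hashlib.md5(business_key.upper().strip().encode("utf-8")).hexdigest()

# Hypothetical manifest: which raw files (satellites) describe which hashed business keys.
# The files themselves are left untouched; only the tags are kept beside them.
manifest = {
    "customer_extract_2014_06.csv": [hash_key("CUST-10042"), hash_key("CUST-10043")],
    "crm_customers_full.csv":       [hash_key("CUST-10042")],
}

with open("satellite_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# A downstream Hive (or relational) structure could then resolve keys via these tags
# instead of full file scans.
print(json.dumps(manifest, indent=2))
```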
I would love to have your feedback. If you've run any trials or test cases, please elaborate here on what you've found. We need to explore these concepts as a worldwide group.
Hope this helps,