would you go sky-diving without a parachute? well, if you simply “jump” on the nosql bandwagon with your data warehouse, that is exactly what you are doing. this post digs further into hadoop, big data, nosql, and data warehousing. data warehousing on nosql is, generally speaking, a whole new ball game.
first, some thoughts…
no doubt you’ve heard of big data, no doubt you’ve read something on nosql. both of which are very interesting concepts in the world of development. but probably what you didn’t know or didn’t hear is that 90% of the “work” to manage data sets in these environments is writing code.
to be specific, it’s writing (or generating) some form of map/reduce code that can be dispersed across multiple shared-nothing servers, where the data (hopefully) is partitioned evenly. beyond that, a lot of things from the relational world are “left behind” in order to gain performance over big data sets, including:
- no more referential integrity
- no more acid compliance (in some cases)
- no more sql access
- no more “indexing” (not as you and i would understand it)
- no more ad-hoc querying <– this one is a biggie and often overlooked!
- no more “data normalization”
- no more updates (caveat – some nosql systems allow this in specific conditions)
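to make the map/reduce pattern above concrete, here is a minimal, single-process python sketch of the three phases (map, shuffle, reduce). the data and field layout are hypothetical; on a real hadoop cluster the map and reduce phases run as java (or streaming) tasks on separate nodes, with the framework doing the shuffle.

```python
from collections import defaultdict

def map_phase(records):
    """emit (key, value) pairs -- here: count hits per url."""
    for record in records:
        url, _timestamp = record.split("\t")
        yield (url, 1)

def shuffle(pairs):
    """group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """aggregate each key's values into a final result."""
    for key, values in grouped:
        yield (key, sum(values))

# hypothetical clickstream lines: "url<TAB>timestamp"
log = ["/home\t100", "/about\t101", "/home\t102"]
print(dict(reduce_phase(shuffle(map_phase(log)))))
# {'/home': 2, '/about': 1}
```

note that even this toy version has no sql, no indexes, and no ad-hoc path: every new question means new map/reduce code.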
before i go any further, and before you say “wait – no, you’re wrong!! hadoop doesn’t have that restriction!!” – check out the references i used for my research, in case you wish to read some of this information on your own:
and the latest news (of course) is that amazon and paraccel teamed up to smash the entire data warehousing and hadoop barriers:
because it’s still so new, there are many questions yet to be answered – but personally i think this will be a huge runaway success in the data warehousing industry. that is: if you can get past “hosting” or outsourcing your edw in a cloud on someone else’s servers.
old school knowledge, new school techniques…
ok, to way-over-simplify: you can think of columnar data stores (for example, hbase on hadoop) and defining “a structure” to access structured data as similar to writing a cobol copybook, complete with redefines and nested table structures. you can also think of “access” to those data sets as writing a cobol program (in this case java code, or something that leverages map/reduce code) under the covers.
the storage mechanisms are different, and the simple fact that columns are “name-spaces” and can be dynamically added or deleted is different.
but before you get excited and jump up and down over “denormalization vs normalization” or “data modeling”, remember this: the logical data model is hierarchical. the physical data store is key, value, timestamp (triple store) technology.
which basically says:
- all column names are replicated for every data element in every row in every table.
- nested or hierarchical complex types are allowed (and can change from row to row!!)
- governance and referential integrity are lost from the “data model” or “base schema” and pushed into the map/reduce code set (they become the application’s responsibility)
yes, massive storage requirements! but wait, isn’t there compression? yep, but it’s clunky and isn’t always on demand.
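the triple-store point is easier to see in code. here is a sketch (hypothetical customer data, hbase-style layout) of cells keyed by (row key, column name, timestamp) – notice how every row repeats its column names, which is exactly where the storage blow-up comes from, and how rows need not share the same columns:

```python
# hbase-style storage sketch: each cell is keyed by
# (row_key, column_name, timestamp) -- a "triple store".
cells = {
    ("cust#001", "name",    1355000000): "acme corp",
    ("cust#001", "country", 1355000000): "us",
    ("cust#002", "name",    1355000100): "globex",
    ("cust#002", "country", 1355000100): "de",
    # rows need not share the same columns:
    ("cust#002", "fax",     1355000100): "+49-555-0100",
}

def get_row(cells, row_key):
    """reassemble a logical row from its scattered cells."""
    return {col: val for (rk, col, _ts), val in cells.items() if rk == row_key}

print(get_row(cells, "cust#002"))
# {'name': 'globex', 'country': 'de', 'fax': '+49-555-0100'}
```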
so what do you gain from these environments?
well, let me just say: i am no expert in nosql (“not only sql”); i am still learning about these systems. but from many of the in-depth articles i’ve been reading, i’ve drawn some interesting conclusions:
- you can see performance gains over huge / massive data sets (when compared to storing that same data in rdbms and trying to query it) – particularly when the data set can be “housed” in a single schema structure, and referenced by row key values.
- easy dynamic column changes (this is where the “schema-less” claim comes into play, but in reality we still have to define structures and schemas to manage and access data sets)
- rapid ingestion of machine-generated data sets (higher transaction throughput because of the lack of rdbms overhead) – this is simply a “file copy”, with hadoop re-distributing the file according to partitions across the nodes. hey wait: re-distribution of huge files takes time!! yes, indeed it does, and no matter what you say, you must define a schema.
- raw data storage, accountable, and auditable
- mpp on commodity hardware
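the “re-distribution across the nodes” mentioned above boils down to a deterministic placement function. a hedged sketch (the node count and hash choice are my assumptions, not any particular platform’s implementation):

```python
import hashlib

NUM_NODES = 4  # hypothetical shared-nothing cluster size

def node_for(row_key: str) -> int:
    """deterministically assign a row key to one of the nodes."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# distribute some hypothetical row keys across the cluster
rows = ["cust#%03d" % i for i in range(10)]
placement = {key: node_for(key) for key in rows}
print(placement)
```

the point: even partitioning depends entirely on the key you hash, which is why the choice of row key (a schema decision!) matters so much.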
there are a lot of “blog entries” that say hadoop and nosql systems are “schema-less”. it’s only schema-less in certain situations where you write map code that searches “raw bytes” for values (hence the ability to use binary, or purely unstructured, data sets) – and produces structured results. for the “rest” of the world using delimited file sets, fixed-width file sets, or xml file sets, we still have to define a schema with column names and table names in order to “use” the data effectively.
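here is what that “you still need a schema” point looks like in practice: a serde-style sketch (hypothetical column names and types) that turns one delimited record into a typed row. without the declared schema, the file is just bytes.

```python
import csv
import io

# a hive-style schema must still be declared before the "schema-less"
# delimited file is usable: names and types are not in the data itself.
SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]

def apply_schema(raw_line: str) -> dict:
    """turn one pipe-delimited record into a typed row, serde-style."""
    fields = next(csv.reader(io.StringIO(raw_line), delimiter="|"))
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

print(apply_schema("1042|acme corp|99.95"))
# {'order_id': 1042, 'customer': 'acme corp', 'amount': 99.95}
```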
so is data modeling completely dead?
exactly the opposite. it is now more important than ever! however, the way we model logically should follow natural hierarchical definitions (think ontologies of key business terms rather than normalized data sets). physical modeling also matters (at least as a common reference for “structured or semi-structured data”). and guess what? the schema is loosely coupled to the data set (in column-based data stores), which is what gives it so much flexibility to handle schema change.
why is data modeling still important? you have to know what you’ve got (data elements) in order to begin asking questions about it.
in fact, data modeling (and the modeling tools we use) must now undergo a paradigm shift. i believe that to be “successful” in the new nosql world, we should start with ontology modeling of business terms. this hierarchy of business terms then needs to be broken into sub-sets (which effectively become tables in the nosql environment). from there, we need existing tool sets like powerdesigner, powerarchitect, er-win, and the like to read in these “hierarchies of business terms” and allow us to map them to nosql key-value stores (or serde definitions) for use with hadoop.
what are some of the drawbacks of these environments?
well, there are many different ones – and they vary based on the platform of nosql that you choose (ie: cloudera vs hadoop vs hbase vs hive vs mongodb vs couchdb vs paraccel vs netezza and so on). but here are some generic points for most hadoop based systems:
- most of the time, you are stuck writing map/reduce code (not sql), resulting in increased maintenance costs and increased complexity
- data modeling (if not managed) is “tossed out the window” by programmers for expediency’s sake. in other words: data governance becomes exponentially harder to achieve
- programmers quickly make “new files” to answer “today’s business question”, and thus end up replicating data sets all across the platform. it can result in a data junkyard really quickly if not managed.
- if the wrong implementation is chosen (to use map for logic instead of reduce or vice versa), performance can suffer dramatically
- rows can have different structures within the same table – and can easily result in “broken code” or missed data
- most solutions cannot support ad-hoc querying. it’s not because of the sql interface technology, but rather because of the execution speed of map/reduce code trawling over hundreds of terabytes of data. in other words, many “bi analytics customers” are used to sub-second response times on newly issued queries; this simply isn’t happening in today’s nosql solutions (those with big data, anyhow).
- limited versions of history. hbase (for example) only keeps the last 3 versions of any given row (unless part of the row key is a timestamp). but then your 300 tb warehouse explodes into a 3 pb warehouse nearly overnight.
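the map-vs-reduce implementation point deserves a concrete illustration. a hedged sketch (hypothetical sales data): the same query answered two ways, with the same result – but the first version ships every row through the shuffle, while the second drops non-matching rows at the source node before they cost any network time.

```python
from collections import defaultdict

data = [("us", 10), ("de", 99), ("us", 120), ("fr", 7)]

# bad: ship everything to the reducer, then filter
def reduce_side_filter(pairs):
    groups = defaultdict(list)
    for country, amount in pairs:        # all rows hit the shuffle
        groups[country].append(amount)
    return {c: sum(v) for c, v in groups.items() if c == "us"}

# better: filter in the mapper so only matching rows are shuffled
def map_side_filter(pairs):
    groups = defaultdict(list)
    for country, amount in pairs:
        if country == "us":              # drop rows at the source node
            groups[country].append(amount)
    return {c: sum(v) for c, v in groups.items()}

assert reduce_side_filter(data) == map_side_filter(data) == {"us": 130}
```

same answer, very different network cost – and on hundreds of terabytes, that difference is the “performance can suffer dramatically” from the list above.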
in addition to the above, there are other issues (at the project level), taken from the informatica / hadoop technical pdf:
- difficulty finding trained resources with expertise in map/reduce programming and management
- pre and post processing of data in hadoop
- challenges in tracking, and managing diversity of data sets and schemas
- lack of transparency, governance, and auditability over development tasks
- limited data quality and governance of data sets
- high cost of maintenance for scripting / code management
- challenges in meeting slas for mixed-workload requirements and ad-hoc querying
ok ok, there’s a lot more to this – i was just looking at the tip of the iceberg. hopefully this information is helpful to you. the reason i’m writing these entries is that i can’t discuss data vault modeling on nosql stores without giving you some foundational rules around the subject. in the new year, i will dive into data vault modeling on hadoop and nosql-ish environments. i have asked amazon for a login to their redshift “paraccel” platform for testing and development. we will see what happens. what i will say is: that platform looks promising indeed – including sql access, and some management layers to “resolve or mitigate” some of the problems i’ve listed here.
what i will say about data vault modeling is: it’s evolving. there are changes necessary at the physical level in order to support hadoop, big data, and nosql. but the value of modeling at the logical level (splitting business keys, descriptors, and relationships) remains high. i will also say that the jury is still “out” on using surrogate sequence numbers. in a relational world they work great; in a hadoop / nosql environment they can wreak havoc on the mpp distribution algorithms and cause hot-spots on mpp platforms (except in teradata, where you can choose a different primary index or hash key for data splitting).
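to see why sequence numbers cause hot-spots: consecutive keys land in the same key range, so every new insert hits the same node. one common mitigation (“salting” – my illustration, not a data vault prescription) prefixes the key with a hash bucket so consecutive inserts scatter:

```python
import hashlib

BUCKETS = 8  # hypothetical number of key ranges / nodes

def salted_key(seq: int) -> str:
    """prefix a monotonically increasing surrogate key with a hash
    bucket so consecutive inserts spread across key ranges instead
    of hot-spotting on one node."""
    bucket = int(hashlib.md5(str(seq).encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:02d}-{seq:012d}"

# consecutive sequence numbers now scatter across key ranges:
for seq in range(100, 104):
    print(salted_key(seq))
```

the trade-off: range scans over the original sequence now have to fan out across all buckets, which is part of why the jury is still out.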
i’d love to hear your feedback, thoughts, comments. are you using hadoop? have you put a data vault-like schema structure on a nosql box? if so, what did it look like? what modeling changes did you have to make? how is it running for you?