like many of you, i’ve been working in this industry for years. like you, i’ve seen and been a part of massive real-world implementations where #bigdata has been a big part of the system. of course, these implementations were done in a traditional rdbms and not a nosql environment. there’s even been a case where data vault has been used to process big data with huge success. in this entry i will discuss the problem of big data, and discuss the rdbms alongside the nosql approaches.
first, a little history please…
why is everyone up in arms about #bigdata? why is everyone so glued to their technology that they feel the need to “switch to something new” (i.e. a nosql environment) just to handle big data? have we literally lost our minds? do we think that by switching to “magical new technology” all of our problems will suddenly be solved? if you listen to the industry hype, it appears this way. the hype goes so far as to suggest that a nosql platform is the magic bullet that everyone is waiting for.
well, i’m here to tell you that it may not be everything that media hype has cracked it up to be; and if you aren’t careful, you may just lose sight of your massive investments in relational database systems. if you aren’t choosing nosql for the right reasons and for the right applications then you may just be losing your mind! (ok, that’s a bit of a stretch, but it had to be said)
historically speaking, can traditional rdbms do big data? if so, where are the case studies? well let’s chat about this. everyone seems to believe (and i concur) that big data is made up of: volume, velocity, and variety. ok, let’s take two particular pieces of the puzzle: volume and velocity. for now, we won’t discuss variety. why? because many of the hyped up stories out there talk about volume and velocity as being “good enough” reasons to switch to nosql for big data purposes.
what’s one base argument for “switching to nosql” for big data?
historically speaking, let’s focus in on data ingestion rates or better yet: transaction rates (aka: tpc-c benchmarks). many argue that you must have a nosql (distributed file store) without referential integrity turned on, in order to get massive ingestion rates on big data. well, ok – this may be true, but it depends on what you call “big data” and what level of transactions per second your business truly is doing. if you are splitting atoms and recording 1 terabyte per second, then yes, i’ll buy that argument – but that means that you are also in the top 1% of worldwide cases that actually need this functionality.
case studies from rdbms that the media hype wants you to ignore
any time there is a shiny new widget in technology, it seems everyone rushes to get one – rushes to build one, rushes to supplant their hard-earned and well designed investment. well, i’m here to remind you that it may not be necessary to jump ship… i tend to argue that if you really think you have a business case for a nosql environment, then you should be looking at a cooperative seamless integration point.
by the way, the data vault model offers this solution through architectural layers and good design. in the future, the rdbms engines will offer this level of integration, and in fact, some new nosql offerings are already “combined” under the covers offering both relational technology and hadoop technology seamlessly.
so who’s done what?
- teradata: nearly everyone talks about big data, and then throws walmart into the discussion – unfortunately they tend to “forget” to mention that walmart is on the teradata rdbms engine. the case study here shows: walmart processing nearly 5,000 items per second and 10 million register transactions over a 4 hour period. http://news.walmart.com/news-archive/2012/11/23/walmart-us-reports-best-ever-black-friday-events
teradata can clearly “do” big data with no problem. teradata has had success over the years with barabas bank, qantas airlines, and a variety of department of defense initiatives – all of which are doing incredibly high rates (volume) per second (velocity). i’ve seen teradata actually beat this number in other customer sites with 8,000 transactions per second. teradata is an mpp based database.
- sqlserver 2008: yes, people forget – sqlserver 2008 r2 is an incredible beast when it comes to processing power. it actually is a very good contender in the bigdata space, and should be considered with proper tpc-c rates. this study reports: about 16,000 transactions per second on sqlserver 2008 r2. http://sqlblog.com/blogs/linchi_shea/archive/2012/01/24/performance-impact-sql2008-r2-audit-and-trace.aspx sqlserver 2008 r2 can be configured in clusters of mpp machines.
- db2 for zos edition: db2 engine is very very good, granted this was the mainframe edition. it’s matured over the years, and is a tremendous database in the world of relational database engines. they’ve incorporated mpp at various steps as well. this study stated the following (slide 26): 15,353 transactions per second at a korean bank. http://www.slideshare.net/cuneytgoksu/db2-10-for-zos-update
- db2 udb / ldw / rdbms: i’ve personally been involved in a case study at a us defense contractor where we were ingesting 10,000 transactions per second per node with 10 nodes – into a data vault model on db2 udb (rdbms instead of mainframe edition). this was in 2001 – at the time, this was record breaking as well. we were pulling space satellite data (structured transactions), then combining them into a single db2 node post processing.
- oracle rdbms: oracle’s exadata v2 machines are fast, and only getting faster. however, the benchmark cited here was run on oracle’s 10g product – and since it was published several years ago, i’m sure performance has improved since then. 60,000 transactions per second is the claim – very very good for an rdbms engine. http://www.dba-oracle.com/oracle_news/news_world_record_tpcc_hp.htm
i’ve personally been involved in data vault cases with each of these databases where the transaction per second ingestion rates were quite high (between 7,000 and 10,000 transactions per second) – and that’s with referential integrity turned on in these databases.
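to put the figures quoted above on one scale, here is a minimal python sketch that converts them all to a per-second rate. the labels and the helper name are mine, not from the case studies. two assumptions to keep in mind: the walmart register figure is averaged evenly over the 4 hour window, and the db2 udb total assumes all 10 nodes were ingesting at the quoted rate at the same time. the studies also measure different things (items, register transactions, tpc-c style transactions), so treat the list as a rough sense of scale and not a ranking.

```python
SECONDS_PER_HOUR = 3600


def per_second(total_events: float, hours: float) -> float:
    """average events per second over a window of the given length."""
    return total_events / (hours * SECONDS_PER_HOUR)


# figures as quoted in the list above; labels are illustrative only
quoted_rates = {
    "teradata / walmart, items": 5_000,
    # assumption: the 10 million transactions are spread evenly over 4 hours
    "teradata / walmart, register transactions": per_second(10_000_000, 4),
    "teradata, other customer sites": 8_000,
    "sqlserver 2008 r2": 16_000,
    "db2 for zos, korean bank": 15_353,
    # assumption: all 10 nodes ingest concurrently at 10,000 per second each
    "db2 udb, 10 nodes at 10,000 each": 10_000 * 10,
    "oracle 10g": 60_000,
}

# print from lowest to highest quoted rate
for label, rate in sorted(quoted_rates.items(), key=lambda kv: kv[1]):
    print(f"{label:45s} {rate:>10,.0f} per second")
```

averaged out this way, the 10 million register transactions come to roughly 694 per second, which is a useful reminder that a headline volume number and a sustained per-second rate are not the same thing.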
what about variety?
we can’t talk about big data without talking about variety. variety of the data can mean lots of different things. to be completely fair, unstructured and semi-structured data needs to be included in big data, and admittedly this is one area where relational technology doesn’t do a really good job. the rdbms engines have introduced full-text indexing to attempt to solve the problem, and they’ve also introduced xml data type stores and xql-like query basics, but in reality the ingestion rates drop tremendously when dealing with this type of data.
remember: variety can mean different things. variety can also mean: multiple structures (as in multiple different structured transactions), and this is something the rdbms engines are very very good at handling.
so where does that leave us?
it’s rather ironic really, there is so much hype out there about nosql and “why rdbms won’t work for you” that it’s easy to get lost in the sea by swimming into an undertow on purpose. people for some reason conveniently “forget” that rdbms engines are still the number 1 selling licensed database engine technology out there, and there are very good reasons for that. people tend to “forget” that the rdbms engines bring tremendous value when it comes to the integrity of the information store.
am i saying that nosql is “bad”? no not at all. i’m merely stating that it has its purpose, and in my mind it serves the “data warehousing community” as a raw storage area for text documents and xml documents and unstructured data sets best for 98% of the market space. for the top 2% of the market space (who might be dealing with 1 terabyte a second of ingestion), a highly scalable solution where the data set can be broken into thousands of smaller files to be ingested, might be best suited to a nosql file store rather than an rdbms.
we should not lose sight of the value that rdbms engines have. we should not lose sight of the vested investment in cost, engineering hours, and reliability that the rdbms engines bring to the table. we should not simply throw the baby out with the bath water to justify bringing in the new technology – just because “i want a shiny new object.”
the main point is this: the more reading i do, the more discovery that i find (in labs, test cases, and build outs), the more i understand that nosql is and should be a complementary platform, to sit alongside your vested edw / rdbms engines. it should be applied as a staging area (if you really feel the need to put structured data on a nosql environment), but better than that – it should be used as a raw text document storage area, and an xml file storage area, where these objects are then properly keyed and distributed.
remember: some of these nosql technology vendors are offering seamless hybrid technology – these are the “ones to watch” in my book, these are the ones that will make it worthwhile for the transition to occur – but the engineering on these hybrid systems is mostly new, when compared with rdbms technology that has been around for a long long time. yes, they will converge, but in doing so, they will each absorb best of breed from each other’s solutions under one roof.
end of the day, what are the main points for choosing nosql for big data?
in my humble opinion i would chalk it up to these:
- if you have to deal with, search, learn from, interrogate: text documents, xml, audio files, images, or binary data
- if your ingestion rates are so high that you need a truly elastic compute cloud with highly distributed data sets, and here’s the kicker: the cost of rdbms is prohibitive for scale out when compared with nosql
hmmm, be careful about cost – cost can mean many things. oftentimes the hype will have you focus in on only one aspect of cost: storage and number of mpp nodes. they conveniently “forget” to mention cost of maintenance, cost of upkeep, cost of tuning, cost of storage, cost of coding to make the data usable, cost of accessibility and so on…
a final point:
some have argued that graph databases require a nosql environment. those arguments are true for folks who have not looked at or do not know about data vault modeling. data vault models actually enable you to place a “graph based database” directly in a relational database engine, and allow you to take full advantage of the sql language for graph based queries and results. the data vault model allows you to explore a graph based data set in a relational world.
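to make the graph point a little more concrete, here is a minimal sketch using python’s built-in sqlite3 module as a stand-in for a relational engine. it treats a hub as the set of nodes and a link between two hub keys as the set of edges, then walks the graph with a standard recursive sql query. the table and column names (hub_airport, link_route, airport_hk and so on) and the sample data are hypothetical, and the tables are heavily simplified: a real data vault hub and link would also carry load dates and record sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # keep referential integrity turned on
conn.executescript("""
    -- hub: one row per business key (the nodes of the graph)
    create table hub_airport (
        airport_hk   integer primary key,
        airport_code text not null unique
    );
    -- link: one row per relationship between two hub keys (the edges)
    create table link_route (
        route_hk        integer primary key,
        from_airport_hk integer not null references hub_airport(airport_hk),
        to_airport_hk   integer not null references hub_airport(airport_hk)
    );
""")
# hypothetical sample data: aaa -> bbb -> ccc -> aaa (a cycle), ccc -> ddd, eee isolated
conn.executemany(
    "insert into hub_airport values (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "ddd"), (5, "eee")],
)
conn.executemany(
    "insert into link_route values (?, ?, ?)",
    [(1, 1, 2), (2, 2, 3), (3, 3, 1), (4, 3, 4)],
)


def reachable_from(conn, airport_code, max_hops=5):
    """return (airport_code, fewest hops) for every node reachable from the start node."""
    sql = """
        with recursive reachable(airport_hk, hops) as (
            select airport_hk, 0 from hub_airport where airport_code = ?
            union
            select l.to_airport_hk, r.hops + 1
            from reachable r
            join link_route l on l.from_airport_hk = r.airport_hk
            where r.hops < ?  -- the hop limit stops cycles from recursing forever
        )
        select h.airport_code, min(r.hops) as hops
        from reachable r
        join hub_airport h on h.airport_hk = r.airport_hk
        group by h.airport_code
        order by hops, h.airport_code
    """
    return conn.execute(sql, (airport_code, max_hops)).fetchall()


print(reachable_from(conn, "aaa"))
```

the hop limit is a design choice for the sketch: because the hop count is part of each row, a cycle would otherwise keep generating new rows. whether a recursive query like this performs well at big data volumes depends on the engine and the indexing, which the sketch does not try to show.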
so in closing, don’t throw out your rdbms engines just yet. if you have bigdata or big data needs, please try to evaluate the technology on principle and roi. don’t forget that rdbms engines have been “doing big data for years” in a structured world.
leave a comment and let me know when you might apply nosql for big data, and whether or not you see data vault in these situations.