Data Scientist

#bigdata #noSql and #datavault 2.0

As you know, Sanjay and I have been hard at work on this topic for the last two years. I am preparing to release some of the findings soon; in fact, this blog entry shares some thoughts and some difficult questions that we propose to the marketplace. I encourage you to offer feedback on the questions we pose by replying to this entry. Note: this is the result of discussions and hands-on work that Sanjay and I have done over the past two to three years. These are my opinions.

Defining our #nosql platforms

There are so many #nosql platforms (and now NewSQL platforms) in the market that I thought it wise to describe the platforms Sanjay and I have actually been working with, and why. We have several platforms we've been testing with. The first is pure Apache Hadoop and HDFS; let it be known that when I say "hadoop" in this post, that is the context in which I am discussing the tests. When I say "hadoop platform," I am referring to the additional components such as Sqoop, Flume, and the rest of the applications on top of Hadoop. When I say "hadoop ecosystem," I'm referring to all of the above plus the hardware we are running on.

Our second platform is a stripped-down version of Cloudera; Sanjay removed many of the additional tools that we did not wish to test with.

Our third platform is Hadoop + HBase, specifically for HBase testing purposes.

Our fourth platform is Hadoop + Hive (internal and external tables).

Our fifth platform is now Hadoop + Stinger.

What are we trying to accomplish?

We are trying to put the entire enterprise data warehouse (including Data Vault 2.0 modeling, analytics, real-time feeds, batch feeds, cleansing, business rules, managed self-service BI, quality, metrics, control / security, access, etc.) on the Hadoop ecosystem. Our primary focus is to answer the following questions:

  1. What pitfalls are there compared to a standard / traditional relational DB for the EDW ecosystem?
  2. What positives are there compared to a traditional relational DB for the EDW ecosystem?
  3. What best practices should be standardized in this environment?
  4. What happens to data modeling in the Hadoop ecosystem that we need to be aware of?

We have about 50 more questions, but I won't bore you with the list. There are other, far more pertinent questions, which I respectfully ask you to reply to, posted further down in this blog.

A bit of history around the Data Vault 2.0 System of BI

For those who don't know, DV2 is a system of business intelligence that includes: agility, CMMI, Six Sigma, KPAs/KPIs, PMP, best-practice data modeling, best-practice data integration, separation of hard and soft business rules, and more, all for the purpose of building an enterprise data warehouse incrementally, with rapid build-out.

OK, back to the point…


There are pitfalls to these Hadoop-based ecosystems that vary with the physical setup and implementation choices. There are over 50 significant pitfalls (some are mitigated by other technology, some are not). Below are a few; again, keep in mind these depend on the physical layout, storage, and implementation chosen.

  1. HDFS does not allow update-in-place on "files" (this can actually be seen as a benefit).
  2. HDFS does not understand indexing; i.e., every access is a full file scan.
  3. Hadoop has a big learning curve for non-programmers.
  4. HDFS can implement many different "physical storage formats," which can either hurt or help performance.
  5. Hadoop on MPP still exhibits the data co-location and cross-node "join" problems.
  6. Hadoop itself doesn't understand joins without writing map and reduce code.
  7. Hadoop's schema-on-read makes it extremely difficult to optimize performance for ad-hoc queries.
  8. The general population doesn't understand schema-on-read vs. schema-on-write.
  9. Data modeling doesn't / can't / won't apply to all data absorbed by Hadoop (think Word docs, images, movie files, sound files).
  10. Every Hadoop-based "solution" is different from the last, meaning a consistent design doesn't carry across "hadoop platforms."
  11. There is no governance over Hadoop; anyone with a login has access to everything in HDFS.
  12. There is no metadata lineage / data lineage in Hadoop: no way to know where data came from, who's accessing it, when, why, or even what they are doing with it.
  13. Hadoop is not transaction driven; it's a file store, i.e., a "batch load" of a file.
  14. Hadoop keeps history of everything, all the time; managing that history can be a nightmare without additional tooling on top.

Therefore, I will make the following theoretically applied statement: everything in HDFS / pure Hadoop is just a data dump, a data junkyard. It is not, nor will it ever be, a data lake; not without mitigating the issues above, which in turn requires additional solutions to manage. At the bottom of this post, we'll get into what this means for a Data Vault model (nothing else at this time); there is more to come…

Benefits / positives over a traditional RDBMS EDW

Now, let's take an opinionated look at what I've found to be some of the benefits. These are nothing new; most of them have been described as the reason for Hadoop's existence. This is not rocket science, but it does take quite a bit of digging and learning to figure it all out.

  1. Schema-on-read means ingestion of large files is super fast (it doesn't enforce the overhead of aligning data to columns, tables, and rows while loading); i.e., a load is a file-copy command.
  2. MPP arbitrarily splits the data through hashing and bucketing, allowing massive scalability on commodity hardware.
  3. More knobs, bells, whistles, and switches mean more tweaking and tuning can be done at the core level of the system, resulting in more performance (that's the goal, anyhow).
  4. Open architecture and open source mean more innovation (in this case) and more tooling to choose from (both a benefit and a drawback).
  5. There is no need to ever run a delta check; all history is kept for every file that is ever loaded. However, if you really need to know what's changed, you have to code the logic to find out.
  6. There is no restructuring of the data set until or unless you need to access the data and turn it into information (structure-on-demand is kind of like schema-on-read).
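Point 5 above is worth a concrete illustration: since a load is just a file copy and all history is kept, change detection must be hand-coded. A minimal Python sketch (the record layout and the key name `id` are made up for illustration):

```python
import hashlib

def row_hash(record):
    """Stable hash of a record's full contents (column order fixed by sorting keys)."""
    payload = "|".join(str(record[k]) for k in sorted(record))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def delta(previous, current, key):
    """Compare two full loads and classify each current record as an
    insert, an update, or unchanged, keyed on a business key."""
    prev_index = {r[key]: row_hash(r) for r in previous}
    changes = {"insert": [], "update": [], "unchanged": []}
    for rec in current:
        old = prev_index.get(rec[key])
        if old is None:
            changes["insert"].append(rec)
        elif old != row_hash(rec):
            changes["update"].append(rec)
        else:
            changes["unchanged"].append(rec)
    return changes
```

Nothing in the previous load is ever modified in place; the new load is simply compared against it, which is exactly the logic you end up writing yourself on pure HDFS.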

Best practices…

These are still being worked out. One of the first best practices to emerge is to use Hadoop and HDFS as a persistent staging area, then source the data and its history only when the business needs it. At that time, load it into a relational store for further analysis and ad-hoc query execution (along with indexing, joining, governance, and so on).
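As a rough sketch of that pattern (the paths and the partition naming below are illustrative, not a standard), a persistent staging area just lands each file under a load-date partition and never rewrites it:

```python
import shutil
from pathlib import Path

def stage_file(source: Path, staging_root: Path, load_date: str) -> Path:
    """Land a source file, unchanged, under a load-date partition.
    History accumulates over time; nothing is updated in place."""
    target_dir = staging_root / f"load_date={load_date}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # a load is just a file copy
    return target
```

Downstream, only the partitions the business actually asks for get parsed and loaded into the relational store.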

What about data modeling in Hadoop?

This is a question on everyone's mind. The fact is, Hadoop shifts the paradigm to schema-on-read. This notion is very similar to a COBOL copybook REDEFINES; the only difference is that a REDEFINES is fixed, based on a data flag. Schema-on-read means that the schema, or structure, is defined at query / request time and then applied to the underlying data set. If and only if the data matches the schema is it returned; any data that does not match the schema is completely ignored by the processing engine. This is something that has to be learned, and unfortunately it's not an easy thing to grasp without hands-on experience.
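The behavior described above can be sketched in a few lines of Python (a hypothetical pipe-delimited file; the schema is a list of (name, type) pairs supplied at read time):

```python
def read_with_schema(lines, schema):
    """Schema on read: structure is applied at query time, and any row
    that does not match the declared schema is silently ignored."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) != len(schema):
            continue  # wrong shape: invisible to the processing engine
        try:
            yield {name: cast(value)
                   for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # type mismatch: also ignored, not rejected
```

Reading `["1|ann", "garbage", "2|bob|extra"]` with schema `[("id", int), ("name", str)]` yields only the first row; the others simply vanish, which is exactly the behavior people have to learn.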

That said, this raises the next set of questions around Data Vault modeling in Hadoop using schema-on-read:

  1. What is the value of a hub? What is a hub in Hadoop?
  2. What is the value of a link? What is a link in Hadoop?
  3. What is the value of a satellite? What is a satellite in Hadoop?

To summarize: why, and/or do, I need a Data Vault model in Hadoop?

Well, now we've come to the point of this entry…

Let's start with the hub structure…

The hub is a list of unique business keys defined to be "soft-integrated"; i.e., keys of the same semantic meaning and same grain can be stored in the same hub structure. Now, isn't one of the purposes of an enterprise data warehouse to assist with master data? In principle, at least. If this is one of the goals, then it would make sense to construct a hub as a central point of business-key integration.

However, the physical hub structure holds no weight; it doesn't add value to the physical storage components that I can find. Which means: you should still model hub entities (do the integration work, do the hard work of understanding the business), at least at the logical level.

Back to the physical store for a second… If you have a bunch of data sitting in a file, how do you "identify" it? Well, for one, you have to scan it at least once. Second, you *should* be looking for keys, keywords, key phrases: something that will either uniquely identify the file, or at a minimum identify the file as part of a group (i.e., enter tagging). OK, but what if you have more than one key in the file? What if you have a group of keys identifying multiple things in the file?

Well, it goes without saying that every time you want to dig this information out, you must scan the whole file, unless you have something that is "kind of like" an index and can quickly identify what's in the file and what's not; or you absorb the data into a storage format that allows indexing, and go from there.
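The "scan once, then tag" idea can be sketched as follows (the key pattern `CUST-\d+` is a made-up example of a business-key format):

```python
import re

def build_tag_index(files, key_pattern):
    """One full scan per file: pull out anything that looks like a
    business key and invert it into a key -> set-of-files index, so
    later lookups can skip files that cannot contain the key."""
    pattern = re.compile(key_pattern)
    index = {}
    for name, content in files.items():
        for key in pattern.findall(content):
            index.setdefault(key, set()).add(name)
    return index
```

A lookup in the resulting index behaves "kind of like" an index probe: only the files it names need to be rescanned for detail, instead of every file in the dump.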

So, is there a need for physical hubs after all?

I reserve judgment on this one until further tests have been run. But so far, the tests indicate there's no need for physical hub structures; there is, however, a need for logical design of hubs, so that we understand the data sets sitting in the data dump. I.e., a hub is still a unique list of business keys, but it is a logical entity rather than a physical structure.

What about links?

OK, next part: the link tables. Links, love them or hate them, are association tables, just like factless facts (I can't stand that term; it is a nonsensical term that truly has no meaning). I prefer the third-normal-form definition: a many-to-many relationship table. A link has been extended in the Hadoop context to be a fact table, inclusive of bridge-table and point-in-time-table definitions. It functions as a join table in many ways. Don't get me wrong, it's not a pure fact table; it doesn't have surrogate keys pointing to dimensions. No, no, that won't do. It has a set of keys (either from the hub, or natural keys), plus aggregations, computed fields, and begin / end dates (temporal dates) for when the relationship is active.
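As a sketch of such a structure (the field names are illustrative, not a standard; the hash rule shown is the common DV2 convention of hashing the trimmed, upper-cased business key):

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

def hash_key(*business_keys):
    """DV2-style hash key: trimmed, upper-cased keys joined and hashed."""
    joined = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

@dataclass
class LinkRow:
    link_hash: str           # hash of the participating business keys
    customer_hash: str       # key into one hub (or a natural key)
    product_hash: str        # key into another hub
    total_qty: int           # a computed / aggregated field
    begin_date: str          # relationship active from...
    end_date: Optional[str]  # ...open-ended while still active

def make_link_row(customer_bk, product_bk, total_qty, begin_date):
    return LinkRow(
        link_hash=hash_key(customer_bk, product_bk),
        customer_hash=hash_key(customer_bk),
        product_hash=hash_key(product_bk),
        total_qty=total_qty,
        begin_date=begin_date,
        end_date=None,
    )
```

Note how the row carries hub hash keys (not dimension surrogates), a computed measure, and temporal dates, matching the hybrid link / fact role described above.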

The link table is necessary in a physical sense for several reasons:

  • To join across MPP nodes, creating on-node joins before moving data across the network. This is vital to the scalability of data sets, and to alleviating big-data joins.
  • To provide faster access to a structured / pre-assembled set of answers.
  • To avoid file scans of the root data set until necessary.
  • To map the results back to a relational database, perhaps through hash or natural keys.

But are they true link tables as defined in the Data Vault modeling components? No.

They are link-like tables, but in reality they are third normal form: many-to-many relationship tables that appear as fact tables and link tables at the same time. They are, for all intents and purposes, join tables with post-business-rule data sets.

These are truly built on the fly / on demand; unless 80% of the "queries" use the shared data sets, in which case Hadoop caches versions of these things as they are built. If anything, they are indirection / pointer tables, join tables, and aggregate tables. There is no "formal name" for these things yet; they are not factless facts, nor are they fact tables. They certainly are not conformed, as the raw data underneath that they point to is also not conformed.

But that begs the question: should these tables be conformed? If the answer is yes, then they are business-mart tables, complete and total; flat-wide report-collection tables / structures; or, simply put, aggregate collections of munged data, the result of business-rule-applied data. Remember, if the answer is yes, that means you are using Hadoop processing to pick up all the data you "want," keys and all, run it through some set of processes, and then store the full result, with no relation to the raw data anymore. We are right back to auditability problems… food for thought.

Interesting how the game changes? Well, the truth is, the game is changing, and will continue to change as we explore best practices for utilizing a Hadoop platform.

OK, what about satellites?

Well, we are in the home stretch of this blog entry. Satellites are easy (comparatively). If you accept the premise that hubs are necessary logically, and that links are necessary both logically and physically, then the rest of the files (in whatever format they are in) are satellites. In other words, you are still after "replicating only the data as needed to access the full source on demand." Otherwise, you would go with copying / munging / changing the data and writing a new file (a report collection, as defined above).

The only thing left to do is ask: how do we access satellites / files efficiently through pointers and "joins"? Do we use natural keys? Do we use hash keys? Do we use links? Do we pick the data up and move what we want to a relational / structured world so ad-hoc works properly? Do we wait for the Hadoop ecosystem to "catch up" to the relational space in its functionality?

These are all good questions, and ones you will have to answer on-site, depending on your needs. Today, Sanjay and I leave the files in place but tag them with hash keys where possible; we then implement Hive internal structures to provide "indexing" and "joining" capabilities on the data sets that can be structured. Sanjay and I are still using the full Data Vault context of hubs, links, and satellites to work through all these issues.
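"Tagging files with hash keys" might look roughly like this sketch (a sidecar file per data file is one possible mechanism, shown purely for illustration; it is not how we necessarily implement it):

```python
import hashlib
import json
from pathlib import Path

def tag_with_hash(data_file: Path, business_key: str) -> Path:
    """Write a sidecar .tag file carrying the hash key for the file's
    business key, so downstream 'joins' can match files by key without
    rescanning their contents."""
    hk = hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()
    tag = data_file.parent / (data_file.name + ".tag")
    tag.write_text(json.dumps({"hash_key": hk, "file": data_file.name}))
    return tag
```

The sidecar gives structured tools (Hive, or a relational landing zone) something key-shaped to join on, while the raw file stays untouched.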

I would love to have your feedback. If you've run any trials or test cases, please elaborate here on what you've found. We need to explore these concepts as a worldwide group.

Hope this helps,

Dan Linstedt


10 Responses to “#bigdata #noSql and #datavault 2.0”

  1. Julian 2015/07/09 at 2:09 am #

    I am researching data models for a big data hub on the Hadoop ecosystem at the moment. We process event-based weblogs from different sources, including Google Analytics. The goal for now is to create standard reports and do some ad-hoc analysis on the raw data. For that purpose we load the data from the hub into a relational DWH. But the data hub must stay generic for other frontend services and tools; for example, in the future maybe there will be some sort of intelligent data mining system.

    I read a lot about traditional star schemas and 3rd normal form models. I found Data Vault by chance, and I think it is exactly what I need. The schema must be flexible and adaptable.

    I implemented a basic DV model on one exemplary data source with Apache Hive. This started before this blog post came out, so I decided to do it very naively for the first try and implement it the way the DV specification suggests.

    Now I have some questions about this blog entry. You say that hub tables don't add value and should only be logical structures. I definitely understand why we need to model the hubs logically, but why are we not supposed to implement the hubs physically? I didn't understand the paragraph "Back to the physical store for a second". Can you maybe explain this in other words, with an example? For example: one entity is a webpage and another entity is a visitor. A visitor can visit a webpage. One visit is one entry in the link table between the two entities. And I have hub structures for visitors and pages, while a visitor is identified by a calculated key from the tracking service and a page is identified by its unique page path.

    On the link structures: I think they absolutely have to be conformed. So in my web analytics example, I have to do some sort of "semantic normalization" or "event normalization" before I push the data into the DV. If the schema weren't conformed, the whole affair wouldn't make sense in my view; then I could just as well use the raw data structures I get from the tracking service.

    In general, at the moment I'm not sure if normalization of event data into a really normalized data structure such as Data Vault is the right way. On the one hand, it is really flexible for additional attributes or dynamic requirements. But on the other hand, it produces a lot of join operations when bringing the data to the next layer (the relational DWH, for example).

    One other thing I really want to discuss: did you ever hear of the Lambda Architecture? Nathan Marz suggests a graph schema for the master dataset (which seems to be something like a data hub or "single source of truth", or call it an "Enterprise Data Warehouse"). I think this graph schema is logically exactly the same as the Data Vault approach: hubs are the nodes, links are the edges, and satellites are the additional attributes. And if you implement Data Vault in Hive, every table is one physical file or folder. Isn't it really the same approach, to split objects from their relationships and descriptive attributes? Maybe I'm seeing this totally wrong… I'm really excited to hear a statement from you. Thank you in advance.

  2. Dan Linstedt 2015/07/09 at 7:36 am #

    Hi Julian,

    Thanks for commenting, I appreciate it. I’m happy to hear you are working with DV, and that it is helping you succeed.

    Before I get too deep, as you have a number of questions / thoughts buried in your comment… Let me say these things:

    1) Hubs have no value in pure HDFS, if you are using Hive, or other relational technology, then YES they have a LOT of value
    2) Satellite restructuring doesn’t do a lot of good if you are using pure HDFS. If you are using Hive or other relational technology, then MAYBE Satellite restructuring is a good thing – it MAY add value, it MAY NOT – it depends on the case, the hardware, the capabilities of the platform and so on.
    3) Links always have value – regardless of the technology.

    Yes, you got it right – the DV model is built to be flexible and scalable – absolutely.

    Now, with regards to normalization…

    WHEN a DV is built in a Hadoop ecosystem, normalization (restructuring of the Satellites) is a BAD idea. Hadoop has various storage formats (key-value, JSON, XML) and performance optimizations underneath (Parquet, and so on). Do not normalize satellite data in Hadoop; this is the message, AS LONG AS you can KEY the documents appropriately (compute the hash keys, or use the natural keys).

    When the DV is built on a RELATIONAL / traditional RDBMS (Oracle, SQLServer, Sybase, DB2 UDB, Teradata, MySQL), then normalizing the Satellites MAY have added benefits. Lastly, these are things I'm currently building training classes for; please keep that in mind. This is shifting sand (as far as standards go), because the technology is constantly MOVING forward at a high rate of change.

    With regard to the LAMBDA architecture: no idea who this individual is. But I state in Super Charge Your Data Warehouse that the DV model is BASED on graph networks, scale-free mathematics, and hub-and-spoke architecture. I've been working with graph technology in relational databases since the 1990's. But there are people who have successfully implemented DV models directly in graph database engines; they are made to work together.

    Hope this helps,
    Dan Linstedt

  3. Sanjay Pande 2015/07/30 at 9:24 am #

    Hi Dan and Julian,
    I'm jumping into the conversation here. Both of you are discussing stuff that I'm researching, so it's interesting to me. My perspective is only in addition to Dan's.

    1) Hubs have no value in pure HDFS, if you are using Hive, or other relational technology, then YES they have a LOT of value

    I agree with this to an extent, but "pure HDFS" is essentially a misnomer: it's just a filesystem. You don't really use just a file system for solutions, or you'll end up building massive parsers for everything. Hive and HBase both use HDFS for storing their data. Hive has a flexible storage format: if you use ORCFile, you have an optimized columnar-type storage format; if you use Avro, you have splittable binary JSON, etc. These files are readable through Hive and various other tools, but are not really text files per se. The only people I know who use the file system directly to build solutions are product creators/vendors, and those staging data to be pre-processed and/or loaded.

    With Data Vault 2.0, you essentially hash the business key, which enables you to expand your data model across systems. You can have a satellite table with audio stored in HBase, and a JSON satellite stored in some other tool (or in HDFS directly), connected to a Hive hub table, for example.

    In essence what he said is correct, but I want to expand on the fact that he's not against hubs (at least I hope not).

    Also, personally I see no value in creating any solution on pure HDFS (except for maybe archiving).

    2) Satellite restructuring doesn’t do a lot of good if you are using pure HDFS. If you are using Hive or other relational technology, then MAYBE Satellite restructuring is a good thing – it MAY add value, it MAY NOT – it depends on the case, the hardware, the capabilities of the platform and so on.

    I actually disagree with this. Storage formats like JSON are used essentially for structural flexibility. As long as the documents within the collection have a sort of master key, a hash tied to a hub somewhere, I think it has tremendous value, since re-structuring is naturally easy here (this is, of course, use-case dependent).

    3) Links always have value – regardless of the technology.

    Links have always made the DV extremely valuable, since you can join across domains, geographies, and now even across varied storage technologies (which is how DV 2.0 can extend beyond relational databases).

    4) Graph databases: I think you can build a DV in a graph database, since they're so similar to each other, but graph databases are again good for certain use cases. They would serve pretty well for the hubs and links, but satellite data may be better off in set-based structures. Also, when you start getting into things like link networks and analyses, complex hierarchies, etc., graph databases are a better choice than relational. However, for processing set-based data, relational is the better option.

    For Julian's use case above, I'd probably use Hive with the ORCFile format and a splittable compression algorithm like bzip2 or LZO, since all the data is in essence tabular.

  4. Julian 2015/08/03 at 3:07 am #

    Hi Sanjay and Dan,

    I am writing my thesis about data models on a data hub at the moment. I totally see the advantage of data vault regarding flexibility, agility and scalability.

    But what I am really not sure about is the performance. In general, people say JOINs in a Hive/MapReduce environment are bad. But because of the normalization caused by Data Vault modelling, you potentially produce a lot of tables, which in some cases have to be JOINed together for use by services on top.

    So what are your experiences with the performance of a Data Vault model in a Hive environment? My experience is that in most cases a fully denormalized "one big table" structure performed faster; only a few queries which selected just one or a few of the attributes were faster. But I also have to say that my test dataset is not really big, and the cluster has an irregular workload, so it is really difficult for me to produce representative results.

    For my thesis, I need some references on the performance aspects. Do you have some figures I could refer to? What are your arguments and experiences here?

    I also wondered if you really benefit from using a columnar format like ORC or Parquet for your Data Vault model. Data Vault produces an implicit vertical partitioning of your data (depending on your satellite grouping strategy). So if you separate every single attribute into its own satellite, physically all attributes are stored separately, which is what a columnar format basically does. So you can decide how you separate your attributes (group them by source system, by change frequency, etc.). I think you have the most flexibility when you separate every attribute into its own sat. But then you potentially produce even more JOIN operations… hmm, my head is spinning!

  5. Dan Linstedt 2015/08/03 at 6:01 am #

    Hi Julian,

    As much as I would love to give you performance numbers, I simply don't have the time to execute the tests. But I will say this:
    In general, Data Vault modeling has nothing to do with the technology underneath. Data Vault 2.0 modeling should be leveraged at the logical level when dealing with NoSQL technology. Remember: NoSQL can include graph databases, column stores, and a variety of other formats (including key-value stores, triple stores, etc.). Data Vault modeling is first and foremost about structuring data that passes through the business process layers, and attaching value to it.

    That said: we have several projects from around the world that have used DV2 modeling techniques inside Hive and in columnar stores, and they've succeeded. Remember that Hive (as Sanjay would tell you) has a number of storage options, and of course it also depends on WHICH NoSQL platform you are running Hive on top of. You will get different results for MapR than you will for HortonWorks, and each differs from a Cloudera installation.

    You need to think about this: Data Vault modeling is at its best when the data architect understands how to leverage it logically, attaching data to the business processes through the flow of the business keys, regardless of the physical implementation.

    Whether or not you physically implement a Data Vault model doesn’t matter. What matters is: can you identify properly, business keys? can you identify relationships across multiple systems and data states? can you describe the business keys and relationships accurately?

    I do have a couple of customers with BIG success running a real-time DV2 model on top of a columnar store. Just remember: a column store at the physical level is equivalent to 6th normal form; there is no such thing as "normalized vs. denormalized" modeling at that level, it's all the same! It's neither DV nor star schema, not even anchor modeling… every column has its own table at the physical level. That just backs up my statements above: DV modeling is done at the logical level for the most benefit.

    Regarding performance numbers: no, you will have to run those experiments yourself. Your university should have access to cloud / student instances of different products. Just remember: any time you publish "performance numbers," you need to state the versions of the OS and the Hadoop software you are using, along with all the hardware specs, so that others can reproduce your efforts if they want.

    Besides, if you are looking to physically "normalize" data in a NoSQL platform, then you are "doing it wrong"… food for thought.


  6. Sunile Manjee 2015/09/12 at 12:23 am #

    I work for one of the largest BI platform companies, and I am building a practice for Big Data modeling. I found this entire conversation interesting. My expertise is in Hadoop, and I am constantly researching additional modeling techniques for Big Data. With FULL RESPECT, the following statement did not sit well with me as an answer to Julian's question:

    “Whether or not you physically implement a Data Vault model doesn’t matter. What matters is: can you identify properly, business keys? can you identify relationships across multiple systems and data states? can you describe the business keys and relationships accurately?”

    Offering only logical models in big data is not enough. Julian's questions and/or concerns are legit and need to be addressed. Many NoSQL engines are not built for scans; they need the access path predefined for random reads and writes. Joins may cause performance deterioration. Hive allows joins, but there are limits here as well. Structuring your data set to reduce the number of mappers is vital. Many create hyper-dimensional structures; these structures are a maintenance nightmare and lack any application toward a unified model. Logical data models (using star or 3NF) can represent and identify business keys and relationships across multiple systems and data states accurately. So in big data, what value does DV add when it comes to both logical AND physical? A logical-model design pattern using 3NF, star, or DV is a religious battle I personally am not interested in. What I want is solutions which span logical AND physical. Providing only a logical solution in big data is not enough for implementation.

  7. Dan Linstedt 2015/09/12 at 4:57 am #

    Hi Sunile,

    Thank you for your comments. I am sorry that the statements don't sit well with you; however, it is not enough to simply disagree with a statement, you need to offer proof of what works in your specific environment. The statements you quoted are strictly about this fact: what works at the physical layer for Hive doesn't work at the physical layer for a JSON store, or MongoDB, or Cassandra. What works for an XML store doesn't work for a column-family store, and what works for a column-family store doesn't work for a key-value store or document store. EACH physical store uses a completely different physical data model; therefore your physical data model must adapt and change to meet the needs of the physical storage system.

    Your logical data model *should not* match your physical store. Why? Because of business understanding: logical models need to match concepts and business cases, NOT bend to physical platform storage rules.

    Yes, you need physical data models (I am not sure you read far enough into my post or my comments if you think I claimed otherwise). But your physical models will be tied directly to the physical platform on which you implement.

    "Many NoSQL engines are not built for scans…": unfortunately, you took my statement out of context. I was speaking strictly about the Hadoop platform, NOT about NewSQL engines (the ones that allow indexing and directed access paths). There are plenty of hybrid solutions called NewSQL engines that do ALL kinds of hybrid things, like indexing and joins across multi-structured and fully structured data sets. Please don't take my statements out of context. Some include the Stinger initiative, better known as SparkSQL.

    What you want, i.e., "a model that goes across logical and physical," does NOT achieve business goals. Therein lies the conundrum: business goals are based on ontologies and concepts, and there isn't yet a single modeling technique that spans them all. Today we are stuck with conceptual models, ontology models, and yes, logical and physical models. It is next to impossible to find "one modeling technique to rule them all."

    Your interest in what you call "Big Data" modeling is driven from a highly technical perspective. A single type of data model, matching both logical and physical formats to meet the needs of the storage layers, CAN be dangerous: it destroys the purpose of logical modeling from the get-go. If you aren't satisfied with my answer, please go speak to Kent Graziano (he's a huge data modeler and understands all of these components in great detail).

    Now, the next topic you bring up: what do you qualify as “Big Data”? I am having a hard time understanding exactly what you mean by it. I’ve defined the term, and what I mean by it, in many different posts. My point is this: you CAN do Big Data in a fully relational system. Teradata, Oracle, SQL Server, and DB2 UDB have all been doing “hundreds of terabytes” for years. We can use traditional techniques like Data Vault, normalized forms, and star schema or dimensional modeling on these systems. So I ask again: please clarify exactly what you mean by Big Data.

    Thank you kindly,
    Dan Linstedt

  8. SM 2015/09/12 at 11:16 pm #

    I agree with your statements about logical models. However, most models which started with relational theory do not transform well physically into the Hadoop ecosystem. Existing tools and design patterns are constrained by relational theory and therefore naturally struggle in a Hadoop implementation. I challenge you to think of design patterns which can answer business questions (logical) and have an autochthonous (native) implementation in Hadoop.

    Big data? Big data is a marketing term. IoT and interactions are among the few types of data patterns which find themselves in Hadoop. This is due to its ingestion-friendly ecosystem using the “touch it, take it” map-reduce paradigm. You “CAN” do most, if not all, of it on relational platforms, given an inexhaustible budget and the concomitant complex ETL.

    I am in no way anti-relational. I view the world through an analytic ecosystem in which all platforms have a meaningful role.

    Last thought:
    Dan, I very much like your architecture. I just don’t believe it is apropos at this time for Hadoop; however, I believe it can be. If you’re up for it, I would like to collaborate with you further regarding these concepts and their refinement in implementation within Hadoop.

  9. Dan Linstedt 2015/09/15 at 11:27 am #

    Hi Sunile,

    I’m glad to hear that you’re a fan. I think you should take a look at this study:
    Yes, it was done by SAS; however, it shows that a relational model (in this case a star schema) does far better on Hadoop + Hive internal tables than a denormalized model does. I’ve got other cases in the Netherlands that show working Data Vault models operating at screamingly fast speeds on columnar database (NewSQL) technology.

    I think what is important from these emerging cases is as follows: 1) you need to know your data, inside and out, before you can optimize it or design the right model for it; 2) physical storage design in Hadoop / Hive is SEPARATE from physical data modeling, and the two are not always in sync – SOME of the physical data modeling characteristics change depending on the data layout (MPP distribution), the additional index tables, and the storage formats (like Avro, Parquet, RCFile, ORCFile, etc.); 3) Data Vault models are built to represent the business processes at the lowest level of the business hierarchies (by tracking business keys). They are therefore at the lowest level of grain in a logical data model context. They can, have, and do translate well to Hadoop / Hive instances, and will continue to work well in graph DBs (a direct match) and alternative columnar appliances.
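    On the point about tracking business keys: Data Vault 2.0 is known for replacing sequence-driven surrogate keys with hash keys computed deterministically from the business key, which is what allows hubs to load in parallel across platforms. A minimal sketch, assuming one common field convention (trim, upper-case, delimiter-join, MD5); real implementations vary in delimiter, casing, and hash algorithm.

```python
import hashlib

def hash_key(*business_keys):
    # One common Data Vault 2.0 convention: trim and upper-case each part,
    # join the parts with a delimiter, then hash. MD5 is used here only
    # for illustration; the choice of hash varies by implementation.
    normalized = ";".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Deterministic: the same business key yields the same hub key on any
# platform, with no central sequence generator to coordinate.
print(hash_key(" cust-001 ") == hash_key("CUST-001"))  # True
```

    The delimiter matters: joining the parts before hashing keeps composite keys like ("A", "B") distinct from ("AB").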

    Answering business questions is _always_ the issue, regardless of the data model design patterns used underneath. Data Vault models are *NOT* geared for direct business-user consumption (they never have been); rather, they are geared towards being flexible, scalable, and auditable in the back office.

    The reason I asked you to define your term Big Data is that everyone sees it and defines it differently. I don’t believe such a term truly exists, yet I use it for marketing purposes (like everyone else). The same goes for IoT: what really is IoT? Why should we care? By the way, I’ve been in the industry long enough to have called it Machine Generated Data way back when. It’s been around a LONG, LONG time, and I’ve been using what’s now called IoT data in my structured relational warehouses for years and years. To me it’s just all MORE data, faster, these days. I am very familiar with the rapid ingestion rates of Hadoop, and make no claims otherwise.

    Analytic ecosystems are a great, and necessary, component of the outcomes of data warehouses. However, when dealing with raw data, it’s a messy, messy world… THAT is where the Data Vault model plays, helping to clean things up and identify data at the lowest level. Analytics are the “after effect” placed on top of a solid data warehouse that really makes the business sing. In terms of managed self-service BI, I tend to call it “exploration warehousing,” which is what Bill Inmon defined it to be in the late 1990s and early 2000s.

    There is definitely a place in the stack for:
    * Hadoop / Hive performance and tuning
    * Index Tables
    * Normalized Data Models on Hive Internal tables
    * MPP Knowledge, tuning etc…

    There is also a place for:
    * Denormalized models (once the data has been assembled into an analytic format, and has been combed, cleaned, corrected, and aligned) – sure, no problem! This is where the business users can and should “play”…

    The business users belong in the Data Lake, not in the Data Dump, nor in the Data Swamp – that should be left to the Analysts (now called Data Scientists… ugh, another term I dislike).

    By the way, I don’t like the term ETL – ETL as a technology is basically DEAD. ELT is the right, proper, and proven way to scale and move things forward. The transformation needs to live where the data lives. ETL implies a pipeline, and that approach has been proven to be “finished” in the market.
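    The ELT point can be sketched in a few lines. Here SQLite merely stands in for the target engine (Hive, an MPP appliance, etc.), and the staging/target table names are made up for illustration:

```python
import sqlite3

# "EL": land the raw data in the target platform first, untransformed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                 [(1, 10.0), (2, -5.0), (3, 7.5)])

# "T": transform WHERE THE DATA LIVES, as one set-based statement,
# instead of pulling rows through an external row-by-row pipeline.
conn.execute("CREATE TABLE orders_clean (order_id INTEGER, amount REAL)")
conn.execute("""
    INSERT INTO orders_clean
    SELECT order_id, amount FROM stg_orders WHERE amount >= 0
""")
clean_count = conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]
```

    The set-based `INSERT … SELECT` is the whole point: the engine that stores the data also scales the transformation.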

    What I’d love to see, since you don’t believe that Data Vault models fit the Hadoop system, is for you to take the SAS study I linked to earlier, build a Data Vault 2.0 model from the star schema they offer, and then load it with data (the way they did). You would have to tune it appropriately, and then execute similar test cases against a “flattened star / single file / single table”. I bet you might be surprised at the results, especially if you run this on Spark or Tez, or on technology stacks from MapR or Hortonworks…
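    For anyone wanting to attempt the experiment proposed above, the modeling step amounts to splitting each star-schema dimension row into a hub (business key plus hash key) and a satellite (descriptive attributes). A minimal sketch with made-up column names; the hashing convention is one common Data Vault 2.0 choice, not the only one:

```python
import hashlib
from datetime import datetime, timezone

def md5_hex(s):
    # Illustrative hash-key function (trimmed, upper-cased key, MD5).
    return hashlib.md5(s.strip().upper().encode("utf-8")).hexdigest()

# One row from a hypothetical star-schema customer dimension.
dim_row = {"customer_nr": "C-100", "name": "Acme", "city": "Utrecht"}

load_dts = datetime(2015, 9, 15, tzinfo=timezone.utc).isoformat()

# Hub: only the business key, its hash key, and standard DV metadata.
hub_customer = {
    "hub_customer_hk": md5_hex(dim_row["customer_nr"]),
    "customer_nr": dim_row["customer_nr"],
    "load_dts": load_dts,
    "record_source": "star_schema_dim_customer",
}

# Satellite: the descriptive attributes, keyed by the same hash key.
sat_customer = {
    "hub_customer_hk": hub_customer["hub_customer_hk"],
    "name": dim_row["name"],
    "city": dim_row["city"],
    "load_dts": load_dts,
}
```

    Repeating this per dimension (plus links for the fact-table relationships) yields the Data Vault 2.0 model to benchmark against the flattened star.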

    I’d be curious to hear your findings.
    Dan Linstedt

  10. Jorge Ribeiro 2018/06/12 at 6:31 am #

    Any update on this?
