Why & When Data Warehousing? Is it Relevant?

there are many questions around data warehousing, ranging from when to do a formal data warehouse vs when to use a data mart/subject oriented star schema approach vs when to use federated now data.  other questions are why would you use a data warehouse today?  is it still a relevant approach in information management?  in this entry i will provide my opinion on the subject, and we’ll see what we can discover.  as always, i’d love your comments and feedback on the subject.

why should i use a data warehouse?

there are hundreds of reasons why a data warehouse is useful to your organization, i would suggest the following list be a good starting point:  (if you have these needs you may need a true back-end enterprise scalable historical data store : or enterprise data warehouse)….

  • real-time issues – your current systems aren’t enabled to integrate disparate sources of data and keep historical records of those integrations, in near real-time.
  • scalability issues – you have tons of historical data you need to gather in to an easily accessible place, common formats, common keys,  and common access methods.  and you need to ensure that the system is scalable over the next 3 to 5 years.
  • avoidance of siloed solution sets – if you have many different or disparate solutions already in existence, yet your corporation is unable to answer common questions requiring consistency across your enterprise.
  • enterprise class system of record – across historical and integrated data sets, if you have a need to do this, you probably need an enterprise data warehouse
  • disparate source systems along with internal and external data sets – if you need to ingrate all of these for a single enterprise vision with history, then you need a data warehouse.
  • self-service bi – if you have a need to eventually reach this goal, where users can “visualize” and construct their own reports, then you probably need an enterprise data warehouse, along with it’s highly integrated historical facts from all the different sources in your organization.
  • kick start for a master data management initiative.  if you want master data, then it is important to understand the nature of your history – where the problems exist, how the data does and does not align with business perception, and basically where to “get” the golden copies of records you want to begin populating your mdm solution (remember: mdm is not just a tool, it’s people, process, governance, and so on)..  yes, you can build mdm solutions without a data warehouse, but how good is your confidence that the data you selected is truly “gold copy” if you don’t have historical evidence to back it up?
  • if you do any sort of data mining, you need a data warehouse.  data mining is becoming (or already is) the heart-and-soul of better decision making in bi.  and of course, the mining engine is only as smart as the domain of information that you provide to it, along with the model that is designed.  statistics say: you can project for 1/2 as much time as you have history for.  so: with 2 years, you can project (with some accuracy) only 1 year out.  the same goes for data mining initiatives, and the better interconnected the data set is (by business keys across the enterprise) the better your data mining confidence ratings will be.

when should i not build a data warehouse?

if you have a need for a proof of concept, a quick win, or a single subject area of answer sets you need to deliver tomorrow without regard to an enterprise view, and without regard to any of the other issues listed above.  in other words, build a single data mart / single star schema for this purpose.  but be aware, that unless the data mart is “loaded with raw data”, that it will drive your future towards silo building and disparate information systems.

or if you have a need for only now or current transactional data, in these cases, federation and federated query engines are what you desire – or quite possibly, just more operational reports.  then you wouldn’t need a data warehouse.

eventually though, if you truly want deep analytics on the history of your company and all the data that it “uses/consumes and produces” daily, then you should consider a data warehouse now.

do not confuse data warehousing with non-agility…

too many people i see today think: “data warehouse = big project, long time to implement”  that couldn’t be further from the truth!!  i’ve got actual case studies where (after the requirements are written, and the infrastructure is in place) the data warehouse was scoped, designed, built, loaded along with 5 star schemas, inside of 2 weeks….  if that isn’t agile, i don’t know what is.

tool sets (being what they are today) can really help make your data warehousing efforts agile…  don’t make the mistake of thinking that just because you hear the term: “data warehouse” that it means it’s going to be a long, drawn-out implementation – it just simply isn’t true.

but wait…. there’s more!

now speaking of agility, once you have a data warehouse in place, there’s no reason that you can’t produce data marts even faster than producing them without a data warehouse.  i have actual cases where data marts can a) be defined in a 2 page requirements document and b) be delivered within 45 minutes after receiving the filled in requirements document from the business user…  based on enterprise data warehouse (enterprise class data sets).   ok, so this isn’t as quick as “self-service bi”, and maybe sometimes it’s not as quick as federated query engines…  however, it does give you one big advantage over both of these things…  you can use these tools to “absorb” the enterprise data warehouse, all the governance, all the integration work, and everything else that has been done.

when should i use a federated query approach?

this used to be called eii (enterprise information integration), basically all a federated approach is (in query land) is the ability to integrate by logical data model across multiple source systems, then – have an engine that is smart enough to split the query and understand where to go to get the data.  slap a front-end bi tool on the federated query engine, and you’re good to go right?  well, only sometimes! this works if and when:

  • the data set definitions are aligned
  • the data elements contain the same grain of data
  • the source systems are accessible (and not currently overloaded with oltp operations)
  • the data has arrived from your exeternal data sources
  • the flat files you might be accessing are properly structured to meet the query needs
  • the data governance, security, and login components are all in-place.
  • you only need now  data (what’s current)…  you cannot get history from a federated query approach unless the source system carries history, even then – it’s usually limited amount of transactional history.  to see the full enterprise integrated history, you need a data warehouse!

ok – so what are we talking about here?

the term: “data warehouse” is well over 20 years old, but it’s an evolving specification.  it is just a conceptual definition for a logical data store which houses integrated history, you know: time variant, non-volatile, functionally oriented, raw data, integrated by business key (the definition of a data vault model).

is it still relevant in information management today?

so perhaps the term “data warehouse” doesn’t do it for you anymore, perhaps the “buzz” of the term has lost it’s shine, but i’m here to say that the concepts are more needed now than ever before.   today’s data warehouse (when built properly & with the right architecture) can be a huge asset on the books, and is sometimes required by law (for auditability and traceability reasons).

if it weren’t relevant, we would no longer have jobs in this part of the industry, we would be working on something new, something different.  but i stand here today to say: the name or logical label “data warehouse” is alive and well, only the architecture and impementation specifications have changed.  as have the tool sets.

so now it’s your turn… do you think it’s relevant?  did i miss something here?  do you hold a completely different opinion?  if so, i want to hear from you.

i hope this helps,
dan linstedt
ps: you can find out more about agile data warehousing, and the new methods for building them efficiently and quickly at: http://datavaultalliance.com

Tags: , , ,

2 Responses to “Why & When Data Warehousing? Is it Relevant?”

  1. John Young 2017/08/02 at 3:58 pm #


    It’s been 6 years since this post. It still seems a relevant topic.

    I’m wrestling with a solution where the customer really only needs NOW data, there is little integration needed with other systems except some data from the master data management system, and business processes are (in theory) the same between the geographical locations that each have their own instance of a common vended application. It looks like a Federated Query approach could work well since the transaction databases are all replicated and available for querying. Currently the bulk of the challenge lies with the business rules that define aggregations and calculations and that flatten data out so it can be processed by statistical analysis tools which prefer flattened record sets.

    I guess I’ve been justifying data warehouses for so long, I get nervous thinking maybe I should avoid a DW and start with federation and see how far that gets us.

    My question is if after 6 years you have anything you would add or subtract from your criteria when it comes to weighing a Data Warehouse versus a Federated Query approach to bringing data together and generated metrics and aggregations for customers to use.

    John Young

  2. Dan Linstedt 2017/08/03 at 9:48 am #

    Hi John,

    Thank you for taking the time to read my entry, and make a thoughtful comment. I really do need to get back to blogging again! 🙂 Anyhow, that said, to answer your questions:
    * I would not add or subtract from my criteria, only repeat what I said six years ago: the platforms and implementations are changing. Some of my customers are now moving their data warehouses to hadoop, using Kudu, and Hive, and Impala, etc.. others are moving to Snowflake DB. There are still bugs and risks with KUDU by the way…
    * I would say: based on your situation, a federated query engine is no longer the right tool. There are only a few out there, I would rather re-direct your interests or efforts for tooling towards self-service BI (which six years ago was an emerging market). There are some really great tools that do what you are looking for (federate operational data), AND give you mining qualities. Including: Datameer, Lookr, Domo, and more… a few other tools are really good at the web-service (and self service without the Mining): QlikSense / QlikView, and Tableau for instance.

    BUT: again, if you need aggregations and generated metrics, especially over history, it is best to build a bonifide data warehouse.

    Hope this helps,
    Dan Linstedt
    Feel free to contact me directly: danLinstedt at gmail.com

Leave a Reply