Utilizing Real Time
Real-time data feeds to data warehouses and business intelligence systems can cause big problems. Every angle must be accounted for: people, process, and technology. Real-time data increases the cost and complexity of systems on an exponential scale. Sometimes business users don’t understand just what they are asking for; other times they simply don’t realize that what they want will cost a lot more money than they are willing to provide.
What are the problems involving people?
People require training and understanding. One simple fact is that BI and EDW are different, and most folks have only ever understood these systems in batch mode. In a real-time environment, batch thinking no longer works. The resources involved have to consider different aspects of data arrival latency and the timing required to perform certain operations. For example: suppose you have an individual who is very capable and well versed in loading several million rows at a time through business logic, and the routine has a two-hour refresh/load cycle.
The business now needs 2-second updates delivered through a business queue/messaging system. The individual simply extends the “existing” load pattern to “hook” to a queue for input. BUT the run-time of the process is 25 seconds at a minimum per transaction, and it’s worse than that: some transactions never make it through because they are “waiting” for other systems, parent records, and external dependencies. Meanwhile, more transactions are “piling up” in the queue, creating a backlog of information waiting to get through the processing cycle.
The point is: resources need to be re-trained. You cannot simply take what worked in a batch environment and “drop it in” unmodified to a real-time environment. Training for both the business leads and the technical staff is required.
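To put rough numbers on the queue example above, here is a minimal sketch. It assumes one message arrives every 2 seconds and each takes 25 seconds to process (one reading of the scenario), and that nothing else stalls the consumer. The function names and the `workers` parameter are illustrative, not part of any real system.

```python
import math


def backlog_after(seconds, arrival_interval_s=2.0, service_time_s=25.0, workers=1):
    """Approximate number of messages still waiting after `seconds`.

    Assumes steady arrivals and a fixed processing time per message.
    """
    arrived = seconds / arrival_interval_s
    # Workers cannot complete more messages than have arrived.
    completed = min(arrived, workers * seconds / service_time_s)
    return arrived - completed


def workers_needed(arrival_interval_s=2.0, service_time_s=25.0):
    """Smallest number of parallel consumers that keeps up with arrivals."""
    return math.ceil(service_time_s / arrival_interval_s)


one_hour_backlog = backlog_after(3600)
print(one_hour_backlog)   # messages still queued after one hour, single consumer
print(workers_needed())   # consumers needed just to break even
```

Under these assumptions a single consumer is about 1,656 messages behind after an hour, and it would take 13 parallel consumers just to keep pace. Even then, this ignores the transactions stuck “waiting” on parent records and other systems, which is why the load pattern itself has to change.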
What are the problems facing infrastructure?
Wow, infrastructure? Really? I thought what I had in place was fast enough. Well, maybe and maybe not. It depends on what you (the business users) can justify, versus what you’re willing or able to live with in terms of response time. You see, moving data around over wires, across motherboards, through systems, and on and off disks at higher rates of speed costs money. It’s a sunk cost that depreciates every month. What you should be evaluating is whether the benefit you reap justifies how much you have to invest.
For example: let’s say that it costs $250,000 to build an infrastructure that moves data from server to server every 2 minutes, and today your business can live with 2-minute updates. Everything is working great until someone discovers that you are losing million-dollar deals because you can’t respond within 1 minute. The cost of building a 1-minute refresh infrastructure is $250k ^ x, where x is the exponent (it depends on the people, the size of the effort, how outdated your existing infrastructure is, etc.). Moving from 2 minutes to 1 minute (double the speed) is exponentially more costly. Moving from 1 minute to 30 seconds is again exponentially more costly.
Infrastructure upgrades are frequently required as the first step in solving real-time BI problems, or, to use the buzzword of the day, operational BI. The two are really one and the same (in my opinion).
What about all the business rules, don’t they get in the way?
Yes they do, and what a mess they make! Going back to our case, suppose we needed to process 8,000 transactions per second (ingestion rate). Our current process runs at 12,500 transactions per second – and at first glance, it feels safe. But there’s a hidden danger lurking in the shadows!! The process depends on lookups to two source systems that aren’t available until 10pm that night. Wow, this is a bummer. This data is then delivered to the next process in the stream, which runs at 4,000 transactions per second. The next process in the stream performs merging, mixing, matching, and some address cleansing or data quality cleanup - all before the data is ready to be consumed by the business.
What’s the maximum throughput, in transactions per second, of these two batch processes when strung together? Right, 4,000 transactions per second, because the second process consumes the output of the first and the chain can run no faster than its slowest stage.
So, if transactions are arriving at 8,000 per second and our second process can only deal with 4,000 per second, what happens to our inflow? It backs up, and the backlog never clears. But wait, it gets worse. Remember that the first process can’t even start until 10pm that night? That means if it’s 8am, transactions back up in the queue for 14 hours before processing even starts. Wow. And WHILE both processes are running, even more transactions get backed up. How is this real-time?
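The arithmetic behind this example can be sketched in a few lines. This is a simple linear model using the figures from the text (8,000 arriving per second, stages running at 12,500 and 4,000 per second, a 14-hour wait); the function names are made up for illustration.

```python
def pipeline_rate(stage_rates):
    """A chain of stages runs no faster than its slowest stage."""
    return min(stage_rates)


def backlog(arrival_rate, stage_rates, wait_hours, run_hours):
    """Transactions queued after waiting `wait_hours` for the pipeline to
    start and then running it for `run_hours` while arrivals continue."""
    queued_while_waiting = arrival_rate * wait_hours * 3600
    net_growth_per_second = arrival_rate - pipeline_rate(stage_rates)
    total = queued_while_waiting + net_growth_per_second * run_hours * 3600
    return max(0, total)  # a queue cannot go below empty


stages = [12_500, 4_000]
print(pipeline_rate(stages))            # effective throughput of the chain
print(backlog(8_000, stages, 14, 0))    # queued by 10pm, before anything runs
print(backlog(8_000, stages, 14, 1))    # queued one hour after processing starts
```

With these numbers the chain tops out at 4,000 transactions per second, about 403 million transactions are waiting by 10pm, and the queue keeps growing by another 14.4 million for every hour the processes run.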
Business rules are the culprit here. The closer you get to near-zero latency, the less and less opportunity / time you have to run business rules over the data. In other words, it’s not called a stream for nothing!
So here’s one hard and fast answer, folks: if you want real-time updates to your BI / EDW system, you must move the business rules downstream from the data warehouse!! By downstream, I mean: let the raw data flow into the data warehouse, and apply the business rules when loading it into the data marts.
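As a rough illustration of what “downstream” means here, the sketch below loads records into the warehouse exactly as they arrive and applies business rules only when the mart is built. Everything in it is hypothetical: the in-memory `warehouse` list stands in for raw warehouse tables, and the address rule is just a placeholder for real business logic.

```python
from datetime import datetime, timezone

warehouse = []  # stands in for the raw warehouse tables


def load_raw(record, source):
    """Append the record as it arrived: no lookups, no cleansing, no waiting."""
    warehouse.append({
        "payload": dict(record),
        "source": source,
        "loaded_at": datetime.now(timezone.utc),
    })


def clean_address(value):
    """Placeholder business rule: collapse whitespace and fix casing."""
    return " ".join(value.split()).title()


def build_mart(rows):
    """Apply business rules downstream, at data mart load time."""
    mart = []
    for row in rows:
        payload = row["payload"]
        if not payload.get("address"):
            continue  # placeholder rule: the mart only takes rows with an address
        mart.append({
            "customer": payload.get("customer"),
            "address": clean_address(payload["address"]),
        })
    return mart


load_raw({"customer": "c1", "address": "  12  main   st "}, source="crm")
load_raw({"customer": "c2", "address": ""}, source="crm")
mart = build_mart(warehouse)
print(len(warehouse), mart)
```

The warehouse keeps both records untouched, dirty address included, while the mart receives only the cleaned row. The load path stays fast because nothing in `load_raw` can block on another system.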
What? Dirty data in my warehouse?
If you want real-time, then this is the reality you have to face. YES, dirty data will be pushed into the warehouse as it absorbs the transactional streams as fast as they arrive. This is the only way to give real-time feeds the direct information flow they need to keep operating as the business dictates. By the way, the minute you push a BI/EDW system into a real-time mode (operational BI), you have just dictated that it’s no longer a strategic system but an operational one. It now inherits all the responsibilities of an operational/transactional source system, from system-of-record status down to 24 x 7 x 365 up-time and on-call operations people to keep it running.
But yes, dirty data will be in the warehouse because there is no time to clean it up, munge it, look up other values, alter it… you get the idea. On the other hand, now would be a good time for you to TAKE OWNERSHIP and decide to “fix” the dirty data where it needs to be fixed… In the SOURCE SYSTEMS. That’s how information quality really begins to take shape.
Wait a minute, hold on… Performance of my systems seems to be a problem. What can I do?
Many of my Data Integration folks say “The database is too slow”, and many of my systems folks say “The ETL routines are too slow”. Clearly there is a performance problem somewhere. How do I stop the finger pointing and get to the root of the issue? What IS the issue?
The name of the game when dealing with real-time is scalability. You need to become familiar with the terms: scale-out, MPP (massively parallel processing), and scale-free network. Scalability starts and ends with architecture. Put the wrong architecture in place, and eventually you have to kill the effort and start over, or re-architect the entire solution. For example, one day you decide to build a house… you lay a foundation for an 1,800-square-foot ranch house. Several years go by, and now you want a 5-story office building. Can you re-use the foundation for the 1,800-square-foot house? Most likely not. You have to completely demolish it and lay a new foundation, a new architecture.
Data models play exactly the same role as the foundation for the house. You have to choose the right architecture for the foundation in order to scale properly, and to enable the streaming data of real-time to reach the database in parallel without deadlocks and without binding processing to sequential operations.
The only data model architecture we know of that has been proven to scale to 3 petabytes (and still growing), and that can handle 10,000 transactions per second times 10 feeds, is currently Data Vault Modeling. The model has been architected to allow you to easily reach the level of real-time and scalability that you desire. Now don’t get the wrong idea: the Data Vault Model is not a product and it’s not for sale. In fact, it’s an open-source architecture, if you will, free and publicly available. We can help you along the way to understanding the nuances, rules, and best practices.
So what’s the bottom line?
The bottom line is this: real-time takes time, and costs money. It requires training, knowledge, infrastructure, and at the root of it all, the right architecture. We offer you the ability to jump-start your efforts and take advantage of all the lessons learned, so you don’t reinvent the wheel. We offer you a unique coaching opportunity to learn and interact with skilled individuals who have built these systems successfully.


