here we discuss the impact of “additional tables” on etl jobs. the impact really is negligible, but even so, it would be good for you to read this entry to understand why.
if you have comments, thoughts, or other experiences, i encourage you to add a comment at the end of this posting. as you know, i’ve launched one-on-one coaching; this is the kind of knowledge you get within the walls of those coaching sessions. but for today, i’m giving you the answers free, to show you the kind of value you get when you sign up. contact me today at firstname.lastname@example.org for more information.
how does the introduction of additional hub links impact existing etl jobs?
it doesn’t. all it does is “add” additional etl jobs to be executed, and get this: each one is fast, restartable, auditable, flexible, and compliant. i’ve put 25 years of research and design into the etl that i generate with the saas services, so by using those services you gain all of that knowledge. better still, by subscribing to my coaching you get access to these services for free. try it out for yourself.
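to make the point concrete, here is a minimal sketch of what one such additional hub-load job might look like. the table and column names (`stg_customer`, `hub_customer`, `customer_bk`) are hypothetical, and sqlite stands in for the real warehouse; the pattern, not the platform, is the point: each job has one narrow target, which is what makes it small, restartable, and auditable.

```python
import sqlite3

# Hypothetical tables: stg_customer (staging) and hub_customer (Data Vault hub).
# A hub-load job does one narrow thing: insert business keys not already in
# the hub. Re-running it inserts nothing new, so the job is restartable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_bk TEXT, load_date TEXT, record_source TEXT);
    CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY, load_date TEXT, record_source TEXT);
    INSERT INTO stg_customer VALUES
        ('C-100', '2024-01-01', 'CRM'),
        ('C-200', '2024-01-01', 'CRM'),
        ('C-100', '2024-01-01', 'CRM');  -- duplicate in staging
""")

def load_hub(conn):
    # Insert only business keys the hub has not seen yet (idempotent).
    conn.execute("""
        INSERT INTO hub_customer (customer_bk, load_date, record_source)
        SELECT DISTINCT s.customer_bk, s.load_date, s.record_source
        FROM stg_customer s
        WHERE s.customer_bk NOT IN (SELECT customer_bk FROM hub_customer)
    """)
    conn.commit()

load_hub(conn)
load_hub(conn)  # restartable: a second run adds nothing
rows = conn.execute("SELECT customer_bk FROM hub_customer ORDER BY 1").fetchall()
print(rows)  # [('C-100',), ('C-200',)]
```

notice that the second call to `load_hub` is harmless: because the job only inserts keys the hub lacks, a failed batch can simply be re-run from the top.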
take it from me, i’ve worked with data sets as large as 3 terabytes, with inflow sizes of 20 to 40 terabytes per batch. these are not small numbers, and when you deal with data this large you have one thing in mind: performance. there may be more etl routines to deal with, but each one is smaller, easier, faster, and more nimble than anything you’ve ever experienced. and by running parallel processes, you can stretch the limits of the machine’s hardware. you can finally prove to your business users that you are using all the hardware they bought you.
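because each job is small and independent, jobs within a layer (all hubs, then all links, then all satellites) can run side by side. a minimal sketch of that idea, with hypothetical job names and a sleep standing in for the real load work:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical job list: each load is an independent unit of work,
# so jobs in the same layer can run concurrently and use the
# hardware the batch window would otherwise leave idle.
def run_job(job_name):
    time.sleep(0.1)  # stand-in for the real load work
    return f"{job_name}: ok"

hub_jobs = ["load_hub_customer", "load_hub_product", "load_hub_order"]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_job, hub_jobs))

print(results)
```

in a real scheduler the parallelism would come from the etl tool or the database itself; the point is simply that independent targets mean no ordering constraints within a layer.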
think about it: today’s typical etl loads use on average only 40% to 60% of the hardware per batch cycle during peak load. why? there are many reasons for that (all covered in the coaching section here), but the main one is complexity. putting the business rules upstream, with multiple sources, multiple lookups, and multiple targets, makes for very messy etl later: etl that requires constant maintenance and upkeep. i’m so wound up about this topic that it will take me 10 to 20 more blog entries just to describe the causes and effects, but the benefits of the data vault approach and its corresponding etl are clear.
again, i refer to my etl example above: a “typical” federated data warehouse load cycle for 50 staging tables and 1 tb of data may take 3 to 4 hours. the same data vault load cycle takes 20 to 40 minutes, even though many more jobs are running.