Data Vault has Many Joins: Part 1

here is a copy of a posting on the data vault forums. in the blog, i will get into the join specifics: the mathematics for the joins, what the joins do, how to optimize them, etc.

the next technical issue on the docket is:
data vault model introduces many many joins

this is true, very true – the dv model remains extremely flexible but produces forced joins across the model in order to get the information out.

business reason?
because the join represents a relationship between two business processes, or a transaction with specific elements existing as attributes. when the business changes, the model needs to change without disturbing the history; it still needs to represent the business as it stood in the past, before the change. the business changes frequently, and the model should not impede the business from making changes.

why? because all relationships are driven through link tables (many-to-many tables). this is a part of the architecture which cannot and should not collapse into one-to-many or many-to-one – if the relationships collapse we lose traceability, auditability, and flexibility moving forward.

the joins exist, that’s a fact of life. what can we do to get the queries to go faster? i’m covering this right now in my data vault book, but here are some brief thoughts:

1. install an mpp shared-nothing rdbms engine that is highly scalable and extremely fast; the dv has a propensity for working extremely well in an mpp environment with volume.
2. build data marts to get the data out, physically instantiate tables – run processes to load those tables as you go.
3. load the data marts at the same time you load the data vault, then run a background “check” process to make sure they sync up (beware, this method has many flaws, and can cost a lot in terms of architecture and horsepower utilization).
4. split the data vault into a copy of history, versus current – i don’t advocate this approach, but if it must be done, so be it. this approach requires separate data release areas like data marts and flat-wide tables for users that contain a full view of the data.
5. use bridge and pit tables to assist with the queries (the most common solution).
6. fold relationships where possible – experiment with higher level grain, relationship folding, and system generated record sources.
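
to make option 5 a little more concrete, here is a minimal sketch of a point-in-time (pit) table, using python's built-in sqlite3 module purely as a stand-in engine. every table name, column name, and data value below is hypothetical and invented for illustration; none of it comes from a real data vault. the idea is that the pit table pre-computes, for each snapshot date, which satellite row was current, so the reporting query can use simple equi-joins instead of "latest row" logic against every satellite.

```python
import sqlite3

# in-memory database as a stand-in for a real rdbms (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical hub and two satellites split by type of data
CREATE TABLE hub_customer (customer_hk INTEGER PRIMARY KEY, customer_bk TEXT);
CREATE TABLE sat_customer_name (customer_hk INTEGER, load_date TEXT, name TEXT);
CREATE TABLE sat_customer_addr (customer_hk INTEGER, load_date TEXT, city TEXT);

INSERT INTO hub_customer VALUES (1, 'C-100');
INSERT INTO sat_customer_name VALUES
  (1, '2010-01-01', 'acme'),
  (1, '2010-03-01', 'acme corp');
INSERT INTO sat_customer_addr VALUES (1, '2010-02-01', 'denver');

-- pit table: for each snapshot date, the latest load_date per satellite
CREATE TABLE pit_customer AS
SELECT h.customer_hk,
       s.snapshot_date,
       (SELECT MAX(n.load_date) FROM sat_customer_name n
         WHERE n.customer_hk = h.customer_hk
           AND n.load_date <= s.snapshot_date) AS name_load_date,
       (SELECT MAX(a.load_date) FROM sat_customer_addr a
         WHERE a.customer_hk = h.customer_hk
           AND a.load_date <= s.snapshot_date) AS addr_load_date
FROM hub_customer h
CROSS JOIN (SELECT '2010-02-15' AS snapshot_date
            UNION ALL SELECT '2010-03-15') s;
""")

# the reporting query now uses plain equality joins through the pit table
rows = conn.execute("""
SELECT p.snapshot_date, h.customer_bk, n.name, a.city
FROM pit_customer p
JOIN hub_customer h ON h.customer_hk = p.customer_hk
LEFT JOIN sat_customer_name n
       ON n.customer_hk = p.customer_hk AND n.load_date = p.name_load_date
LEFT JOIN sat_customer_addr a
       ON a.customer_hk = p.customer_hk AND a.load_date = p.addr_load_date
ORDER BY p.snapshot_date
""").fetchall()

for row in rows:
    print(row)
```

the pit table is loaded by a scheduled process, so the expensive "which row was current?" work is paid once per load rather than once per query.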

we will go into query tuning techniques in the book (which will be done soon), and will help you with these notions.


so what exactly does it mean to the business?
why does it matter? the data vault should not be accessible by business users (except those who are data mining). those who are data mining have been taught how to run queries against the data vault. all other access against the data vault is scheduled, process-based access; and if it’s process-based access – it can be tuned! it is not an ad-hoc query environment unless it sits on a really powerful mpp engine (like teradata, netezza, vertica, or ibm db2 mpp). in the mpp environments, parallelism is king, and having the right hardware eliminates any worries of joins across the tables – most mpp platforms will “flatten” queries before executing them, then run many bits and pieces in parallel until results are achieved (most optimizers try to place the joins where the cost is lowest).

if the access is not ad-hoc, then the queries and the access paths can be tuned once and then “set and forgotten” until the model changes. also, you can add indexes, horizontal and physical partitioning, index organized tables, in-memory tables, hash join indexes, point in time tables, and bridge tables to speed queries along. all of this is entirely possible without destroying the flexibility of the data vault model.

now, back to the business part of this:
yes, the data vault has many joins – you can’t get away from this and still have a really flexible model. from a business perspective, what this allows is fast IT reaction time when the business changes its requirements, adds a new system to the edw, or decides it wants to represent relationships in a new way.

from a technical perspective:
the joins in the data vault are a moot point. why? because of the way mpp works! if you aren’t familiar with mpp (massively parallel processing), or how to set up your rdbms engine for it, then i suggest that you find out how to implement it. sorry for sounding brash, but in this day & age, this skill is critical for any dba on the job. here is a real-life case: in teradata, we had a data vault, and performed a 15-way join across the model with 5 terabytes of information in test – returning 1.2m records in under 4 seconds. teradata didn’t even break a sweat. there are ways to make ibm db2 mpp perform as well – but it relies on the right architecture, and the right operating system. ibm db2 mpp performs really well on linux 64 bit (after it’s been tuned properly, and there is enough memory on many of the boxes to handle the caching/synchronization). i’ve recently heard of a data vault on netezza doing really well (as in not needing physical data marts/physical star schemas to get the data out). in this case, they used views to represent the dimensions and facts, and were able to pull data back very quickly (i am trying to get the performance numbers).
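
as a rough illustration of the "views as dimensions" idea, here is a small sketch using python's built-in sqlite3 module. this is not the netezza implementation mentioned above (i don't have its details); the table names, column names, and rows are all hypothetical. it only shows the general shape: a view joins a hub to the current row of its satellite, and consumers query the view as if it were a physical dimension table.

```python
import sqlite3

# sqlite stands in for the real engine; all names and data are made up
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_product (product_hk INTEGER PRIMARY KEY, product_bk TEXT);
CREATE TABLE sat_product_desc (product_hk INTEGER, load_date TEXT, description TEXT);

INSERT INTO hub_product VALUES (1, 'P-1'), (2, 'P-2');
INSERT INTO sat_product_desc VALUES
  (1, '2010-01-01', 'widget'),
  (1, '2010-05-01', 'widget v2'),
  (2, '2010-02-01', 'gadget');

-- virtual dimension: hub joined to the most recent satellite row per key
CREATE VIEW dim_product AS
SELECT h.product_hk AS product_key,
       h.product_bk,
       s.description
FROM hub_product h
JOIN sat_product_desc s ON s.product_hk = h.product_hk
WHERE s.load_date = (SELECT MAX(x.load_date)
                     FROM sat_product_desc x
                     WHERE x.product_hk = h.product_hk);
""")

# consumers query the view exactly as they would a physical dimension
dim_rows = conn.execute(
    "SELECT product_bk, description FROM dim_product ORDER BY product_bk"
).fetchall()
print(dim_rows)
```

whether a view like this performs well enough to replace a physical star schema depends entirely on the engine and the data volumes; on a small single-node database it is only a convenience.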

on vertica, the case is different – vertica and paraccel operate in different manners.  we have no test cases yet (that i’ve heard of) to provide information about data vault on these devices, but i’m sure it would do just fine, even at scale.

now, why shouldn’t the number of joins matter?
well, take a look at the definition of mpp:

(massively parallel processing or massively parallel processor) a multiprocessing architecture that uses up to thousands of processors. some might contend that a computer system with 64 or more cpus is a massively parallel processor. however, the number of cpus is not as much the issue as the architecture. mpp systems use a different programming paradigm than the more common symmetric multiprocessing (smp) systems used as servers.

in an mpp system, each cpu contains its own memory and copy of the operating system and application. each subsystem communicates with the others via a high-speed interconnect. in order to use mpp effectively, an information processing problem must be breakable into pieces that can all be solved simultaneously. in scientific environments, certain simulations and mathematical problems can be split apart and each part processed at the same time. in the business world, a parallel data query (pdq) divides a large database into pieces. for example, 26 cpus could be used to perform a sequential search, each one searching one letter of the alphabet.

to take advantage of more cpus in an mpp system means that the specific problem has to be broken down further into more parallel groups. however, adding cpus in an smp system increases performance in a more general manner. applications that support parallel operations (multithreading) immediately take advantage of smp, but performance gains are available to all applications, simply because there are more processors. for example, four cpus can be running four different applications. see smp.

as indicated, mpp divides tasks into parallel groups.  yes, it requires more and faster compute power – but in the end, division of work spread across computing power is what it is all about (for speed and performance).  just look at the rise of cloud computing which is mpp, but by having shared nothing resources available on-demand.
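
the "26 cpus, one letter each" example from the definition can be sketched in a few lines of python. this is only a toy: threads in one process stand in for independent shared-nothing nodes, and the function names (`partition_by_letter`, `search_partition`, `parallel_search`) and sample data are hypothetical. the point is the pattern: partition the data up front, let each worker scan only its own partition, then merge the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import string


def partition_by_letter(names):
    """Assign each name to a partition keyed by its first letter."""
    parts = defaultdict(list)
    for name in names:
        parts[name[0].lower()].append(name)
    return parts


def search_partition(rows, needle):
    """Sequential scan of a single partition."""
    return [row for row in rows if needle in row]


def parallel_search(names, needle, workers=26):
    """Scan all 26 letter partitions concurrently and merge the hits.

    Names that do not start with a-z are ignored in this sketch.
    """
    parts = partition_by_letter(names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(search_partition, parts[letter], needle)
            for letter in string.ascii_lowercase
        ]
        hits = []
        for future in futures:  # collect in partition (alphabetical) order
            hits.extend(future.result())
    return hits


names = ["adams", "baker", "barnes", "clark", "zimmer"]
result = parallel_search(names, "ar")
print(result)
```

a real mpp engine does the partitioning when the data is loaded, not at query time, so each node already holds its slice when the query arrives.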

the data vault breaks the data up into “vertical partitions” so that it can be accessed in high-speed parallel processes. if you are not familiar with horizontal and vertical partitions, it is worth reading up on them before going further.

there are many architectural discussions by different database vendors that would provide information on these concepts. the point is: vertical partitioning is not commonly “done” automatically by relational/traditional database vendors. it is done by appliance vendors (datallegro (now microsoft), paraccel, vertica), i.e. column based databases.

if you have an architecture that vertically partitions columns at the right spots – you can have indexes that provide maximum coverage, but more importantly, you can provide the engine with low-cost parallel query options that it didn’t have before.  also, by squeezing the unique values into shorter rows (hubs & links are shorter rows), you get maximum packing of data into disk blocks, making the reads much more efficient – and providing the b+ tree traditional indexes with binary search ability.  when the data is replicated, your engine will go after the satellites – this is where the largest difference happens.  satellites not “printed or used” in the query can be dropped without cost, satellites used in the query are already split apart by type of data or rate of change.  this makes the patterns more “repeatable”, which in turn (when you turn on compression) makes compression ratios multiply (much much higher).  it also increases the performance of the query dramatically. 
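
here is a small python sketch of what "split apart by type of data or rate of change" buys you. it is a simplification under stated assumptions: the wide rows, the column groups `slow` and `fast`, and the function `split_vertically` are all hypothetical, and real satellites would also carry load dates and record sources. the sketch splits wide rows into a hub (unique business keys) and one satellite per column group, storing a satellite row only when that group's values actually changed.

```python
# hypothetical wide source rows: name/city rarely change, balance changes often
wide_rows = [
    {"customer_bk": "C-100", "name": "acme", "city": "denver", "balance": 10.0},
    {"customer_bk": "C-100", "name": "acme", "city": "denver", "balance": 12.5},
    {"customer_bk": "C-100", "name": "acme", "city": "denver", "balance": 9.0},
]


def split_vertically(rows, key, column_groups):
    """Split wide rows into a hub plus one satellite per column group.

    A satellite row is stored only when the values in that group differ
    from the last stored values for the same business key.
    """
    hub = []
    satellites = {group: [] for group in column_groups}
    last_seen = {}
    for row in rows:
        if row[key] not in hub:
            hub.append(row[key])  # hub keeps each business key once
        for group, cols in column_groups.items():
            values = tuple(row[c] for c in cols)
            if last_seen.get((group, row[key])) != values:
                satellites[group].append((row[key],) + values)
                last_seen[(group, row[key])] = values
    return hub, satellites


hub, sats = split_vertically(
    wide_rows,
    "customer_bk",
    {"slow": ["name", "city"], "fast": ["balance"]},
)
print(hub)
print(sats["slow"])
print(sats["fast"])
```

the slow-changing satellite ends up with a single row while the fast-changing one carries all three versions, and a query that only needs the name never has to touch the balance rows at all.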

never mind the fact that you have more tables. again, if you have parallel query available, turned on, and tuned properly – the database shouldn’t even break a sweat until it hits 100 terabytes or so, and even then, with enough compute power to handle the parallelism, you’re right back to 4 second response times – no matter how big the data set gets.

you see, the data vault modeling techniques are based on the mathematics of the mpp architecture, and parallel processing.  basically the enabler for all of this is vertical partitioning in the right places (business keys for hubs, associations for links, and descriptive data for satellites).    no other modeling technique (save one: anchor modeling) brings you this type of parallelism and query ability.  the old saying is: if you have a large problem to solve, divide it and conquer it.  why shouldn’t we apply this logic to our ever growing data sets?

if you want to understand the mathematics behind the data vault, study mpp, study vertical partitioning & horizontal partitioning, study parallel processing versus symmetrical processing.  these concepts are built into the architecture of the data vault model, and when the rules of the data vault are followed, and the hardware and rdbms are tuned properly, then success is sure to follow.

now, let me say this: when would a 3nf out-perform a data vault?
well, in situations where there is no “repetitive data” within the tables, when the indexing coverage is good, and 99% of the data set is “current” or operational in nature.

when would a star-schema out-perform a data vault?
in situations where the rdbms engine is not tuned properly for mpp, in situations where clustered hardware or numa architectures are used, in situations where the rdbms does not/cannot execute parallel query, and sometimes when the rdbms offers “star-join” core engine optimizations... but as your data grows, it will outgrow these architectures, and it will be forced into the mpp arena. then, the data vault will out-shine other data modeling techniques because it is designed for mpp.

does this mean i have to have a large data set to use the data vault?
no, absolutely not. there are a number of benefits to using data vault modeling in other, smaller situations – and yes, they require physically moving data from the data vault to star schemas to get the query performance needed in a business analysis situation.

anyhow, i’m getting beyond this post into another blog post… watch this space, and as always – if you have questions please ask!


One Response to “Data Vault has Many Joins: Part 1”

  1. marius 2010/06/28 at 6:20 am #

    Hi Dan,

    When is the second part of this article coming?

    Kind regards
