Data Vault Model & MPP Architecture

mpp – massively parallel processing systems, a beautiful sight to behold.  there used to be a contingent (maybe there still is) that believed scaling smp would outperform scaling mpp (scaling up versus scaling out).  the hp superdome is one heck of a server, and a prime example of smp (symmetric multi-processing).  the hp superdome i knew about had the capacity for 64 cpus on one motherboard with availability for 256 gb of ram.  although the specs have changed (increased), i still don’t see how it can out-scale mpp systems.

now, if you take two, three, or four superdomes and scale them up to maximum, then tether them together as a shared-nothing mpp architecture – then you have something akin to cloud resources.  of course, you have to add the software to manage all this architecture as a cloud with virtualized availability.  but the good news is: whatever you put on this configuration will be lightning fast (if it’s tuned properly).

enough talk, what parts of mpp drive the data vault model?

ok already, this is what i promised.  mpp is basically a divide and conquer architecture.  it is usually a shared-nothing set of components with a “director” which doles out the work (splits it up, sends it to each machine, then collects the results).  mpp offers the ability to truly run all processes in parallel across all machines.  homogeneous mpp machines present the exact same configuration (hardware and software) to the director.  heterogeneous mpp includes all kinds of different machines (even potentially in different geographical locations).  for instance, a homogeneous mpp architecture would house 4 or 5 of the same hp superdomes that i started the discussion with.  a heterogeneous mpp system (or a cloud system) may include some windows servers, linux servers, unix servers, and maybe even a mainframe or two.

mpp’s divide and conquer is really a strategy for splitting up the work to be done.  rather than doing the work in serial, the work gets done in parallel (all the time).  the work could be a database query, database sorting, loading the database, or it could even be partitioning the data, re-organizing the disk, defragging the disk, and so on.  usually parallel processing is just that: what was a “single process request” becomes a task split across the resources.  if the “answer” requested by the user is a “single answer”, the director waits until all parts finish their tasks and provide a result.  it is at that point that the results are collated back into a single stream and passed back to the requestor (or user).
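the director pattern above can be sketched in a few lines of python.  this is a toy model, not any vendor’s implementation: the node names and the summing workload are assumptions made purely for illustration.

```python
# a minimal sketch of the "director": split a request into per-node
# tasks, run them in parallel, wait for the slowest, collate the result.
from concurrent.futures import ThreadPoolExecutor

def node_task(node_name, rows):
    # each node works only on its own slice of the data
    return sum(rows)

def director(rows, nodes=("a1", "b2", "c3")):
    # divide: deal rows round-robin across the nodes
    slices = {n: rows[i::len(nodes)] for i, n in enumerate(nodes)}
    # conquer: every node runs its task in parallel
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda n: node_task(n, slices[n]), nodes)
    # collate: merge the partial results into a single answer
    return sum(partials)

print(director(list(range(100))))  # 4950 -- same answer as a serial sum
```

note that the final `sum(partials)` only completes once every node has reported in: the director is exactly as fast as its slowest machine.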

so how does this affect data modeling, especially the data vault model? 

the data vault model uses hubs, links, and satellites to vertically partition the data (divide the data set by specific column organization).  in other words, hubs, being a list of pure/unique business keys, have the following attributes:

  • shortest rows in the data vault, allowing the most rows per block, and therefore the fastest possible disk scan access
  • potentially the most unique rows in the database, making compression a moot point – but enabling horizontal partitioning (by time, or by range across the keys) to divide the data set further, so that a query request can be split into parallel accesses that only hit the parts of the disk where data is needed
  • providing the most coverage of any indexing system the database offers.  the better the coverage of the index, the faster and more efficient the indexing systems are

when related to mpp, hubs allow you to split the data sets across physical hardware without too much effort – in other words, all data belonging to “customers” can be assigned to a specific machine in the mpp stack, and that keeps all the customer data together – because the hub and its satellites go together onto that mpp node.
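a quick sketch of that placement idea: hash the hub’s business key to pick a node, so the hub row and every satellite row for that key always land on the same machine.  the node names and the use of md5 are my own assumptions for the example, not a specific vendor’s algorithm.

```python
# hypothetical hub placement: the business key alone decides the node,
# so hub + satellites for one key always collocate on one machine.
import hashlib

NODES = ["a1", "b2", "c3", "d4"]  # assumed node names

def node_for_key(business_key: str) -> str:
    # md5 gives a stable hash across runs (python's hash() is salted)
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# the customer hub row and all its satellite rows share one business
# key, so they resolve to the same node every time
print(node_for_key("CUST-0001"))
```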

links have a different functionality, and act a lot like something called a join index (provided by teradata).  i believe ibm db2 mpp edw has similar functionality, but i can’t remember what it’s called.  in architectural spaces of database engines it’s generally called a “collocated index”.  in other words, you might have two different hubs and satellites: one hub (customers) + satellites resides on system a1, and one hub (employees) + satellites resides on system b2.  when you want to join customers to employees, neither mpp node “knows” what data to retrieve without some sort of “hint” (that is, passing key rows across the mpp backbone to the a1 and b2 systems, so that the “join” can correctly identify which rows to retrieve).

suppose now that customers is 45 million records, and employees is 10,000 records (quite small, but it will work for this example).  there is a link structure between the two.  the physical location of the link table will make all the difference in the world for performance.  in this case (as with most cases in the mpp world), it is highly suggested that you place (collocate) the link structure with the smaller of the two tables: the employees table, on its node.  this allows the join engine to “match” all potential customers that have employee records first, using the “small to large” join table, i.e. the link table on the b2 system.

after which, a small number of employee rows cross over (are passed across the network) to the a1 system where the large table lives, and the join can finish executing against the customers table.
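here is a toy model of that two-step join.  the table contents and the a1/b2 layout are illustrative assumptions; the point is only the order of operations: resolve the small side through the link first, then ship the few survivors to the node holding the large table.

```python
# node b2 holds the small employees table plus the link
# (customer_key, employee_key); node a1 holds the large customers table.
employees = {"E1": "pat", "E2": "sam"}            # on b2
link = [("C1", "E1"), ("C3", "E2")]               # collocated on b2
customers = {"C1": "acme", "C2": "globex", "C3": "initech"}  # on a1

# step 1 (local on b2): drive through the link to resolve employees
matched = [(c_key, employees[e_key]) for c_key, e_key in link]

# step 2: only these few rows cross the backbone to a1, where the
# join finishes against the big customers table locally
result = [(customers[c_key], emp) for c_key, emp in matched]
print(result)  # [('acme', 'pat'), ('initech', 'sam')]
```

the 45-million-row table never moves; only the handful of matched keys travel the network.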

wait a minute, this isn’t how this works!

i know, i know!  that was a pure mpp system with no further partitioning.  you see, most mpp systems do one better than pure vertical partitioning; they want to spread the work across the machines to leverage all the computing resources.  (this is why it is best to have a homogeneous mpp system: the director can only retrieve results as fast as the slowest machine in the architecture.)  so, what do they do next to achieve an even distribution of data across machines?  they horizontally partition it, generally by proprietary hashing algorithms.  in other words, there are some customer rows on system a1 and some on system b2, some employee rows on system a1 and some on system b2 – and they usually are not collocated unless the “set of fields used to compute the hash key” is the same in both tables.

often, hashing customers the same way as employees would optimize the one single query (customers joined to employees), but would slow down all the remaining queries against the database.  so it is not common, recommended, nor even practical to hash all tables by the same sets of keys; the goal is the best average distribution across the most common sets of queries.  this is where the join index really helps.  in fact, multiple join indexes carrying hash key pointers and indexed join key values are really incredible technology.

here it begins to go against conventional wisdom: it is recommended (because of the multi-partitioning going on) that you actually hash the join index according to the “larger” of the two tables, so that the large data set never leaves the mpp node it lives on unless it has a matching row on the other side.  (please correct me if i’m wrong, i’ve been wrong before, and this wouldn’t be the first time nor the last time for my mistakes…)
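a sketch of that recommendation, under heavy simplifying assumptions (two nodes, and a byte-sum standing in for a proprietary hashing algorithm): place each link row on the node chosen by the hash of its large-table (customer) key, so the big rows stay put and only small employee rows ever cross the backbone.

```python
# hash-distribute rows across nodes; link rows follow the customer key.
NODES = 2

def node_of(key: str) -> int:
    # stand-in for a proprietary hashing algorithm (assumption)
    return sum(key.encode()) % NODES

link = [("C1", "E1"), ("C1", "E2")]  # (customer_key, employee_key)

# each link row lives on the node of its *customer* key, so the large
# customer rows never move; employees ship over only when needed
placement = {pair: node_of(pair[0]) for pair in link}
for (c, e), node in placement.items():
    local = node_of(e) == node
    print(f"link ({c},{e}) on node {node}; employee {e} is "
          f"{'already local' if local else 'shipped across the backbone'}")
```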

again, what does this have to do with the data vault?

well – quite literally nothing (joking), but wait!  there’s more!  actually, it has everything to do with the data vault, because the data vault model is based on this principle: it enables horizontal and vertical partitioning of the data to occur on systems that are not mpp based, allowing mpp simulation to happen at the lowest levels of database parallelism.  did you miss that?  the mathematics of scale behind the mpp architecture, and the principles and applied math behind the performance of mpp scalability, are inherited by the data vault architecture even on non-mpp systems.  by using the hubs to isolate and hash data locations, and using the links as a join index to hash and collocate the data sets large and small, you can apply an mpp design on an smp system.

in other words, you can use mysql, sql server, or other database engines to make this completely possible, fast, efficient, and scalable.  i’ll cover more on this in another post.

what about the satellites?  where do they fit in?

anyone who’s studied indexing and query performance knows that non-sparse indexes (maximum coverage of the data) are what you want.  repetitive data sets cause all kinds of performance problems for most database engines.  it’s the repeating data, along with the wide row sets, that causes i/os to rise, table scans to rise, and performance to go down the tubes.  this is one of the reasons why denormalization of tables (reducing the number of joins down to single table scans) has an upper limit.  in other words, once you denormalize too far (make your dimension have too many columns), performance begins to take a nose-dive.

so again, the data vault, how does this apply and why mpp?

well, let me put it this way: the first thing you should be doing in any good data warehouse (data vault or not) is turning on column and page (or row) compression – especially for satellites in the data vault.  this greatly reduces “repeated values” and increases index coverage dramatically, which improves performance of the queries over ever-growing data sets.  the catch?  it slows down load processing times – nothing we can do about that today.  that has to be solved by hardware manufacturers using ram caches and smart algorithms.  anyhow, the second thing you can do with a satellite goes back to partitioning principles, thus applying mpp principle #1: divide and conquer.  basically, split the satellites by either rate of change (of correlated data sets) or by type/classification of data (one class might be by field data type, one class might be by “function” or business definition), and so on.

by partitioning or splitting the satellites you do introduce more joins, but at the same time you also increase the potential for parallel queries to be constructed to do the work.  then it’s a matter of tuning the hardware, the i/o channels, and the data placement underneath.
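the rate-of-change split above can be shown with a tiny example.  the column names and change rates are assumptions for illustration: one wide satellite repeats a slow-changing address on every status delta, while two split satellites store each change only once.

```python
# one wide satellite: address never changes, status flips every load
wide_rows = [
    # (hub_key, load_date,  address,      status)
    ("C1", "2024-01-01", "12 main st", "active"),
    ("C1", "2024-01-02", "12 main st", "dormant"),
    ("C1", "2024-01-03", "12 main st", "active"),
]

def delta_only(rows):
    # keep a row only when its payload (everything after key + date)
    # actually changed since the previous row
    out, last = [], None
    for row in rows:
        if row[2:] != last:
            out.append(row)
            last = row[2:]
    return out

# split by rate of change: slow-moving address vs fast-moving status
sat_address = delta_only([(k, d, a) for k, d, a, _ in wide_rows])
sat_status = delta_only([(k, d, s) for k, d, _, s in wide_rows])

print(len(sat_address), len(sat_status))  # 1 3 -- the address stops repeating
```

the wide satellite has to store the address three times; after the split it is stored once, and each satellite can be scanned (or loaded) by its own parallel query.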

what do you think?  does this help you?  i’ve spent 10 years in research and design of the data vault to address scalability and performance… please let me know if i’ve forgotten something.

oh yeah, shameless plug: i do offer mpp systems design and architectural consulting – along with optimization of your current environment (even if it isn’t mpp).

dan linstedt
