XML has been around for quite a while now, and it keeps improving in features and functionality. I’ve blogged over the years about using XML inside database engines – and about the need to embed this functionality directly in the core. Oracle, I am happy to say, has a wonderful solution to the problem.
Oracle has a component called XDB (XML DB), which appears to build on the “file attach” functionality that has been available for a very long time. XDB has several features that make it a prime target / easy mechanism for use as the Data Vault staging area – particularly if you are dealing with XML data sets and you already have Oracle on-site as a database.
XML is inherently thorny (as are XSDs, DTDs, and the like). It is riddled with hierarchies, empty data sets, text-based tagging, recursive relationships, forward/backward references, optional structures, and so on. It’s a lot like the old hierarchical structures defined by COBOL record layouts; the difference is that XML is much more fluid – as you know, it can change at any time. That volatility makes ingesting XML into your data warehouse a tall order.
So, what’s the best way to deal with it?
I don’t know about the *best* way, but I do know about leveraging native database technology rather than a third-party ETL/ELT tool to accomplish the task. I strongly suggest that any time your data warehouse has to ingest XML, you establish an agreement with the customer: they *must* provide an XSD (XML Schema Definition). Once you have the XSD (even if it changes over time), processing and validation become that much easier. Without the XSD you are left with arcane hand-coding to decipher the XML into something usable – very costly and very time consuming – or you use an ETL tool and hard-code the structure into the tool, making it difficult (if not impossible) to adapt to future changes.
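To make the payoff concrete: once an XSD is registered in the database, validating an incoming document becomes a one-liner instead of hand-coded parsing. A sketch in Oracle terms – the schema URL and the landing table `inbound_xml_raw` are illustrative assumptions, not from any particular system:

```sql
-- Sketch: validate raw XML text against a previously registered XSD.
-- The schema URL and table name below are hypothetical.
SELECT CASE XMLIsValid(XMLType(r.doc_text),
                       'http://example.com/schemas/orders.xsd')
         WHEN 1 THEN 'valid'
         ELSE 'rejected'
       END AS validation_result
FROM   inbound_xml_raw r;
```

Documents that fail validation can be routed to an error queue up front, before they ever touch the staging structures.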
So, the recommendation?
Find yourself a good database engine – Microsoft SQL Server 2008, Oracle with XDB, DB2 with its XML capabilities, whatever the engine – with the ability to: 1) absorb XSDs as definitions, 2) attach an XML file directly to the XSD (tying it to the structure), and 3) provide an SQL/XQuery interface so you can use a standard/hybrid SQL query to pull records out of the XML document directly inside the database engine.
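In Oracle/XDB terms, those three capabilities look roughly like this. This is a sketch only – the schema URL, directory object, element name, and table names are illustrative assumptions:

```sql
-- 1) Absorb the XSD: register it with the database.
BEGIN
  DBMS_XMLSCHEMA.registerSchema(
    schemaurl => 'http://example.com/schemas/orders.xsd',   -- hypothetical URL
    schemadoc => XMLType(BFILENAME('XSD_DIR', 'orders.xsd'),
                         NLS_CHARSET_ID('AL32UTF8')),
    local     => TRUE,
    gentypes  => TRUE);
END;
/

-- 2) Attach XML to the XSD: a table whose rows must conform to the schema.
CREATE TABLE orders_stage OF XMLType
  XMLSCHEMA "http://example.com/schemas/orders.xsd"
  ELEMENT "OrderSet";

-- 3) Query it relationally with SQL/XQuery via XMLTABLE.
SELECT x.order_id, x.order_total
FROM   orders_stage s,
       XMLTABLE('/OrderSet/Order' PASSING s.OBJECT_VALUE
                COLUMNS order_id    NUMBER       PATH '@id',
                        order_total NUMBER(12,2) PATH 'Total') x;
```

The schema-bound table means inserts are validated against the XSD automatically, and the XMLTABLE query gives downstream Data Vault loads an ordinary relational row source to read from.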
Well, Oracle’s XDB is impressive. It has the features needed, and it is efficient enough to handle most of the batch refresh windows we deal with. We are using the structured (schema-based) storage option to make this work, and I am pleased as punch with how the database absorbs XSD changes and XML files. The good part is that we receive hundreds of XML files in the batch cycle, but none of them are very large by themselves. That allows us to process/tie/attach these files in parallel, and each one processes very quickly.
Voila – our staging area is loaded and ready to go, simply by dropping an XML file into a specific directory. Then, once we’ve moved the data from the XDB staging to the Data Vault, we simply move the XML files from the “staging directory” to the “archive directory,” where they are automatically picked up for backup, leaving the “staging structures” in the database empty. No need to manage indexes, partitioning, or even truncation.
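One way to wire that flow up is with directory objects and BFILE-based inserts – a sketch only; the paths, file name, and the `orders_stage` table are illustrative, and an actual setup like the one described here may instead use the XDB repository with FTP/WebDAV drops:

```sql
-- Illustrative directory objects for the staging and archive folders.
CREATE DIRECTORY xml_stage_dir   AS '/etl/xml/staging';
CREATE DIRECTORY xml_archive_dir AS '/etl/xml/archive';

-- Attach a dropped file to the schema-bound staging table
-- (orders_stage is a hypothetical XMLType table bound to a registered XSD).
INSERT INTO orders_stage
  VALUES (XMLType(BFILENAME('XML_STAGE_DIR', 'orders_batch_001.xml'),
                  NLS_CHARSET_ID('AL32UTF8')));

-- After the Data Vault load, the batch script (outside the database)
-- moves the file from /etl/xml/staging to /etl/xml/archive, where the
-- normal file-system backup picks it up.
```

Because the files are small and independent, many of these inserts can run in parallel sessions, one per file.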
I applaud Oracle for such an elegant solution. If you have Oracle licensing in-house and you are dealing with XML/XSD, then I suggest you check out their XDB options. Here’s a link to some cool information.