if you’re like me, you’re always wondering how the next “big thing” will fit into the world we have; in other words – how do i maintain my current investment in data warehousing, but add to it? in this case, adding unstructured data is the mechanism in question. well, i’ve had many sessions with bill inmon and others to discuss this very topic. there is in fact an answer. as always – there are some moving parts and pieces, but nothing that can’t be overcome. i’ll try to dispel the myths here, uncover some truths, and attempt to share a how-to lesson on getting unstructured data to work in the structured world.
if you’ve not heard of the data vault model, please read the basics of the data vault for a better description to get an idea. in this post, when i refer to data warehouse, i am specifically referring to a data vault modelled data warehouse.
there are some assumptions about unstructured data (ud) that we need to go through first:
- ud must be mined for content and context; it’s the results of that mining that are important to hook into the structured world
- ud can take many forms, and there’s an argument about what is and what is not ud. my meaning of ud is as follows: word docs, excel spreadsheets (semi-structured), e-mails, images (jpg, png, gif, etc…), text documents, movies, and audio files.
- ud should remain in the file system; putting it in a data warehouse is slow and cumbersome (it can be done if you have the time).
- it’s the results of the data mining/statistical analysis that are important to align to the structured data world; being able to “query” the ud in combination with the structured data is very important.
now that we have that out of the way, let’s discuss ud.
ud by nature is raw data, and data mining or “analysis” of the ud to arrive at the results or statistics that will be placed in the structured world is equivalent to business rules.
why business rules? because if you use a mining algorithm, you are not guaranteed the same result set for two different passes over the ud. the results may be close, but they won’t be the same, even if the ud hasn’t changed.
so the first problem is: how do you assign a “business key” to the ud you are surfing through?
some algorithms use a checksum, which is a good start for indexing purposes. in most cases, i would suggest a good business key is the full title of the ud plus the version number (or date of production/modification).
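to make the key idea concrete, here’s a minimal sketch (the function names and key format are my assumptions, not any vendor api) of building a business key from title + version, with a checksum kept alongside for change detection rather than as the key itself:

```python
import hashlib
from pathlib import Path

def business_key(title: str, version: str) -> str:
    # the durable, human-meaningful key: full title plus version
    return f"{title.strip().lower()}::{version}"

def content_checksum(path: Path) -> str:
    # a checksum detects content drift, but it changes on every edit,
    # so it supplements the business key rather than replacing it
    return hashlib.sha256(path.read_bytes()).hexdigest()

key = business_key("Pipeline Incident Report", "v2-2011-03-01")
print(key)  # pipeline incident report::v2-2011-03-01
```

the point of the split: two mining passes may disagree on results, but the business key stays stable as long as title and version do.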
so you’re saying right off the bat that by putting ud into a raw data vault, you are passing the ud through business rules? yes, that’s exactly what i’m saying. doesn’t that break auditability? no – the source document is the “version of the facts”. what’s important at this point is that the server processing the ud stores a full copy of the data, or at least that the ud lives in a single version control system where it can be backed up and restored.
so should we put it in a raw data vault or a business data vault downstream?
its real fit is in the business vault, but if all you have is a raw data warehouse, then the results must be marked by “source-system” just like the rest of the dv. from there, it’s linked into the rest of the data warehouse using many-to-many relations (link tables). these links can & should be dynamic (in other words, this is where the technology hasn’t gotten to yet)… the links should be formulated by algorithms which evaluate query capability. why? because the results have to be pivoted.
hold on, there’s one more reason… the ud mining engine is the business application of the unstructured data world – which means it’s the equivalent of the oltp app sitting on the structured side, which in turn means the ud it sources is the system of record (which is why it’s important to store the source in a version control system).
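the many-to-many hookup described above can be sketched as ordinary hub-and-link tables; here’s a toy version in sqlite (all table and column names are my assumptions for illustration, not a prescribed standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hub for the unstructured documents (business key = title + version)
    CREATE TABLE hub_document (
        doc_hkey TEXT PRIMARY KEY,
        doc_title TEXT, doc_version TEXT,
        record_source TEXT   -- e.g. the mining engine, per dv convention
    );
    -- hub for a structured business concept, e.g. an oil rig
    CREATE TABLE hub_rig (
        rig_hkey TEXT PRIMARY KEY,
        rig_id TEXT, record_source TEXT
    );
    -- many-to-many link tying documents to rigs
    CREATE TABLE link_document_rig (
        doc_hkey TEXT REFERENCES hub_document,
        rig_hkey TEXT REFERENCES hub_rig,
        record_source TEXT,
        PRIMARY KEY (doc_hkey, rig_hkey)
    );
""")
conn.execute("INSERT INTO hub_document VALUES ('d1','pipeline incident report','v2','mining-engine')")
conn.execute("INSERT INTO hub_rig VALUES ('r1','RIG-042','erp')")
conn.execute("INSERT INTO link_document_rig VALUES ('d1','r1','mining-engine')")
rows = conn.execute("""
    SELECT d.doc_title, r.rig_id
    FROM link_document_rig l
    JOIN hub_document d ON d.doc_hkey = l.doc_hkey
    JOIN hub_rig r ON r.rig_hkey = l.rig_hkey
""").fetchall()
print(rows)  # [('pipeline incident report', 'RIG-042')]
```

because the link is its own table, it can be dropped and rebuilt as the mining results change without touching either hub – the property the dynamic linking discussion depends on.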
what should the results of ud mining look like?
in my opinion, they should contain the following: document name & title, location of source, discovered context, raw term, correlation rating to context, strength rating to context, exact position within the document, and possibly a few other key notions. the mining engine should be capable of “clustering” terms together to form an idea, a context. these contexts provide clues to the discussions or themes present in the ud.
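as a sketch, one mining-result record could look like this (the field names and ratings are my assumptions, drawn from the list above):

```python
from dataclasses import dataclass

@dataclass
class MiningResult:
    doc_name: str            # document name & title
    source_location: str     # where the source file lives
    context: str             # discovered / clustered context
    raw_term: str            # the exact term found in the document
    correlation: float       # correlation rating to the context (0..1 assumed)
    strength: float          # strength rating to the context (0..1 assumed)
    position: int            # exact character offset within the document

r = MiningResult("pipeline incident report", "/docs/reports/pir_v2.doc",
                 "pipeline failure", "ruptured", 0.92, 0.81, 1180)
print(r.context, r.raw_term)  # pipeline failure ruptured
```

note it’s the clustered context, not the raw document, that gets hooked into the structured side.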
by now you are all thinking: gee, a smart search engine should do the trick…
wrong. a search engine is an indexing component, and while the results/relevancy of search engines do good work, they only solve 1/8th of the problem (if that). the ud processing engine will actually be able to interpret results in multiple languages; it will be able to use synonyms, homonyms, and antonyms to understand the context of a sentence (what idea the sentence is discussing). it will be able to pin-point the nature of the discussion with close proximity.
in other words, think about this: how many ways can you re-state the same idea using this sentence?
- the oil rig suffered a huge setback when the pipeline feeding the well ruptured from corrosion of the skin.
- when the pipeline ruptured, the oil rig stopped production; the pipeline was corrupted.
- the engineers noticed a ruptured pipeline after the oil rig’s production stopped; after further investigation they found the pipeline had corroded skin.
- the line carrying oil began to leak, causing impacts to production of oil at a processing plant.
all four sentences are about a couple of ideas (centered on an oil rig and a pipeline rupture). now, imagine these sentences worded properly in 15 different languages – and how many different variations of words can be used to represent the same idea.
now, a search engine might tie together the first three sentences. the last one is the “same” in context, but a search engine doesn’t relate “processing plant” to “oil rig”, and it doesn’t relate the idea of “oil…leak” to “pipeline…ruptured”. this is where the ud processing engine + ontologies and data mining come into play. it must “process” the ud to associate context without knowing what the question will be from the analyst’s point of view! the only thing the ud tool should recognize is that these documents are about a particular industry vertical, so common terms from that vertical will be used in the ud to describe certain conditions.
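here’s a toy illustration of why ontology-driven matching catches what keyword search misses (the concept map below is invented for this example): terms are normalized to shared concepts before comparison, so the first and fourth sentences match even with no words in common.

```python
# invented mini-ontology: surface term -> shared concept
concepts = {
    "oil rig": "production-facility", "processing plant": "production-facility",
    "pipeline": "pipeline", "line carrying oil": "pipeline",
    "ruptured": "failure", "leak": "failure", "corroded": "failure",
}

def contexts(sentence: str) -> set:
    # collect the concepts whose surface terms appear in the sentence
    s = sentence.lower()
    return {concept for term, concept in concepts.items() if term in s}

s1 = "the oil rig suffered a huge setback when the pipeline feeding the well ruptured"
s4 = "the line carrying oil began to leak, causing impacts at a processing plant"
# a keyword search finds no shared words, but the concepts overlap fully
print(sorted(contexts(s1) & contexts(s4)))
# ['failure', 'pipeline', 'production-facility']
```

a real engine would use nlp, stemming, and per-language ontologies instead of substring matching, but the normalization step is the same idea.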
what do the results of the ud look like?
they should be a modified key-value-pair-based solution set, coupled with statistical or mining analysis (sometimes flat-wide tables). this allows the “key” to be the interpreted meaning or contextual label of the ud, and the “value” to be the word used in the document. there are many other pieces that come into play as well, but the point is: the data is “long” instead of horizontal.
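a quick sketch of that “long” shape (column names are my assumptions): one row per discovered term, so new contexts never force a schema change the way new columns in a wide table would.

```python
rows = [
    # (doc business key, context label = "key", word in document = "value")
    ("pir::v2", "failure-mode", "ruptured"),
    ("pir::v2", "failure-mode", "corroded"),
    ("pir::v2", "asset", "oil rig"),
    ("maint-log::v7", "asset", "processing plant"),
]
# pivoting the long data by context is what the dynamic links later rely on
by_context = {}
for doc, context, value in rows:
    by_context.setdefault(context, []).append(value)
print(by_context["failure-mode"])  # ['ruptured', 'corroded']
```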
how does the bi tool make use of this?
in its raw key-value format, it can’t. bi queries are typically looking for specific set terms (like “pipeline explosion” or “oil rig” or “damaged pipeline”), and they typically query in a single language (without regard to russian, english, french, etc…), so the mining engine must do language translation and commonalization of the ud terminology/context to make it work in native queries.
next, the structure must be fluid (because ontologies can be very large for each industry vertical), and when queried it can produce cartesian results against the key-value pair table. so… the query must be analyzed for the “appropriate context”, and a link that identifies the part of the ontology to be used must be built on the fly. then the link can join the appropriate part of the data vault model to the appropriate terms in the key-value pairs, which then return the appropriate ud pointers to the business user.
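the on-the-fly step can be sketched roughly like this (the ontology, result rows, and function name are all assumptions for illustration): the incoming query is mined for its context, and only the matching slice of the ontology is used to join to the key-value results.

```python
ontology = {  # context label -> terms that express it
    "pipeline failure": {"ruptured", "leak", "explosion"},
}
kv_results = [  # (doc business key, context label, raw term)
    ("pir::v2", "pipeline failure", "ruptured"),
    ("manual::v1", "maintenance", "inspection"),
]

def dynamic_link(query: str) -> list:
    # step 1: mine the query itself for its context
    matched = {ctx for ctx, terms in ontology.items()
               if any(t in query.lower() for t in terms)}
    # step 2: build the link on the fly – only kv rows in that context join,
    # avoiding a cartesian product against the whole key-value table
    return [doc for doc, ctx, _ in kv_results if ctx in matched]

print(dynamic_link("show documents about the pipeline explosion"))
# ['pir::v2']
```

note the business user asked about an “explosion” and still got the document that says “ruptured” – the ontology slice, not the literal term, drives the join.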
you make it sound easy…
it’s not. the dynamic linking component is the hardest, because it too requires a “mining engine” – one that mines queries and associates content to each query. furthermore, this process always takes time to accomplish; it’s not the 2-second or millisecond response time you are used to. however, once the “link” has been built, it can stay in place until the ud results are updated or the ontology changes.
this is the nature of getting ud into a structured world and having it make sense to the business user. it’s a set of processes, not a single process step.
what’s the relationship to the data vault?
it should be easy to guess now… the dynamic nature of creating and dropping links in the model as needed makes the dv model the perfect candidate for this venture. also, because the links can be built from / combined across multiple business keys, they can help establish/support the contextual ratings and terms being queried from the business side to the ud side.
i’ve got some quirky pictures that explain this process; i’ll try to publish them in a video on youtube soon.
what are your thoughts on unstructured data? ud and the data vault? how about search engines vs mining engines for ud?
love to hear what you’re thinking…