this entry discusses the changing market with regard to big data and unstructured data. the growing use and storage of semi-structured and unstructured data is, of course, part of the lead-in to big data. i will give an executive overview of where i think the market is going this year, what i think you should know as an executive, and why you should focus your talent and efforts in these areas. the data vault model and methodology is clearly a solution, but only part of the solution: there is much more to it than just a data model!
how can i deal with unstructured data? how do i get that information into my warehouse?
then, once it’s in, how do i hook it together with the structured data? how do i deal with the data explosion?
defining unstructured or semi-structured data
well, one could argue that all data is structured: it all reduces to ones and zeros, which are grouped into bytes, and artificial or logical structures are placed on top of those. music files (mp3, wav, and others) all have a “specific structure” that they adhere to. so do pict, jpeg, mpeg, mp4, m4v, bmp and so on, as do doc, docx, xlsx, xls, and many other types. so is it really fair to call it “unstructured data”? well, for the sake of discussion, i think yes.
why? because the content that defines the context of the data is variable in both form and function. take this blog post, for example: the content is the words and sentences that i use, and the context is defined by the entire post (the overall theme or major points i am attempting to put across).
when we compare this with structured data, we find specific columns defined by metadata or logical names, like customer account number, which are supposed to house specific content and are sometimes further defined by a format mask or even a boundary constraint. structured data is therefore usually much easier to understand and to represent in a semi-static manner (meaning that the structure, once defined, stays static for longer periods of time).
in the unstructured world, the content can shift every time an edit is made. in fact, the context can also shift, as can the visible structure of the document. in other words, i can change the sentence order, headings, words, and even the meaning of any document, email, or even music and images simply by editing them. compare that with customer account number, where i can change the content but the defined meaning is supposed to stay the same. note that i am setting aside field or column overloading here, which clearly changes the context of the field based on the content.
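to make the contrast concrete, here is a minimal sketch in python. the format mask (two letters, a dash, six digits), the sample row, and the sample sentence are all assumptions made up for illustration; they do not come from any real system.

```python
import re

# hypothetical format mask for a customer account number:
# two uppercase letters, a dash, then six digits (an assumption for this sketch)
ACCOUNT_MASK = re.compile(r"[A-Z]{2}-\d{6}")


def is_valid_account_number(value: str) -> bool:
    """return True when the whole value matches the assumed format mask."""
    return ACCOUNT_MASK.fullmatch(value) is not None


# structured: the column name tells us what the content is supposed to be
structured_row = {"customer_account_number": "AB-123456"}

# unstructured: the same fact is buried in free text with no fixed position
unstructured_doc = "spoke with the customer about account AB-123456; she may close it."

print(is_valid_account_number(structured_row["customer_account_number"]))
print(is_valid_account_number(unstructured_doc))

# the account number is still in the text, but it has to be mined out
print(ACCOUNT_MASK.findall(unstructured_doc))
```

the structured value can be checked against its mask directly, while the free text fails the same check even though it contains the account number. that gap is what the mining step has to close.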
so what does this all add up to for me in business?
so if you’re an executive in business, you’re probably reading this and thinking: gee, that’s a lot of technical jargon, what does it all mean? or perhaps you understand what it all means and want to know how to solve the problems. so what exactly are the problems you face when trying to make use of unstructured or semi-structured data?
- establishing an understanding of context
- tying multiple contexts together (those that are related are what’s truly important here)
- enriching the related contexts with additional structured data sets (or the other way around)
technically, what does this mean? what am i faced with?
well, there are many technical challenges. i’ve already named some of the top ones, but here’s a list nonetheless:
- dealing with exploding data sets – huge volume / explosive volume
- moving the results of unstructured mining into a structured world (perhaps)
- capturing the full unstructured content – then relating it through use of xml databases, triple stores, or key-value stores
- attaching the unstructured data sets to the structured data sets through dynamically allocated structures
all the while, you must ensure that your team remains agile and that your existing data integration (be it a data warehouse, an ods, or an alternative store) remains consistent and up to date. you must make sure your business changes can be, and are, reflected in the existing system in a timely manner, and keep your systems running at peak performance so that all these data are usable by the business.
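as a rough illustration of the "relating it through triple stores" item in the list above, here is a toy in-memory triple store in python. the class name, the identifiers such as doc:email-17, and the predicates are hypothetical; a real triple store would add persistence, indexing, and a query language on top of this idea.

```python
from collections import namedtuple

# a triple is simply (subject, predicate, object)
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


class TinyTripleStore:
    """toy in-memory triple store, for illustration only."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = TinyTripleStore()
# hypothetical facts produced by mining two documents
store.add("doc:email-17", "mentions", "customer:AB-123456")
store.add("doc:email-17", "hasTopic", "billing dispute")
store.add("doc:memo-03", "mentions", "customer:AB-123456")

# which documents mention this customer?
docs = [t.subject for t in store.match(predicate="mentions", obj="customer:AB-123456")]
print(docs)
```

the point of the sketch is that the unstructured documents stay whole, while small relationship facts about them become queryable and can point at keys from the structured world.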
so, what do i need to know?
well, the answers are simple – getting there is not so simple. but i will do my best this year to try to explain the proper steps to getting to a successful implementation in this effort – so stick with me, and if you disagree with something i’ve proposed, please comment and provide me with feedback.
ok, first, as an executive, there are several terms and technologies you should familiarize yourself with:
- rdf, owl, and semantic web
- ontologies, glossaries, and business terminology with data governance
- data vault modeling, data vault methodology
- unstructured data mining
- xml data stores, triple stores, key-value stores, xpath and xquery
- and finally: context, and content and inference engines
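for readers who have not seen xpath before, here is a small sketch using python's standard library, which supports a limited subset of xpath. the xml layout, element names, and values are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# hypothetical document store exported as xml (layout is an assumption)
XML = """
<documents>
  <doc id="17" type="email"><subject>billing dispute</subject><account>AB-123456</account></doc>
  <doc id="03" type="memo"><subject>renewal terms</subject><account>AB-123456</account></doc>
  <doc id="21" type="email"><subject>new address</subject><account>CD-000042</account></doc>
</documents>
"""

root = ET.fromstring(XML)

# xpath-style query: subjects of every email
email_subjects = [s.text for s in root.findall("./doc[@type='email']/subject")]
print(email_subjects)

# xpath-style query: ids of every document whose account element has this text
doc_ids = [d.get("id") for d in root.findall("./doc[account='AB-123456']")]
print(doc_ids)
```

a dedicated xml data store with full xpath and xquery support goes much further than this, but the idea of addressing content by its path and attributes is the same.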
there are several other secrets from a technical standpoint that will make your world rock with success!
you see, i’ve been teaching big data and huge data warehousing for years. i’ve also been teaching big data systems performance and tuning for about as long, maybe 22 to 25 years or so. i’ve collected a lot of knowledge and seen a lot of things, both done right and done wrong. so from a technical standpoint i will tell you this:
- invest in new infrastructure components – more specifically: ssd or ram drives for your critical or hot data assets
- invest in cpu or computational power – so you can either turn on compression or buy devices where compression is built in.
- invest in xml storage technologies and train your resources to deal with the technical components listed above! (get familiar with protégé and xml data stores)
- invest in a good structured data set data architecture (yes, the data vault model meets these needs)
these investments matter because they solve performance problems with very big data sets. i’m talking about performance improvements anywhere from 47% to 70%, and even up to 700%. i’m talking about a reduction in storage needs of over 70%, fewer i/o requests, and less bandwidth. but i’m also talking about hooking in business assets and business glossary metadata, which applies context to the unstructured world.
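to show why spare cpu can be traded for storage and i/o, here is a small python sketch that compresses a block of repetitive, made-up rows with zlib. the row layout is an assumption, and the saving it prints applies only to this synthetic data; it is not meant to reproduce the figures quoted above, and real results depend heavily on the data and the compression method.

```python
import zlib

# hypothetical, highly repetitive structured rows (made up for illustration)
rows = [f"{i},AB-{i % 50:06d},active,2024-01-01\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

# spend cpu cycles to shrink what must be stored and moved
compressed = zlib.compress(raw, 6)
restored = zlib.decompress(compressed)

saving = 1 - len(compressed) / len(raw)
print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"storage saving:   {saving:.0%}")
```

fewer bytes on disk also means fewer bytes read per query, which is where the i/o and bandwidth reductions come from.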
how do i do this? (at an executive level)
step 1: establish control over your structured database world, integrate your disparate silos, and leverage the power of the data vault model & methodology to boost the agility of your it team
step 2: evaluate textual mining engines and their ability to provide context related answer sets spanning multiple languages – in preparation for next year!
step 3: invest in infrastructure as suggested above
step 4: train your work-force on xml, rdf, owl, business glossaries, ontologies and data governance
step 5: institute some policies around the items in step 4
step 6: hire a great consultant (as an adviser) who can help you select technology, and architect an unstructured data solution for you
step 7: move your unstructured data sets to xml if you can, converting them to owl ontologies and rdf data stores; then take the results of a data mining engine and place them in a structured database following the data vault modeling techniques
step 8: use inference engines across ontologies to link common terms together, establishing dynamic and ever-changing linking tables between the structured world and unstructured data sets
in other words, use a combination of physical data vault link tables (set with confidence and strength ratings for similarity matches), xpath/xquery results against the raw documents, and standard sql queries against the structured world to produce incredibly well associated and well defined results.
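the combination described above can be sketched in python with sqlite and the standard xml library. everything here is an assumption for illustration: the table and column names, the confidence values, and the sample documents are invented, and the link rows are inserted by hand where a real solution would load them from a text mining or inference engine.

```python
import sqlite3
import xml.etree.ElementTree as ET

# hypothetical raw documents kept in xml form
RAW_DOCS = """<documents>
  <doc id="17"><body>customer AB-123456 disputes the march invoice</body></doc>
  <doc id="03"><body>renewal terms sent to acct ab 123456</body></doc>
</documents>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_key INTEGER PRIMARY KEY, account_number TEXT UNIQUE);
CREATE TABLE hub_document (document_key INTEGER PRIMARY KEY, doc_id TEXT UNIQUE);
-- link table carrying a confidence rating for each similarity match
CREATE TABLE link_customer_document (
    customer_key INTEGER REFERENCES hub_customer(customer_key),
    document_key INTEGER REFERENCES hub_document(document_key),
    confidence REAL,
    load_date TEXT DEFAULT CURRENT_TIMESTAMP,
    record_source TEXT
);
INSERT INTO hub_customer VALUES (1, 'AB-123456');
INSERT INTO hub_document VALUES (1, '17'), (2, '03');
-- assumed output of a mining engine: an exact match and a fuzzy match
INSERT INTO link_customer_document (customer_key, document_key, confidence, record_source)
VALUES (1, 1, 0.98, 'text-mining'), (1, 2, 0.61, 'text-mining');
""")


def documents_for(account_number, min_confidence):
    """sql finds the linked documents; an xpath-style lookup fetches their text."""
    rows = conn.execute(
        """SELECT d.doc_id, l.confidence
           FROM hub_customer c
           JOIN link_customer_document l ON l.customer_key = c.customer_key
           JOIN hub_document d ON d.document_key = l.document_key
           WHERE c.account_number = ? AND l.confidence >= ?
           ORDER BY l.confidence DESC""",
        (account_number, min_confidence),
    ).fetchall()
    root = ET.fromstring(RAW_DOCS)
    results = []
    for doc_id, confidence in rows:
        body = root.find(f"./doc[@id='{doc_id}']/body")
        results.append((doc_id, confidence, body.text))
    return results


print(documents_for("AB-123456", 0.9))
print(documents_for("AB-123456", 0.5))
```

raising or lowering the confidence threshold changes how many documents come back, which is one way the business can decide how much fuzziness it will accept in the association between the two worlds.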
i believe this is your path to success with these technologies, and of course i can help you get there if you like. if you’d like to simply chat with me, pick up the phone and call me at 802-524-8566. i’d be happy to discuss these things with you.
what are the benefits of all of this?
well, from an executive level you probably are feeling pressure to:
a) integrate your existing structured silo solutions
b) make the existing it team more agile
c) begin incorporating unstructured data sets
d) establish a data governance program
e) establish data as an asset on the corporate books
f) find a way to integrate unstructured data sets
g) manage the ever-growing data explosion while still providing good response times
h) manage more and more xml data interchanges in business to business transactions
so the benefits are many, and of course they all depend on how much effort you put into following the instructions i laid out here. i hope to hear from you, either by phone, email, or comment on this blog entry. tell me what you see, or tell me if i’m way off base! i’d like to know!
hope this has been helpful,
ps: you can learn more about the data vault model and methodology at: http://datavaultalliance.com
or call me at: 802-524-8566 (9am to 5pm eastern standard time)