Public Sector Data Vault
The Data Vault has many uses in the public sector, ranging from homeland security and email capture to dynamic data integration at very large scale. It is currently deployed in several public sector ventures, where it has been highly successful.
The following topics are covered on this site:
- Beyond Standard Data Warehousing
- The Needs of the Government
- Dynamic Structure Integration
- Massive Volumes in Near Real Time
- Data Mining Relevancy Ratings
- Classified and Protected Information Stores
- Benefits to the Public Sector
- SEI / CMM and Compliance
- Learning Models, Nanotechnology
Beyond Standard Data Warehousing
In government, needs extend well beyond standard data warehousing. There are also more stringent requirements to meet, audits to pass, and classified information to hold. The standard data models (third normal form and star schema) don’t always meet these needs. The GIF (Government Information Factory) was created as a framework and provides guiding principles that help architects begin to meet those needs. The Data Vault is an architecture that provides the integration “how-to” standards needed to meet them in practice.
As practitioners, we must stretch our boundaries and build models and content that allow flexible, scalable information stores. Limits should be eliminated, apart from hardware limitations, which are artificially imposed and will be overcome with time. Thinking beyond standard data warehousing includes dynamic restructuring of the content, unlimited scalability, and the application of data mining to both content and context.
The Needs of the Government
The government (public sector) has many needs. One is to keep data stores physically separate and integrate them only during the analysis phase. Another is to integrate massive amounts of information quickly from all over the world. Others include unlimited scalability without heavy change impacts, and keeping parts of the data classified while reusing existing data models and leveraging unclassified domain information.
Because of privacy regulations, the government cannot consolidate the data into a single information store. It is only allowed to analyze the information across information stores and produce the integrated analysis, for example when airport screening is crossed with bank records and email traffic.
The Data Vault architecture allows separation of physical data, and integration of the information at analysis time.
“The government needs to connect relevant information systems so it can analyze data, share information, and ultimately predict and prevent future attacks.” (http://www.washingtontechnology.com/ad_sup/bestpractices/bestpractices5.html)
A dynamic data modeling architecture is needed behind the scenes to support flexibility, change, and discovery. The Data Vault provides for data mining of the structure and content together, to impute context and connections that previously did not exist. Paired with data mining operations, the Data Vault can therefore help discover unknown relationships across data.
The government, and any federal or international agency, has many more needs. In the interest of time, we’ve listed only a few more below:
- Security mandates for separate physical information stores, brought together only during the analysis phase.
- Single-keyed information for faster access and consistent integration.
- Provide automated monitoring and self-managing systems.
- Achieve current administration goals quickly and efficiently without “losing” or altering existing information structure and services.
- Reduction/elimination of duplicate data stores.
- Dynamic information structures to meet ever-changing business dynamics.
- Do more with less: allow smaller teams to handle bigger projects with ease.
- Reuse and extend the non-classified data and data model in a classified world.
Dynamic Structure Integration
If it were discovered that rental car reservations were important, it would be easy to add the elements to store their historical data, and then build the link to the other information in the discovery linkage. If it were discovered that phone calls had relevance to emails, or if the goal were simply to test for such relevance, a second discovery linkage and a phone call history element could be added without disturbing the rest of the information. The modeling technique is capable of dynamic addition of information without losing history. The architecture allows information to be added at any time and linked so that it can be tested for relevancy. If the linked information is not relevant (or no particular significance is discovered), the link can be dissolved without losing the history of the information.
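The rental car scenario above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (hub_email, sat_email, hub_rental, sat_rental, link_email_rental), the columns, and the sample rows are hypothetical and do not come from any deployed model. It shows a new subject area and a discovery link being added with CREATE statements only, and the link later being dissolved while the history stays in place.

```python
import sqlite3


def table_names(conn):
    """Return the set of table names currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


conn = sqlite3.connect(":memory:")

# Existing model (hypothetical): a hub holds business keys, a satellite holds history.
conn.executescript("""
    CREATE TABLE hub_email (email_key TEXT PRIMARY KEY, load_ts TEXT NOT NULL);
    CREATE TABLE sat_email (
        email_key TEXT NOT NULL REFERENCES hub_email(email_key),
        load_ts   TEXT NOT NULL,
        subject   TEXT,
        PRIMARY KEY (email_key, load_ts)
    );
    INSERT INTO hub_email VALUES ('E-1', '2024-01-01');
    INSERT INTO sat_email VALUES ('E-1', '2024-01-01', 'itinerary');
""")
emails_before = conn.execute("SELECT * FROM sat_email").fetchall()

# New subject area: added with CREATE only, so no existing table is altered.
conn.executescript("""
    CREATE TABLE hub_rental (rental_key TEXT PRIMARY KEY, load_ts TEXT NOT NULL);
    CREATE TABLE sat_rental (
        rental_key TEXT NOT NULL REFERENCES hub_rental(rental_key),
        load_ts    TEXT NOT NULL,
        vehicle    TEXT,
        PRIMARY KEY (rental_key, load_ts)
    );
    -- Discovery link under test: relates emails to rental reservations.
    CREATE TABLE link_email_rental (
        email_key  TEXT NOT NULL REFERENCES hub_email(email_key),
        rental_key TEXT NOT NULL REFERENCES hub_rental(rental_key),
        load_ts    TEXT NOT NULL,
        PRIMARY KEY (email_key, rental_key)
    );
    INSERT INTO hub_rental VALUES ('R-9', '2024-01-03');
    INSERT INTO sat_rental VALUES ('R-9', '2024-01-03', 'truck');
    INSERT INTO link_email_rental VALUES ('E-1', 'R-9', '2024-01-04');
""")

# Suppose the link proved irrelevant: dissolve it. Hubs and satellites are untouched.
conn.execute("DROP TABLE link_email_rental")

emails_after = conn.execute("SELECT * FROM sat_email").fetchall()
rentals_after = conn.execute("SELECT * FROM sat_rental").fetchall()
```

Because the relationship lives in its own table, removing it is a single statement that never touches the email or rental history.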
Additional Questions to Ask:
- What if links can establish correlations between disparate data in near-real time?
- What if each component can DISCOVER meaningful associations with other components in near-real time?
- What if you can weigh, measure, and produce a visual colored graph of important or “hot” linkages between information in near real time?
Disparate data sources such as bank records, CIA systems, FBI systems, emails, and INS systems can be integrated rapidly and easily. For instance, linking emails to bank transactions and airline ticket purchases becomes a much easier task than with previous methods. The architecture provides a strong information linking (integration) capability, and the design of the Data Vault is highly flexible. Because of the link mechanism, the granularity at which information is “joined” is no longer a constraint. The information is joined because it makes sense to the business, not because the technology says it can be.
Having the linkages allows dynamic integration of disparate data. Applying AI to the link can then discover if the link needs to be dissolved (lack of relevancy), or if a new link needs to be built between contextual elements because one doesn’t exist today and the information makes sense. From a pattern perspective, AI can discover new relationships and essentially evolve the data storage structures over time, learning which relationships make sense and which ones don’t.
Ultimately, the data model morphs into what is relevant or critical to the business, be it intelligence, security, or commercial.
Massive Volumes in Near Real Time
Each set of elements is independent of the others. This allows the frequency of loading to vary without impact across the warehouse. It also isolates the growth patterns to the necessary data. The discovery linkage is run from a data mining process that learns over time and picks up only the altered or new information. The structures in each area are part of what makes the architecture extremely powerful. The dynamic linking of the nodes makes the information flexible and scalable. The architecture is built for enterprise data warehousing, and is capable of storing massive volumes of information over time. The white papers explain how this works.
The Data Vault structures are designed to allow near real time feeds of information. They are also capable of storing massive sets of information over time in an efficient form without relying on compression techniques. For instance, suppose there is a need to capture emails in near real time, and bank transactions once every hour. The structural integrity does not break down under volume loads, nor does the integration of this information. The structural design allows the feeds to be set up as fault-tolerant, fully restartable processes.
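One way to picture independent, restartable feeds is the sketch below, written with Python's sqlite3 module. It assumes hypothetical hub tables (hub_email, hub_bank_txn) and a hypothetical load_hub helper; the real loading processes are not described on this page. Each feed writes only to its own hub, and replaying a batch after a failure inserts nothing twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_email    (email_key TEXT PRIMARY KEY, load_ts TEXT NOT NULL);
    CREATE TABLE hub_bank_txn (txn_key   TEXT PRIMARY KEY, load_ts TEXT NOT NULL);
""")


def row_count(conn, table):
    """Count rows in a table (table names here are fixed, internal values)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def load_hub(conn, table, key_column, keys, load_ts):
    """Insert only business keys not yet present; return how many were new.

    Replaying the same batch is harmless, which makes the feed restartable.
    """
    before = row_count(conn, table)
    with conn:  # one transaction per batch: commit on success, roll back on error
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} ({key_column}, load_ts) VALUES (?, ?)",
            [(key, load_ts) for key in keys],
        )
    return row_count(conn, table) - before


# The email feed runs often; the bank feed runs hourly. Neither touches the other's hub.
new_emails = load_hub(conn, "hub_email", "email_key", ["E-1", "E-2"], "2024-01-01T10:00:05")
retried = load_hub(conn, "hub_email", "email_key", ["E-1", "E-2"], "2024-01-01T10:00:09")
new_txns = load_hub(conn, "hub_bank_txn", "txn_key", ["T-1"], "2024-01-01T11:00:00")
```

Here the retried email batch reports zero new rows, so a restarted process can simply resend its last batch.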
Data Mining Relevancy Ratings
Data mining is extremely important, particularly to discover relationships among vast amounts of information. Using the Data Vault, a mining model can be built for the structural relevancy as well as the data inter-relationships. In other words, new cross-links can be “discovered” through data mining based on content of the surrounding information hubs. If data is seen to have a strong correlation, the mining tool may suggest new link structures to span the information hubs. The mining model results can also score the new relationship to assist in the interpretation of confidence levels that the relationship actually exists.
This can evolve the data model over time into a more effective and useful integration model, thus arriving at dynamic data integration, where not only the data changes, but the structure changes too. This can be monitored and guided (like a neural network), by manually approving the model changes, and/or changing the confidence level rankings that are stored within the model itself.
There is a need to discover relationships across information (using data mining) and, once they are discovered, to judge their relevancy. If the relevancy of a relationship meets a certain threshold, the relationship can be established dynamically, or the data integrated on the fly. The Data Vault provides this capability of dynamic structure adaptation. For instance, suppose the results of data mining show an 80% relevancy between rentals of cars or trucks and purchases of chemical items within a few days by a suspected terrorist. A relationship across this data will need to be dynamically constructed and analyzed to determine its impact.
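The threshold rule described above can be sketched in a few lines of Python. The class names (CandidateLink, LinkRegistry), the hub names, and the 0.75 threshold are all assumptions made for illustration; the relevancy score would come from whatever mining model is in use, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CandidateLink:
    """A relationship proposed by a mining model (hypothetical structure)."""
    left_hub: str
    right_hub: str
    relevancy: float  # score between 0.0 and 1.0


@dataclass
class LinkRegistry:
    """Keeps established links together with the score that justified them."""
    threshold: float = 0.75  # assumed cut-off, to be tuned or approved manually
    established: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)

    def consider(self, candidate):
        """Establish the link if it meets the threshold; otherwise note it."""
        if candidate.relevancy >= self.threshold:
            key = (candidate.left_hub, candidate.right_hub)
            self.established[key] = candidate.relevancy
            return True
        self.rejected.append(candidate)
        return False


registry = LinkRegistry()
strong = registry.consider(CandidateLink("hub_vehicle_rental", "hub_chemical_purchase", 0.80))
weak = registry.consider(CandidateLink("hub_vehicle_rental", "hub_library_loan", 0.20))
```

Storing the score alongside the link leaves room for a reviewer to approve the change manually or to adjust the threshold later.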
Questions that can be asked when Dynamic Data Structures can be built:
- What if relevancy can be viewed graphically across thousands of relationships?
- What if items that weren’t relevant to certain relationships could be dropped after interrogation?
- What if outlier activity appears in a discovery link across an extremely important relationship?
These functions can determine such items based on Data Vault relationships, and produce graphical images that allow thousands of components of information to be scanned visually and rapidly. Massive volumes and high levels of integration can give a data mining engine a goldmine of information. Sampling the information isn’t good enough to find the outliers that one might be looking for; it may be precisely the outliers, the pieces of information that are out of the ordinary, that are interesting to the government. The Data Vault structures are architected to help data mining tools do their work in the database. With massive volumes, it doesn’t make sense to extract the information out of the database; often there isn’t enough time to extract information simply to mine it.
By applying a combination of data mining and artificial intelligence algorithms, relationships can not only be discovered and built, but also weighed, measured, and assigned a relevancy factor. Weak relationships can be noted, stored, or destroyed. Strong relationships can provide a cluster-building foundation. Strong relevancy between contextual data can be watched closely for outliers. When the outliers hit (actions causing strange results), they can immediately be flagged for further examination. This architecture supports parallelism all the way to the data structure level.
Is this a concept of Object Oriented Databases repackaged? I don’t believe so. These elements are independent of each other and carry context. Each could be construed as a base-level object with zero inheritance; that is one possible application. However, the dynamic relationships are objects themselves that facilitate run-time reconfiguration of the data storage trees. It is not known whether this effect is achievable through object-oriented design. This design does not involve the concerns or benefits of inheritance or polymorphism.
Information does not need to be evaluated for importance before being put into the Data Vault. The human “judgment” about what is relevant and the “politics” about which agency provides it (CIA or FBI) are now a thing of the past. The information is simply entered regardless of what it is, and the Data Vault structural mining operation makes the relevant connections through the process of data mining and discovery.
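As a small illustration of flagging outliers across a full set of values rather than a sample, the sketch below uses the modified z-score based on the median absolute deviation. This technique is chosen here only as an example; the page does not say which algorithm is actually used, and the function name and sample amounts are hypothetical.

```python
from statistics import median


def flag_outliers(values, cutoff=3.5):
    """Return the positions of values whose modified z-score exceeds cutoff.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. Every value is examined; nothing is sampled.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical: anything different stands out.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Hypothetical transaction amounts observed across one strong relationship.
amounts = [100, 102, 98, 101, 99, 5000]
flagged = flag_outliers(amounts)
```

In this made-up series only the last amount is flagged, and it would then be passed on for further examination.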
Classified and Protected Information Stores
Many organizations hold classified information, and even the private or commercial sector often needs to protect data, particularly financial data. In the case of classified data, it is vital to reuse the unclassified data model and information by replicating them to the classified world, while keeping the flexibility to add only the desired components without impacting existing structure or information (content).
In the case below, the tables and areas in red show the components which were added to the classified model with zero impact to the existing structures, and zero impact to the existing data or content.
The first step is to make a copy of the unclassified data model, and build it behind closed doors. The second step is to build a one-way data copy, which pushes unclassified information directly to the classified staging area. The classified staging area then engages “import” routines that pull the data onto the classified machine. From here, the cleared data modelers and business users can make additions and changes to the data model safely. They can extend the model into the classified world and add all of the information that is deemed necessary without “replicating” development work, and without altering the original unclassified structures.
The structure of the Data Vault allows updates to be made to the unclassified data model and information set, and provides the ability to roll those into the classified world quickly, easily, and without disturbing any of the content that has been built. Absorption of changes and upgrades is near 100%, with near 1% impact.
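The copy-then-extend steps can be sketched with two in-memory SQLite databases standing in for the unclassified and classified machines. Everything named here (hub_person, sat_person_clearance, push_hub_person, the sample rows) is hypothetical, and the staging area and import routines are collapsed into direct calls for brevity.

```python
import sqlite3


def table_names(conn):
    """Return the set of table names currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


def push_hub_person(source, target):
    """One-way push: copy hub rows source -> target, skipping keys already there."""
    rows = source.execute("SELECT person_key, load_ts FROM hub_person").fetchall()
    with target:
        target.executemany("INSERT OR IGNORE INTO hub_person VALUES (?, ?)", rows)


unclassified = sqlite3.connect(":memory:")
unclassified.executescript("""
    CREATE TABLE hub_person (person_key TEXT PRIMARY KEY, load_ts TEXT NOT NULL);
    INSERT INTO hub_person VALUES ('P-001', '2024-01-01');
""")

# Step 1: copy the unclassified model and data to the classified side.
classified = sqlite3.connect(":memory:")
unclassified.backup(classified)

# Step 2: extend the classified copy only; the original is never altered.
classified.executescript("""
    CREATE TABLE sat_person_clearance (
        person_key TEXT NOT NULL REFERENCES hub_person(person_key),
        load_ts    TEXT NOT NULL,
        note       TEXT,
        PRIMARY KEY (person_key, load_ts)
    );
    INSERT INTO sat_person_clearance VALUES ('P-001', '2024-01-05', 'restricted detail');
""")

# Later: new unclassified data arrives and is pushed one way.
with unclassified:
    unclassified.execute("INSERT INTO hub_person VALUES ('P-002', '2024-02-01')")
push_hub_person(unclassified, classified)
```

Data flows in one direction only: the classified side gains the new unclassified row and keeps its own additions, while the unclassified side never sees the classified table.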
Benefits to the Public Sector
There are many benefits to the public sector, some of which have been covered above. Others include reduced cost of maintenance, reduced time to implement, a standard and repeatable data architecture, elimination or reduction of redundancy (exception: the classified world), increased worker productivity, and better mapping of the business.
- Provide automated monitoring, and self-managing systems.
- Achieve current administration goals quickly, and efficiently without “losing” or altering existing information structure and services.
- Reduction/Elimination of duplicate data stores.
- Dynamic Information Structures to meet ever-changing business dynamics.
- Do more with less, allow smaller teams to handle bigger projects with ease.
- Ability to load hundreds of thousands of transactions per second into the warehouse.
- Dynamically extendable, adaptable to structure changes.
- Binding function to form: using data mining, patterns can be discovered, weighed, and measured for applicability to the goal.
- Capable of structure changes without loss of information.
- Capable of dynamic/automated expansion and extension with new data sources.
The Data Vault architecture makes it easier to create systems that have 99.9999% (“six nines”) uptime. Because it’s based on semantic business keys, it is easy to add and delete “areas” of business without losing data in other areas. The linkages make it possible to change the data model and the grain of the information quickly and easily with very little downstream impact.
When data mining is used on the content, it can “create” new linkage structures on the fly, connecting information that previously was unconnected and thus modifying the model dynamically. It can also weigh the certainty of the connection that it created, and report these findings to the users.
What if you could monitor and watch for “hot spots” in data in near real time?
For example: would you be interested in dynamically discovering a large cash deposit, followed by several specific emails from a known/suspected terrorist, followed by purchase of airline tickets?
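A pattern like the one in this example can be sketched as an ordered-sequence check over time-stamped events. The event layout, the find_hot_spots name, and the 10,000 deposit threshold are assumptions for illustration, and "several specific emails" is simplified here to at least one email.

```python
from collections import defaultdict

# Ordered steps of the hypothetical pattern.
PATTERN = ("cash_deposit", "email", "airline_ticket")


def find_hot_spots(events, min_deposit=10_000):
    """Return person keys whose events contain PATTERN in time order.

    Each event is (timestamp, person_key, kind, amount). Other events may
    occur between the steps; only the order of the matching steps matters.
    """
    progress = defaultdict(int)  # person_key -> number of steps matched so far
    hits = []
    for _ts, person, kind, amount in sorted(events, key=lambda e: e[0]):
        step = progress[person]
        if step == len(PATTERN) or kind != PATTERN[step]:
            continue
        if kind == "cash_deposit" and amount < min_deposit:
            continue  # small deposits do not start the pattern
        progress[person] = step + 1
        if progress[person] == len(PATTERN):
            hits.append(person)
    return hits


events = [
    ("2024-03-01T09:00", "P-1", "cash_deposit", 25000.0),
    ("2024-03-02T10:00", "P-1", "email", 0.0),
    ("2024-03-03T08:00", "P-1", "airline_ticket", 900.0),
    ("2024-03-01T09:30", "P-2", "airline_ticket", 400.0),
    ("2024-03-02T09:30", "P-2", "cash_deposit", 50.0),
]
hot = find_hot_spots(events)
```

In this made-up data only P-1 matches, because P-2's events arrive in the wrong order and the deposit is below the threshold.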
The benefits are many. Contact us today to find out more about the Data Vault 2.0.