Semantic Data Services for:

 

Enterprise Data Interoperability

May 1, 2006

Brooke Stevenson

MetaMatrix

Account Manager, Civilian Sales

bstevenson@metamatrix.com
What we have here is a failure to Interoperate

A History of Data Interoperability

 

It is well documented that the Federal Government continues to struggle with, and fail at, implementing large-scale information-sharing programs. There are presidential directives to share information; OMB has promulgated the Data Reference Model; DISA promotes Net-Centric Enterprise Services; DOJ spent four years creating the Global Justice XSD and has now joined with DHS to develop the National Information Exchange Model (NIEM); the intelligence community has multiple efforts underway (DNI ESI, IC-MAP DoDIIS Data Layer); and many other efforts are trying to address information sharing. Even so, the goal remains elusive.

 

The complexity of these information-sharing programs is largely due to their dependency on a wide variety of disparate information sources: data silos independently designed and managed across departments, agencies, and state and local governments for other ongoing missions. There are significant roadblocks to accessing and integrating the data in these systems:

 

  • Dislocation – the access technology varies from system to system
  • Security – protocols and policies are typically source-specific
  • Disparity – data semantics, syntax, and structure vary widely
  • Interfaces/reuse – information is not reused or shared
  • Data Ownership – different agencies want to maintain control over their data

 

So, can the interoperability problem be solved? The government is on the right track with its decision to adopt Service Oriented Architecture (SOA). SOA is a set of policies, practices, principles, and frameworks that allow data and processes to be encapsulated as software services with standard interfaces and protocols, which can be accessed by a growing and ever-changing community of information consumers. By leveraging the industry-standard XML, SOAP, WSDL, and UDDI specifications, services can be published, discovered, and used in a technology-neutral, standard way, reducing the dependency on custom coding. XML, the most widely used of these standards, is used to define domain vocabularies such as GJXDM, NIEM, EXDL, HL7, and C2IEDM, and the list goes on.
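
To make the idea of a standard service interface concrete, the sketch below builds a SOAP 1.1 request envelope using only Python's standard library. It is an illustration under stated assumptions: the operation name `GetPersonOfInterest` and its `PersonName` parameter are hypothetical, and a real service would declare its operations and message types in its WSDL.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP_NS)


def build_request(operation: str, params: dict) -> str:
    """Wrap a (hypothetical) operation and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(op, name).text = value
    return ET.tostring(envelope, encoding="unicode")


# The operation and parameter names are invented for illustration.
print(build_request("GetPersonOfInterest", {"PersonName": "John Doe"}))
```

Because the envelope structure is the same for every service, a consumer only needs the WSDL to learn which operation and parameters to place in the body.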

 

Given the increasing interest in and adoption of these vocabularies, the availability of commercial off-the-shelf (COTS) tools for producing data services that conform both to these vocabularies and to web service standards means that the solution to the interoperability conundrum is at hand.

 


Semantic Data Services

 

Semantic Data Services leverage domain vocabularies and technology standards to provide secure real-time access to existing data sources. Providing the context for sharing information based on program needs, Semantic Data Services also support the dispersed data ownership requirement that generally exists for these programs.

 

Semantic Data Services are:

 

1.      Data objects with domain-specific semantics (vocabularies)

2.      Gateways to legacy information systems

3.      Accessible via SQL, XQuery, or SOAP requests

4.      Reachable through JDBC, ODBC, SOAP, or JMS interfaces

5.      Defined as relational schemas, XML documents, or W3C-compliant web services

6.      Discoverable via UDDI registries as WSDL, or

7.      Discoverable via standard ANSI SQL-92

 

Semantic Data Services provide access to data sources that have been established independently by state and local governments, federal agencies, and commercial information providers. The disparate nature of these data sources presents two key challenges to sharing:

 

1.      A variety of software platforms.

 

Many commercial technologies have been developed and used to integrate information across different data management platforms, typically using connector or adaptor frameworks. Some of the more well-known integration technologies are MOM, EAI, ETL, and EII. The best of these commercial technologies require very little custom programming, if any. Semantic Data Services eliminate custom coding completely.

 

2.      A range of vocabularies.

 

The vocabularies typically vary based on the mission of the data-collecting entity and are unique to the vernacular of the agency, user community, and technical design team. For example, the Baltimore City Police Department may reference an individual as a “Suspect” within an Oracle database, while the FBI might use the term “Person of Interest” in an XML document. The difference in terminology, “Suspect” versus “Person of Interest”, is a semantic problem. Understanding semantics is essential to interoperability.
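
The terminology gap in this example can be pictured as a lookup from each source's native term to a shared domain term. The sketch below is illustrative only: the `SHARED_TERMS` table, the `to_shared_term` function, and the source labels are hypothetical names, not part of any product or standard.

```python
# Hypothetical mapping from (source system, native term) to a shared domain term.
SHARED_TERMS = {
    ("baltimore_pd_oracle", "Suspect"): "PersonOfInterest",
    ("fbi_xml", "Person of Interest"): "PersonOfInterest",
}


def to_shared_term(source: str, native_term: str) -> str:
    """Return the shared term for a source-specific term, or raise if unmapped."""
    try:
        return SHARED_TERMS[(source, native_term)]
    except KeyError:
        raise KeyError(
            f"No semantic mapping for {native_term!r} from {source!r}"
        ) from None


# Both native terms resolve to the same shared concept.
print(to_shared_term("baltimore_pd_oracle", "Suspect"))  # PersonOfInterest
print(to_shared_term("fbi_xml", "Person of Interest"))   # PersonOfInterest
```

Even this toy table shows why the problem grows quickly: every source contributes its own terms, and each must be reconciled with the shared vocabulary.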

 

For lack of helpful tools, the magnitude of the semantic problem has been widely underestimated, and it is often what halts progress within a program. With the introduction of Semantic Data Services and other semantic mediation technologies, it is now possible to discover and manage semantic relationships across information systems in a timely manner and on a large scale. Adopting commercial semantic mediation technologies closes the last technology gap the government needs to fill to support information sharing.


Understanding Semantic Data Services

 

Figure 1. Data Source Services (defined below) provide seamless access to disparate information systems. Business Data Services (defined below), as depicted, can be mapped from Data Source Services, but can also be transformed from other Business Data Services, as depicted in Figure 2, where the XML Document Interface is accessed by the TWPDES Business Data Service.

 

Semantic Data Services can be categorized in two ways, as:

 

1.      Data Source Services: facades that encapsulate existing information resources and their native semantics, allowing seamless access to, and federation and deployment of, disparate data sources.

 

Data Source Services will likely be in the form of XML documents or Web Services as independent data owners migrate to SOA standards. Many Data Source Services are defined in the form of Relational Schemas, leveraging out-of-the-box connectivity to standard JDBC and ODBC databases across the local, state, regional, and federal organizations. Data Source Services also leverage connectivity to other data applications, such as spreadsheets, text files, ERP, CRM, and other proprietary applications.

 

Data Source Services, captured typically through automated wizards, contain knowledge of the underlying structure and semantics of data sources. Data Source Services provide an information abstraction layer that can manage data access and interchange in a uniform manner.

 

2.      Business Data Services: data objects or views transformed from Data Source Services, which support real-time information retrieval according to domain-specific or mission-specific semantics.

 

As the government continues to adopt Service Oriented Architecture (SOA) and evolve interoperability standards such as the Global Justice XML Data Model (GJXDM) and the National Information Exchange Model (NIEM), Business Data Services will most likely be defined in the form of Web Services that provide access to data in the form of XML documents. These documents will comply with pre-defined information exchange standards such as the Terrorist Watchlist Person Data Exchange Standard (TWPDES). TWPDES, produced by the ICMWG (Intelligence Community Metadata Working Group), is currently in the form of a GJXDM-compliant XML Schema.

 

Generation of Business Data Services is also semi-automated by wizards. Represented as Relational Schemas, XML Schemas, or Web Services, these services can be reverse-engineered from XSD files, WSDL files, or third-party modeling tools such as ERwin, Popkin, or Rational Rose.

 

Business Data Services are defined with mappings to one or more Data Source Services, as depicted in Figure 1. This is where the details of data integration are defined, including such commonly needed data reconciliation functions as name, attribute, and data type conversions. These services represent combined, transformed views of the relevant data sources. Although Business Data Services are more abstract in nature (no data source exists with this schema instantiated as its model), they appear as concrete as a Data Source Service when they are deployed.
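
The name, attribute, and data type conversions mentioned above can be sketched as small mapping functions, one per source. This is an illustration under assumptions, not a product feature: the source rows, the field names (`SUSPECT_NM`, `fullName`, and so on), and the target fields `PersonName` and `BirthDate` are all hypothetical.

```python
from datetime import date, datetime

# Hypothetical rows as two Data Source Services might return them.
police_row = {"SUSPECT_NM": "DOE, JOHN", "DOB": "19700131"}
federal_row = {"fullName": "Jane Roe", "birthDate": "1982-07-04"}


def from_police(row: dict) -> dict:
    """Rename fields, reorder 'LAST, FIRST' names, and parse a YYYYMMDD date."""
    last, first = [part.strip().title() for part in row["SUSPECT_NM"].split(",")]
    return {
        "PersonName": f"{first} {last}",
        "BirthDate": datetime.strptime(row["DOB"], "%Y%m%d").date(),
    }


def from_federal(row: dict) -> dict:
    """Rename fields and parse an ISO 8601 date string."""
    return {
        "PersonName": row["fullName"],
        "BirthDate": date.fromisoformat(row["birthDate"]),
    }


# The business view combines the transformed rows from both sources.
business_view = [from_police(police_row), from_federal(federal_row)]
for person in business_view:
    print(person["PersonName"], person["BirthDate"].isoformat())
```

In a model-driven product these conversions would be captured as metadata rather than hand-written functions; the code only shows what each mapping has to express.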

 

Business Data Services are a subset of a broader set of Business Services an interoperability program will be required to support. The primary other type of Business Service is a Business Process Service, which is generally created and deployed with MOM, EAI technology, or on an Enterprise Service Bus. Business Process Services will leverage Business Data Services for domain-specific access to disparate information sources.

 

We have now defined a repeatable model for supporting interoperability across any arbitrary number of information sources:

  1. Capture all relevant Data Source Services.
  2. Build Business Data Services to support relevant domains and reporting needs.
  3. Map each abstract Business Data Service to the relevant Data Source Services and define the required transformations.
  4. Deploy the services for SQL, XQuery or SOAP access from a web portal.
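
The four steps above can be sketched with an in-memory SQLite database, where tables stand in for Data Source Services and a view stands in for a Business Data Service. This is a deliberately simplified picture: a single SQLite database replaces what would really be federation across separate systems, and every table, view, and column name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: stand-ins for two captured Data Source Services (hypothetical tables).
conn.executescript("""
    CREATE TABLE city_suspects (suspect_name TEXT, dob TEXT);
    CREATE TABLE federal_persons (full_name TEXT, birth_date TEXT);
    INSERT INTO city_suspects VALUES ('John Doe', '1970-01-31');
    INSERT INTO federal_persons VALUES ('Jane Roe', '1982-07-04');
""")

# Steps 2 and 3: a Business Data Service defined as a view mapped to both sources.
conn.execute("""
    CREATE VIEW person_of_interest AS
        SELECT suspect_name AS person_name, dob AS birth_date, 'city' AS origin
        FROM city_suspects
        UNION ALL
        SELECT full_name, birth_date, 'federal'
        FROM federal_persons
""")

# Step 4: consumers query the view with ordinary SQL.
rows = conn.execute(
    "SELECT person_name, origin FROM person_of_interest ORDER BY person_name"
).fetchall()
print(rows)  # [('Jane Roe', 'federal'), ('John Doe', 'city')]
```

The consumer never sees the source tables or their native column names, which is the abstraction the Business Data Service is meant to provide.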

 

 


Semantic Data Services for “Working Groups”

 

The burden in managing information access in a shared environment is typically split between the centralized program team and the data owners. Programs often form “Working Groups” to encourage frequent communication and joint decision making between these independent groups. Semantic Data Services provide a tool to facilitate a very granular level of communication regarding what information each party will provide or consume and how.

 

However, the sticking point for providers is often that they also bear the burden of supporting legacy applications that use the same data that is desired for the sharing initiative. Business Data Services are an abstraction mechanism that can be used by data owners and/or the centralized program team to control who sees and has access to what information. Specifically, when data owners prefer not to provide direct access to databases or data applications, they can use the concept locally to define XML Documents and/or Web Services in an effort to better control the shared access.

 

 

The Watch-list Example

 

The Semantic Data Services approach can be used to tackle the Terrorist Tracking problem. Given TWPDES (Terrorist Watchlist Person Data Exchange Standard), an existing XML Schema, a Business Data Service can be generated to expose all “person of interest” information. The TWPDES Business Data Service represents a combined, transformed view of the various government data sources (local, state, regional, and federal) across which we can build a complete interoperable picture for tracking terrorists.

Figure 2. The process for generating a TWPDES-compliant Web Service.

 

The process for generating TWPDES-compliant Web Services involves several steps, most of which are automated within Semantic Data Services COTS products.

 

1.      Import of the existing TWPDES.xsd

2.      Reverse-engineering of the Data Source Services

3.      Generation of a TWPDES XML Document (Business Data Service), which is mapped to Data Source Services

4.      Automatic (if a pre-defined WSDL exists) or manual generation of TWPDES Web Service (Business Data Service), which is mapped to the TWPDES XML Document

5.      Automatic generation of WSDL (if it was not pre-defined) and deployment to a UDDI registry
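
Step 3 above, producing an XML document from mapped source data, might look roughly like the following. This is a sketch only: the element names are placeholders and are not taken from the TWPDES schema, which is not reproduced here, and the `person_to_xml` function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def person_to_xml(person: dict) -> str:
    """Serialize one mapped person record into a small XML document."""
    root = ET.Element("WatchlistPerson")  # placeholder name, not from TWPDES
    ET.SubElement(root, "PersonName").text = person["PersonName"]
    ET.SubElement(root, "BirthDate").text = person["BirthDate"]
    return ET.tostring(root, encoding="unicode")


# A record as it might arrive from a mapped Data Source Service.
doc = person_to_xml({"PersonName": "John Doe", "BirthDate": "1970-01-31"})
print(doc)
```

In practice the document structure would be driven by the imported XSD rather than written by hand, and the result would be validated against that schema.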

 

Leveraging the power of a metadata-driven architecture, rather than costly and brittle programmatic solutions, model-driven COTS provides an extensible and scalable infrastructure for defining, relating, and accessing disparate government data sources.

 

 

Semantic Interoperability: Automatically Resolving Vocabularies

 

The Federal Government will benefit by adopting an automated approach to semantic mapping of disparate schemas and vocabularies. TWPDES, a discrete information exchange packet, has over 1,000 core entities. The NIEM Universal schema, one of many within NIEM, has over 100,000 core entities. No team can realistically map tens of thousands of entities across hundreds of data sources by hand. Even programs that deal with only a dozen data sources yield hundreds of thousands of potential mappings. This mapping effort alone has stopped many interoperability programs dead in their tracks.
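
A back-of-the-envelope calculation shows how quickly the candidate mappings add up. The figure of 50 elements per source is an assumption chosen for illustration; only the dozen sources and the 1,000-entity target come from the text above.

```python
def candidate_mappings(num_sources: int, elements_per_source: int,
                       target_entities: int) -> int:
    """Count every source-element/target-entity pair that could be a mapping."""
    return num_sources * elements_per_source * target_entities


# A dozen sources, an assumed 50 elements each, against a 1,000-entity target.
print(candidate_mappings(12, 50, 1_000))  # 600000
```

Most of these pairs are not real matches, but each one is a candidate that someone, or some tool, has to rule in or out.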

 

Automated semantic matching, a necessary component of any Semantic Data Services product, helps domain experts reconcile semantics more quickly across a dispersed information environment. The matching technology is an extensible, ontology-driven tool that implements a variety of algorithms for determining semantic equivalence. It leverages the previously defined Data Source and Business Data Services to speed deployment of a mediation solution by automatically exposing potential semantic matches.

 

Automated semantic matching discovers similarities between elements of heterogeneous structured data sources, that is, Data Source Services with department-specific vocabularies. It also supports matching elements of data sources, represented as Data Source Services, to the target schemas of Business Data Services, such as TWPDES or any other GJXDM, NIEM, C2IEDM, HL7, or similar schema. The key steps in running an automated semantic matching process are:

 

  1. Identify the formal vocabularies for each of the Semantic Data Services captured.
  2. Select the services to be matched.
  3. Run the "Matching" wizard and select from a variety of algorithms.
  4. Review the output of the wizard, the "match results".

 

Examples of potential matches that can be automatically discovered with automated semantic matching are:

  • "PersonSexCode" matches "GenderId" and "GenderDescription" based on the semantic equivalence of the words “Sex” and “Gender”.
  • "DayOfMonth" matches "Calendar_Date" based on the semantic relationships between the words used within these data element names.
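
One simple way to arrive at matches like these is to split element names into words and compare the concepts those words express. The sketch below is a toy illustration, not the algorithm any product uses: the `CONCEPTS` table is a hypothetical, hand-written stand-in for a thesaurus or ontology, and a shared concept is treated as a candidate match only.

```python
import re

# Toy thesaurus: each known word maps to the concept it expresses (assumption).
CONCEPTS = {
    "sex": "gender", "gender": "gender",
    "day": "date", "month": "date", "date": "date", "calendar": "date",
}


def tokens(name: str) -> set[str]:
    """Split camelCase and snake_case element names into lowercase words."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return {w.lower() for w in words}


def concepts(name: str) -> set[str]:
    """Map an element name to the set of thesaurus concepts its words express."""
    return {CONCEPTS[w] for w in tokens(name) if w in CONCEPTS}


def is_candidate_match(a: str, b: str) -> bool:
    """Two element names are candidate matches if they share any concept."""
    return bool(concepts(a) & concepts(b))


print(is_candidate_match("PersonSexCode", "GenderId"))      # True
print(is_candidate_match("DayOfMonth", "Calendar_Date"))    # True
print(is_candidate_match("PersonSexCode", "Calendar_Date")) # False
```

A candidate match still needs review by a domain expert: "DayOfMonth" and "Calendar_Date" are related, but they are not interchangeable without a conversion.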

 

 

Figure 3. Semantic similarities are revealed across data sources using the semantic relationships stored within a pre-populated upper-ontology. The pre-populated upper-ontology is essentially a combined Dictionary and Thesaurus of the English Language. This upper-ontology can be extended or replaced by domain specific ontologies.

 

 

 


Implications for Efficient and Effective Information Sharing

 

By capturing the transformations and mappings needed to define Business Data Services from Data Source Services, and then using those definitions to drive integration directly, COTS products can provide federation and semantic capabilities that are highly efficient (because they leverage existing agency data sources) and extremely flexible. Automated semantic matching, with an extensible upper-ontology (see Figure 3) and match engine, greatly increases the efficiency with which the mappings can be discovered.

 

Semantic Data Services products provide the following benefits:

  1. Time and Cost Effective:  Semantic Data Services provide an efficient, repeatable model, so that new domains or business functions, in the form of Business Data Services, can be deployed in a matter of weeks rather than months or years, with minimal cost and no custom programming.
  2. Elimination of Data Redundancy:  Data Source Services, as abstract entry points to existing information systems, provide a means for real-time retrieval and update of data in place, so the data need not be copied into a separate store.
  3. Data Ownership:  Semantic Data Services provide a way to leverage existing data sources regardless of location, structure, or semantics, while allowing existing owners to maintain control of the systems.
  4. Web-based:  Semantic Data Services can be accessed via a number of standard interfaces (JDBC, ODBC, and SOAP) by any off-the-shelf web application/portal or a home-grown tool.
  5. Open: Semantic Data Services COTS products should be Java-based software and should also provide data in standard formats such as relational, XML, or Web Services.
  6. Scalable: Semantic Data Services COTS products can be easily and quickly integrated with other data storage technologies and systems, and can be distributed across CPUs to maximize support for a large number of simultaneous users.
  7. Model-driven architecture: Semantic Data Services are defined as standards-based models. Many formats are supported out of the box, and transformations between these standards can be defined automatically. The advantage of this capability is that standardization of data elements becomes less of an issue. Thus, model-driven integration of data exhibits platform independence.
  8. Re-usable: Semantic Data Services can be reused for both design efforts and data access needs. Data Source Services are, by their nature, reusable for the creation of any number and variety of Business Data Services. The vocabularies and transformations are also reusable to speed up the definition of Business Data Services and the matching process.
  9. Change Management: Impact analysis can be performed across all Semantic Data Services when changes are made to: underlying sources, program requirements, or information exchange standard vocabularies. Semantic Data Services support full version control.
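
The impact analysis described under Change Management can be pictured as a walk over the dependencies between services. The sketch below is illustrative only: the `DEPENDENTS` graph and the service names in it are hypothetical, and a real product would derive these relationships from its stored mappings.

```python
from collections import deque

# Hypothetical dependency graph: each service maps to the services built on it.
DEPENDENTS = {
    "city_suspects_source": ["person_xml_document"],
    "federal_persons_source": ["person_xml_document"],
    "person_xml_document": ["person_web_service"],
    "person_web_service": [],
}


def impacted_by(changed: str) -> list[str]:
    """Return every downstream service affected by a change, breadth-first."""
    seen, queue, order = {changed}, deque([changed]), []
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


print(impacted_by("city_suspects_source"))
# ['person_xml_document', 'person_web_service']
```

Because the mappings are held as metadata, a change to one source can be traced to every Business Data Service that depends on it before the change is deployed.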