Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract and link meaningful facts. Current techniques focus primarily on pair-wise attribute matching and pay little attention to discovering direct and weighted cluster correlations for linking semantically equivalent datasets. We identify four major research issues in Data Linkage: the costs associated with pair-wise matching, record matching overheads, restrictions on the semantic flow of information, and the limitations of single-order classification.
In this doctoral dissertation, we introduce a new multi-faceted classification technique for performing structural analysis on knowledge domain clusters, using a novel Ontology Guided Data Linkage (OGDL) framework. To support self-organization of contributing databases through the discovery of structural dependencies, we introduce a series of algorithms that exploit ontological domain knowledge at multiple levels: tables, attributes and tuples. These techniques help automate the discovery of schema structures across multiple databases, based on direct and weighted correlations between different ontological concepts, and use a novel h-gram (hash gram) record matching technique for concept clustering and cluster mapping. Moreover, through a set of accuracy, performance and scalability experiments run on real-world datasets, we demonstrate the feasibility of our OGDL algorithms and show that our framework runs in polynomial time and performs well in practice.
Data Linkage is an important enabling technology in eHealth, as linked data offers a cost-effective way to translate research outcomes into health policies, detect adverse drug reactions, reduce costs, and uncover non-practices within the health system. Hence, to illustrate the efficiency and effectiveness of OGDL in real-world applications, we used the clinical risk management domain as our practical example throughout. To this end, we further extended our OGDL framework and introduced a composite clinical risk management success indicator data linkage, which consists of clinical risk factors combined with clinical resource and intervention factors that have been shown to be associated with good and safe patient outcomes and with quality health care. The aim is to introduce a novel primitive upper ontology for semantic interoperability of health data and subsequent clinical risk management, and to use it to map patient case data in order to reason about problems and solutions. Our experiments were performed on Australian emergency medicine clinical trial datasets and demonstrate an effective method for creating a new risk management approach using semantic interoperability and reasoning.
The main contributions of this thesis are: introducing a novel h-gram record matching technique that greatly reduces the number of comparisons required to determine entity similarities; providing a highly effective and efficient OGDL framework for querying and integrating heterogeneous databases in the presence of data uncertainties; demonstrating an effective method for identifying how different sets of tables, attributes and tuples can be linked, with the primary aim of understanding the past and predicting the future; providing a method for discovering ontological instances in domain-specific clusters that reveals how different sets of information are organized to support information flow; introducing a novel primitive upper ontology for semantic interoperability; and, finally, supporting the development of a best-practice clinical practice guideline assessment framework with evidence based on the collaboration platform’s health knowledge repository.