The interaction between schema matching and record matching in data integration

Gu, Binbin, Li, Xiangliang, Zhang X., Liu, An, Liu, Guanfeng, Zheng, Kai, Zhao, Lei and Zhou, Xiaofang (2017) The interaction between schema matching and record matching in data integration. IEEE Transactions on Knowledge and Data Engineering, 29(1): 186-199. doi:10.1109/TKDE.2016.2611577


Author Gu, Binbin
Li, Xiangliang
Zhang X.
Liu, An
Liu, Guanfeng
Zheng, Kai
Zhao, Lei
Zhou, Xiaofang
Title The interaction between schema matching and record matching in data integration
Journal name IEEE Transactions on Knowledge and Data Engineering
ISSN 1041-4347
1558-2191
Publication date 2017-01-01
Year available 2016
Sub-type Article (original research)
DOI 10.1109/TKDE.2016.2611577
Open Access Status Not yet assessed
Volume 29
Issue 1
Start page 186
End page 199
Total pages 14
Place of publication Piscataway, NJ, United States
Publisher Institute of Electrical and Electronics Engineers
Language eng
Subject 1710 Information Systems
1706 Computer Science Applications
1703 Computational Theory and Mathematics
Abstract Schema Matching (SM) and Record Matching (RM) are two necessary steps in integrating multiple relational tables of different schemas, where SM unifies the schemas and RM detects records referring to the same real-world entity. The two processes have been thoroughly studied separately, but little attention has been paid to the interaction of SM and RM. In this work, we find that, even when alternated in a simple manner, SM and RM can benefit from each other to reach a better integration performance (i.e., in terms of precision and recall). Therefore, combining SM and RM is a promising solution for improving data integration. To this end, we define novel matching rules for SM and RM, respectively; that is, every SM decision is made based on intermediate RM results, and vice versa, such that SM and RM can be performed alternately. The quality of integration is guaranteed by a Matching Likelihood Estimation model and the control of semantic drift, which prevent the effect of mismatch magnification. To reduce the computational cost, we design an index structure based on q-grams and a greedy search algorithm that can reduce the overhead of the interaction by around 90 percent. Extensive experiments on three data collections show that the combination and interaction of SM and RM significantly outperform previous works that conduct SM and RM separately.
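The abstract mentions an index structure based on q-grams but this record does not describe it. As a rough illustration of the general idea only, the sketch below builds a generic q-gram inverted index and uses it to find candidate matches without comparing every pair of values. The function names (qgrams, build_index, candidates), the "#" padding, the Jaccard similarity and the 0.5 threshold are all assumptions made for this example, not the authors' design.

```python
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(values, q=2):
    """Map each q-gram to the ids of the values that contain it."""
    index = defaultdict(set)
    for value_id, value in enumerate(values):
        for gram in qgrams(value, q):
            index[gram].add(value_id)
    return index


def candidates(query, index, values, q=2, threshold=0.5):
    """Return (id, Jaccard similarity) pairs, best match first.

    Only values sharing at least one q-gram with the query are scored,
    which is where the index saves work over an all-pairs comparison.
    """
    query_grams = qgrams(query, q)
    seen = set()
    for gram in query_grams:
        seen |= index.get(gram, set())
    scored = []
    for value_id in seen:
        grams = qgrams(values[value_id], q)
        sim = len(query_grams & grams) / len(query_grams | grams)
        if sim >= threshold:  # hypothetical cut-off, chosen for illustration
            scored.append((value_id, sim))
    return sorted(scored, key=lambda pair: -pair[1])
```

Calling build_index once over a column of values and then candidates for each incoming value keeps the number of similarity computations proportional to the number of values that share a q-gram, rather than to the size of the whole column.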
Keyword Data integration
Record matching
Schema matching
Q-Index Code C1
Q-Index Status Provisional Code
Grant ID DP120102829
Institutional Status UQ

Document type: Journal Article
Collections: HERDC Pre-Audit
School of Information Technology and Electrical Engineering Publications
 
Citation counts: Cited 3 times in Thomson Reuters Web of Science
Cited 3 times in Scopus
Created: Tue, 03 Jan 2017, 10:31:40 EST by System User on behalf of Learning and Research Services (UQ Library)