Record matching with non-key attribute values

Yang, Qiang, Li, Zhi-Xu, Jiang, Jun, Zhao, Peng-Peng, Liu, Guan-Feng, Liu, An and Zhou, Xiao-Fang (2016) Record matching with non-key attribute values. Jisuanji Xuebao/Chinese Journal of Computers, 39 10: 2075-2087. doi:10.11897/SP.J.1016.2016.02075


Author Yang, Qiang
Li, Zhi-Xu
Jiang, Jun
Zhao, Peng-Peng
Liu, Guan-Feng
Liu, An
Zhou, Xiao-Fang
Title Record matching with non-key attribute values
Language of Title eng
Journal name Jisuanji Xuebao/Chinese Journal of Computers   Check publisher's open access policy
ISSN 0254-4164
Publication date 2016-10-01
Sub-type Article (original research)
DOI 10.11897/SP.J.1016.2016.02075
Open Access Status Not yet assessed
Volume 39
Issue 10
Start page 2075
End page 2087
Total pages 13
Place of publication Beijing, China
Publisher Zhongguo Kexueyuan Jisuan Jishu Yanjiusuo / Chinese Academy of Sciences, Institute of Computing Technology
Collection year 2017
Language chi
Subject 1712 Software
1708 Hardware and Architecture
1705 Computer Networks and Communications
1704 Computer Graphics and Computer-Aided Design
Abstract The advanced development of various technologies on social network, e-commerce and online education has contributed to an increasing amount of large-scale network data. Among all sorts of network analysis tasks, one basic task is to search important nodes in a network. Closeness centrality is one of the popular metrics which measure the importance of a node in a network. Based on the closeness centrality, the basic task is called top-k closeness centrality search. However, the existing exact approaches cannot process large-scale networks because of their polynomial time complexity. Recently, some approximation algorithms are proposed, which achieve high performance by sacrificing the precision of results. But according to our study, we find that the loss of the precision of results is too much. To improve the precision of results while maintaining the high performance, in this paper, we propose a Sketch-based approximation algorithm for fast searching top-k closeness centrality in a large-scale network. The new algorithm is developed based on a new computation method which calculates the centrality by estimating the number of nodes within a certain distance by a data structure called FM-Sketch. The new algorithm has time complexity O(m tDmax), where t is a constant, Dmax is the diameter of a network and m is the number of edges in a network. With the small-world phenomenon assumption, the Sketch-based algorithm is a linear algorithm. Finally, we compare our Sketch-based algorithm with the state-of-the-art exact and approximation algorithms through extensive experiments. The results demonstrate the advantages of the new solution.
Formatted abstract
摘 要 实体匹配旨在找出不同数据源中指代同一实体的实例.已有的实体匹配方法大都基于实体主属性值的相
似度进行匹配,而很少有工作考虑到使用实体的非主属性值来辅助实体匹配.然而,当两条指代同一实体的主属性
值差异较大的时候,这两个实体可能不会被认为是匹配的实体.另一方面,这两个实体很可能共享一些特别的非主
属性值,而这些非主属性值恰好可以反映出两个实体的匹配关系.基于这种思想,文中提出了一种新颖的基于非主
属性值的实体匹配算法.该算法以类似于决策树的结构为基础,通过使用这种结构,不仅可以解决噪声值和空缺值
带来的问题,而且可以极大地提高发现匹配记录以及尽可能早地排除不匹配记录的效率.多个数据集上的实验结
果表明我们的方法比现有的实体匹配方法具有更高的准确率和召回率.此外,使用我们提出的基于决策树的匹配
算法等有关技术较 Baseline匹配算法在匹配效率上高出10倍多.
Keyword Algorithm
Data quality
Non-key attribute
Performance
Record matching
Q-Index Code C1
Q-Index Status Provisional Code
Institutional Status UQ

Document type: Journal Article
Sub-type: Article (original research)
Collections: HERDC Pre-Audit
School of Information Technology and Electrical Engineering Publications
 
Versions
Version Filter Type
Citation counts: Scopus Citation Count Cited 0 times in Scopus Article
Google Scholar Search Google Scholar
Created: Tue, 08 Nov 2016, 11:01:19 EST by System User on behalf of Learning and Research Services (UQ Library)