GFilter: A General Gram Filter for String Similarity Search

Hu, Haoji, Zheng, Kai, Wang, Xiaoling and Zhou, Aoying (2015) GFilter: A General Gram Filter for String Similarity Search. IEEE Transactions on Knowledge and Data Engineering, 27 4: 1005-1018. doi:10.1109/TKDE.2014.2349914

Author Hu, Haoji
Zheng, Kai
Wang, Xiaoling
Zhou, Aoying
Title GFilter: A General Gram Filter for String Similarity Search
Journal name IEEE Transactions on Knowledge and Data Engineering   Check publisher's open access policy
ISSN 1041-4347
Publication date 2015-04-01
Year available 2014
Sub-type Article (original research)
DOI 10.1109/TKDE.2014.2349914
Open Access Status
Volume 27
Issue 4
Start page 1005
End page 1018
Total pages 14
Place of publication Piscataway, NJ, United States
Publisher Institute of Electrical and Electronics Engineers
Collection year 2016
Language eng
Abstract Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a string as the query, how to efficiently find all the similar answers from a large scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper, we propose a gram-based framework to achieve near maximum filter performance. The main idea is to judiciously choose the high-quality grams as the prefix of query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with the state-of-art approaches.
Keyword Data integration
Gram-based framework
Similarity search
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ

Document type: Journal Article
Sub-type: Article (original research)
Collections: Official 2016 Collection
School of Information Technology and Electrical Engineering Publications
Version Filter Type
Citation counts: TR Web of Science Citation Count  Cited 0 times in Thomson Reuters Web of Science Article
Scopus Citation Count Cited 0 times in Scopus Article
Google Scholar Search Google Scholar
Created: Sun, 03 May 2015, 01:15:22 EST by System User on behalf of Scholarly Communication and Digitisation Service