VChunkJoin: an efficient algorithm for edit similarity joins

Wang, Wei, Qin, Jianbin, Xiao, Chuan, Lin, Xuemin and Shen, Heng Tao (2013) VChunkJoin: an efficient algorithm for edit similarity joins. IEEE Transactions on Knowledge and Data Engineering, 25 8: 1916-1929. doi:10.1109/TKDE.2012.79

Attached Files (Some files may be inaccessible until you login with your UQ eSpace credentials)
Name Description MIMEType Size Downloads

Author Wang, Wei
Qin, Jianbin
Xiao, Chuan
Lin, Xuemin
Shen, Heng Tao
Title VChunkJoin: an efficient algorithm for edit similarity joins
Journal name IEEE Transactions on Knowledge and Data Engineering   Check publisher's open access policy
ISSN 1041-4347
Publication date 2013-08
Year available 2012
Sub-type Article (original research)
DOI 10.1109/TKDE.2012.79
Volume 25
Issue 8
Start page 1916
End page 1929
Total pages 14
Place of publication Institute of Electrical and Electronics Engineers
Publisher United States
Collection year 2014
Language eng
Formatted abstract
Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
Keyword Edit similarity joins
Approximate string matching
Prefix filtering
Edit distance
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ
Additional Notes Published online 10 Apr. 2012.

Document type: Journal Article
Sub-type: Article (original research)
Collections: Official 2014 Collection
School of Information Technology and Electrical Engineering Publications
Version Filter Type
Citation counts: TR Web of Science Citation Count  Cited 3 times in Thomson Reuters Web of Science Article | Citations
Scopus Citation Count Cited 16 times in Scopus Article | Citations
Google Scholar Search Google Scholar
Created: Sun, 11 Aug 2013, 00:05:17 EST by System User on behalf of School of Information Technol and Elec Engineering