Approximate record matching using hash grams

Gollapalli, Mohammed, Li, Xue, Wood, Ian and Governatori, Guido (2011). Approximate record matching using hash grams. In: Myra Spiliopoulou, Haixun Wang, Diane Cook, Jian Pei, Wei Wang, Osmar Zaiane and Xindong Wu, Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDM 2011). 2011 IEEE 11th International Conference on Data Mining (ICDM 2011), Vancouver Canada, (504-511). 11-14 December 2011. doi:10.1109/ICDMW.2011.33


Author Gollapalli, Mohammed
Li, Xue
Wood, Ian
Governatori, Guido
Title of paper Approximate record matching using hash grams
Conference name 2011 IEEE 11th International Conference on Data Mining (ICDM 2011)
Conference location Vancouver Canada
Conference dates 11-14 December 2011
Proceedings title Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDM 2011)
Place of Publication Piscataway, NJ, United States
Publisher IEEE
Publication Year 2011
Sub-type Fully published paper
DOI 10.1109/ICDMW.2011.33
ISBN 9781467300056
9780769544090
Editor Myra Spiliopoulou
Haixun Wang
Diane Cook
Jian Pei
Wei Wang
Osmar Zaiane
Xindong Wu
Start page 504
End page 511
Total pages 8
Collection year 2012
Language eng
Abstract/Summary Accurately identifying duplicate records across multiple data sources is a persistent problem that continues to plague organizations and researchers alike. Small inconsistencies can prevent two otherwise identical records from being detected as duplicates. In this paper, we present a new probabilistic h-gram (hash gram) record matching technique that extends traditional n-grams and uses scale-based hashing for equality testing. By transforming data into its equivalent numerical realities, h-gram matching greatly reduces the number of comparisons needed for duplicate record detection, and it is applicable to a variety of data types and data sizes. One of the key features of h-gram matching is that it is highly extensible, providing more intuitive and flexible results. With the sampling technique in place, our method can be applied to databases of variable size to perform data linkage, and probabilistic results can be obtained quickly. We have extensively evaluated h-gram matching on large samples of real-world data, and the results show a higher level of accuracy as well as a reduction in required time when compared with existing techniques.
Q-Index Code E1
Q-Index Status Confirmed Code
Institutional Status UQ
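
Illustrative sketch (not from the paper): the abstract describes building on n-grams and comparing hashed values instead of raw strings. The Python below shows only that general idea, as a generic hashed character n-gram comparison scored with Jaccard similarity. It is not the authors' h-gram algorithm, and it does not implement their scale-based hashing or sampling technique. The function names, the choice of CRC32 as the hash, and the bigram size n=2 are all assumptions made for illustration.

```python
import zlib


def hashed_ngrams(text, n=2):
    """Return the set of integer hashes of the character n-grams of `text`.

    Assumption: CRC32 stands in for whatever hash the paper uses.
    """
    # Normalize case and collapse runs of whitespace before splitting.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        grams = [normalized] if normalized else []
    else:
        grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    return {zlib.crc32(g.encode("utf-8")) for g in grams}


def similarity(a, b, n=2):
    """Jaccard similarity of the hashed n-gram sets of two field values."""
    ha, hb = hashed_ngrams(a, n), hashed_ngrams(b, n)
    if not ha and not hb:
        return 1.0  # two empty values are treated as identical
    return len(ha & hb) / len(ha | hb)


if __name__ == "__main__":
    # A near-duplicate should score well above an unrelated value.
    print(similarity("John Smith", "Jon Smith"))
    print(similarity("John Smith", "Mary Jones"))
```

Comparing sets of integers rather than substrings keeps each equality test cheap, which is the kind of saving the abstract alludes to. How the paper actually derives and scales its hash values is not stated in this record.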

 
Created: Tue, 21 Feb 2012, 17:58:00 EST by Mr Ian Wood on behalf of School of Information Technol and Elec Engineering