Sampling dirty data for matching attributes

Koehler, Henning, Zhou, Xiaofang, Sadiq, Shazia, Shu, Yanfeng and Taylor, Kerry (2010). Sampling dirty data for matching attributes. In: Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10. 2010 ACM SIGMOD/PODS Conference, Indianapolis, IN, USA, (63-74). 6-11 June 2010. doi:10.1145/1807167.1807177

Author Koehler, Henning
Zhou, Xiaofang
Sadiq, Shazia
Shu, Yanfeng
Taylor, Kerry
Title of paper Sampling dirty data for matching attributes
Conference name 2010 ACM SIGMOD/PODS Conference
Conference location Indianapolis, IN, USA
Conference dates 6-11 June 2010
Proceedings title Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10
Journal name Proceedings of the ACM SIGMOD International Conference on Management of Data
Place of Publication New York, NY, United States
Publisher Association for Computing Machinery
Publication Year 2010
Sub-type Fully published paper
DOI 10.1145/1807167.1807177
ISBN 9781450300322
ISSN 0730-8078
Start page 63
End page 74
Total pages 12
Language eng
Formatted Abstract/Summary
We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when data from different sources is integrated. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
Keyword Algorithms
Q-Index Code E1
Q-Index Status Confirmed Code
Institutional Status UQ
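
The abstract describes measuring value overlap between two attributes while tolerating small differences between individual strings, and then filtering in two stages. The Python sketch below illustrates that general idea only; it is not the paper's measures or algorithms. The q-gram Jaccard string similarity, the thresholds, and all function names (`qgrams`, `string_sim`, `exact_overlap`, `fuzzy_overlap`, `gram_containment`, `two_stage`) are assumptions made for illustration.

```python
def qgrams(s, q=2):
    """Set of q-grams of a normalized (trimmed, lowercased), padded string."""
    padded = "#" * (q - 1) + s.strip().lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_sim(a, b, q=2):
    """Jaccard similarity of the q-gram sets (an assumed string measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def exact_overlap(xs, ys):
    """Fraction of distinct values in xs that occur verbatim in ys."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs) if xs else 0.0


def fuzzy_overlap(xs, ys, threshold=0.6, q=2):
    """Fraction of distinct values in xs with a sufficiently similar value in ys."""
    xs, ys = set(xs), set(ys)
    if not xs:
        return 0.0
    matched = sum(
        1 for x in xs if any(string_sim(x, y, q) >= threshold for y in ys)
    )
    return matched / len(xs)


def gram_containment(xs, ys, q=2):
    """Cheap coarse measure: share of xs's pooled q-grams also seen in ys."""
    gx = set().union(*(qgrams(x, q) for x in xs)) if xs else set()
    gy = set().union(*(qgrams(y, q) for y in ys)) if ys else set()
    return len(gx & gy) / len(gx) if gx else 0.0


def two_stage(xs, ys, coarse_min=0.3, threshold=0.6):
    """Run the cheap measure first; only survivors pay for the pairwise one."""
    if gram_containment(xs, ys) < coarse_min:
        return 0.0
    return fuzzy_overlap(xs, ys, threshold)


if __name__ == "__main__":
    a = ["Brisbane", "Sydney", "Melbourne"]
    b = ["brisbane ", "Sydny", "Perth"]  # same column, but 'dirty'
    print(exact_overlap(a, b), fuzzy_overlap(a, b), two_stage(a, b))
```

On dirty samples like `a` and `b`, exact matching finds no shared values, while the fuzzy measure still pairs the case-variant and misspelled entries. The coarse stage is linear in the sample size, whereas the pairwise stage is quadratic, which is one way to picture the accuracy versus speed tradeoff the abstract mentions.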

Created: Thu, 18 Nov 2010, 01:58:20 EST by Dr Henning Koehler on behalf of School of Information Technol and Elec Engineering