Scalable relative debugging

Dinh, Minh Ngoc, Abramson, David and Jin, Chao (2014) Scalable relative debugging. IEEE Transactions on Parallel and Distributed Systems, 25(3): 740-749. doi:10.1109/TPDS.2013.86

Author Dinh, Minh Ngoc
Abramson, David
Jin, Chao
Title Scalable relative debugging
Journal name IEEE Transactions on Parallel and Distributed Systems
ISSN 1045-9219
Publication date 2014-01-01
Year available 2013
Sub-type Article (original research)
DOI 10.1109/TPDS.2013.86
Open Access Status Not yet assessed
Volume 25
Issue 3
Start page 740
End page 749
Total pages 10
Place of publication Piscataway, NJ United States
Publisher Institute of Electrical and Electronics Engineers
Language eng
Subject 1708 Hardware and Architecture
1711 Signal Processing
1703 Computational Theory and Mathematics
Abstract Detecting and isolating bugs that arise only at high processor counts is a challenging task. Over a number of years, we have implemented a special debugging method, called 'relative debugging,' that supports debugging applications as they evolve or are ported to larger machines. It allows a user to compare the state of a suspect program against another reference version even as the number of processors is increased. The innovative idea is the comparison of runtime data to reason about the state of the suspect program. While powerful, a naïve implementation of the comparison phase does not scale to large problems running on large machines. In this paper, we propose two different solutions including a hash-based scheme and a direct point-to-point scheme. We demonstrate the implementation, a case study, as well as the performance, of our techniques on 20K cores of a Cray XE6 system.
Keyword Assertion checkers
Distributed debugging
Parallelism and concurrency
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ

Document type: Journal Article
Collections: Official 2014 Collection
School of Information Technology and Electrical Engineering Publications
Citation counts: Thomson Reuters Web of Science: cited 3 times
Scopus: cited 3 times
Created: Tue, 11 Mar 2014, 10:30:58 EST by System User on behalf of Research Computing Centre