Statistical compression of protein folding patterns for inference of recurrent substructural themes

Subramanian, Ramanan, Allison, Lloyd, Stuckey, Peter J., de la Banda, Maria Garcia, Abramson, David, Lesk, Arthur M. and Konagurthu, Arun S. (2017). Statistical compression of protein folding patterns for inference of recurrent substructural themes. In: 2017 Data Compression Conference (DCC). Data Compression Conference (DCC), Snowbird, UT, United States, (340-349). 4-7 April 2017. doi:10.1109/DCC.2017.46

Author Subramanian, Ramanan
Allison, Lloyd
Stuckey, Peter J.
de la Banda, Maria Garcia
Abramson, David
Lesk, Arthur M.
Konagurthu, Arun S.
Title of paper Statistical compression of protein folding patterns for inference of recurrent substructural themes
Conference name Data Compression Conference (DCC)
Conference location Snowbird, UT, United States
Conference dates 4 - 7 April 2017
Convener IEEE
Proceedings title 2017 Data Compression Conference (DCC)
Journal name 2017 Data Compression Conference (DCC)
Series Data Compression Conference Proceedings
Place of Publication Piscataway, NJ, United States
Publisher Institute of Electrical and Electronics Engineers
Publication Year 2017
Sub-type Fully published paper
DOI 10.1109/DCC.2017.46
Open Access Status Not yet assessed
ISBN 9781509067213
ISSN 1068-0314
Volume Part F127767
Start page 340
End page 349
Total pages 10
Language eng
Abstract/Summary Computational analyses of the growing corpus of three-dimensional (3D) structures of proteins have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has been an unanswered computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a comprehensive dictionary using the statistical framework of lossless compression on a database composed of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the most compression of the database. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at
Subjects 1705 Computer Networks and Communications
Keyword Minimum Message Length
Protein structure
Super-secondary structural patterns
Q-Index Code E1
Q-Index Status Provisional Code
Grant ID DP150100894
Institutional Status UQ

Document type: Conference Paper
Sub-type: Fully published paper
Collections: HERDC Pre-Audit
School of Information Technology and Electrical Engineering Publications
Citation counts: TR Web of Science Citation Count: Cited 0 times
Scopus Citation Count: Cited 0 times
Created: Sun, 16 Jul 2017, 02:11:18 EST by System User on behalf of Learning and Research Services (UQ Library)