Machine Architectures for Biological Sequence Classification

John Hawkins (2007). Machine Architectures for Biological Sequence Classification. PhD Thesis, School of Information Technology and Electrical Engineering, The University of Queensland.

       
Author: John Hawkins
Thesis Title: Machine Architectures for Biological Sequence Classification
School, Centre or Institute: School of Information Technology and Electrical Engineering
Institution: The University of Queensland
Publication date: 2007-07
Thesis type: PhD Thesis
Supervisor: Boden, Mikael B.
Subjects: 290000 Engineering and Technology
Formatted abstract
All machine learning techniques contain inherent biases. Certain aspects of the final decision boundary arise from the technique itself rather than from information in the data. Selecting a machine learning technique for a particular problem therefore requires some sensitivity to the nature of the domain, so that the bias of the machine is matched to the problem.
In the field of bioinformatics there is enormous demand for classifier systems that categorise biological molecules on the basis of a linear sequence representation. However, our understanding of the machine architectures that are suitable for these problems is limited to heuristics drawn from other domains and processes of trial and error. Hence, the goal of this thesis is to create a mapping between the features found in biological sequence patterns and the available machine learning architectures.
All current machine learning approaches to classifying biological sequences fall into one of two general classes of machine architecture: fixed-window and recurrent. The fixed-window machines process a pre-specified window of input symbols in a single step. The exact position of a symbol within the window defines the way in which it is processed, and hence contributes to the classification. Recurrent architectures, on the other hand, process a sequence iteratively, using a smaller window that takes sub-segments one by one and uses an internal memory state to maintain a record of what has been processed in previous steps. The dynamical processing allows reuse of parameters and permits each element of the sequence to be processed in the context of what came before.
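The contrast between the two classes can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the nucleotide alphabet, the logistic output, the tanh hidden units, and all function and parameter names (fixed_window_score, recurrent_score, w_in, w_rec, w_out) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random

ALPHABET = "ACGT"  # assumed nucleotide alphabet, for illustration only


def one_hot(symbol):
    """Encode one symbol as a one-hot list over ALPHABET."""
    return [1.0 if symbol == a else 0.0 for a in ALPHABET]


def fixed_window_score(window, weights, bias=0.0):
    """Score a whole window in a single step.

    weights[i][j] applies to symbol j at position i, so the same symbol
    contributes differently depending on where it sits in the window.
    """
    if len(window) != len(weights):
        raise ValueError("window length must match the number of weight rows")
    total = bias
    for position, symbol in enumerate(window):
        x = one_hot(symbol)
        total += sum(w * v for w, v in zip(weights[position], x))
    return 1.0 / (1.0 + math.exp(-total))  # logistic output in (0, 1)


def recurrent_score(sequence, w_in, w_rec, w_out, bias=0.0):
    """Score a sequence of any length, one symbol at a time.

    The same parameters (w_in, w_rec) are reused at every step; the hidden
    state carries a summary of everything processed so far.
    """
    hidden = [0.0] * len(w_rec)
    for symbol in sequence:
        x = one_hot(symbol)
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[k], x))
                + sum(w * h for w, h in zip(w_rec[k], hidden))
            )
            for k in range(len(w_rec))
        ]
    total = bias + sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-total))


def random_matrix(rows, cols, rng):
    """Hypothetical helper: untrained weights drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
WINDOW, HIDDEN = 5, 3  # arbitrary sizes for the example
fw_weights = random_matrix(WINDOW, len(ALPHABET), rng)
w_in = random_matrix(HIDDEN, len(ALPHABET), rng)
w_rec = random_matrix(HIDDEN, HIDDEN, rng)
w_out = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]

print(fixed_window_score("ACGTA", fw_weights))                  # exactly WINDOW symbols
print(recurrent_score("ACGTACGTTTGA", w_in, w_rec, w_out))      # any length
```

In this sketch the fixed-window scorer needs one weight row per window position and rejects inputs of any other length, whereas the recurrent scorer uses the same three parameter sets regardless of sequence length.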
This thesis demonstrates that the biases associated with these general classes of machine architecture are such that they are each best suited to learning a specific, non-equivalent subset of the features present in biological sequences. In particular, the fixed-window architectures are suited to learning fixed-length patterns that contain dependencies between elements at fixed positions within the window. The recurrent architectures are suited to the detection of variable-length patterns, where the dependencies are positioned flexibly along a region of the sequence.
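The two kinds of feature can be made concrete with a small synthetic generator, sketched below in Python. This is a hypothetical illustration, not the thesis's datasets: the alphabet, the function names, and the specific pattern definitions (a forced match between two fixed positions, and two motifs separated by a gap of varying length) are assumptions.

```python
import random
import re

ALPHABET = "ACGT"  # assumed alphabet, for illustration only


def random_sequence(length, rng):
    """Background sequence with symbols drawn uniformly at random."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def with_fixed_position_dependency(length, i, j, rng):
    """Fixed-length feature: the symbol at position j always equals the one at i."""
    seq = list(random_sequence(length, rng))
    seq[j] = seq[i]
    return "".join(seq)


def with_variable_gap_pattern(length, left, right, min_gap, max_gap, rng):
    """Variable-length feature: `left`, a gap of min_gap..max_gap symbols,
    then `right`, placed at a random offset within the sequence."""
    if len(left) + max_gap + len(right) > length:
        raise ValueError("pattern cannot fit in a sequence of this length")
    gap = rng.randint(min_gap, max_gap)
    span = len(left) + gap + len(right)
    start = rng.randrange(0, length - span + 1)
    seq = list(random_sequence(length, rng))
    seq[start:start + len(left)] = left
    right_start = start + len(left) + gap
    seq[right_start:right_start + len(right)] = right
    return "".join(seq)


def contains_variable_gap_pattern(seq, left, right, min_gap, max_gap):
    """Label a sequence: True if left/right occur with an allowed gap anywhere."""
    pattern = re.escape(left) + ".{%d,%d}" % (min_gap, max_gap) + re.escape(right)
    return re.search(pattern, seq) is not None


rng = random.Random(1)
print(with_fixed_position_dependency(12, 2, 9, rng))
print(with_variable_gap_pattern(20, "GAT", "TTC", 1, 6, rng))
```

In the first generator the informative positions are the same in every example, which matches the position-specific processing of a fixed-window machine. In the second, both the offset and the total length of the pattern vary between examples, which is the situation the abstract associates with recurrent architectures.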
Through a range of synthetic and applied case studies, this thesis examines the relative abilities of the two classes of architecture to distinguish between sequences that contain biologically important features. The resulting mapping between biological features and classes of machine architecture will permit bioinformatics researchers to identify the general class of machine that is likely to perform well on a classification problem with a given set of features.


 
Created: Thu, 01 May 2008, 15:28:31 EST by Noela Stallard on behalf of Library - Information Access Service