This thesis develops a set of data classification tools tailored specifically to medical diagnostic applications. In such applications, the data obtained from a patient typically consist of a large number of instances or objects that require analysis. Medical experts analyse such data holistically; that is, they scrutinise the instances or objects as a whole in order to arrive at a diagnosis. However, the traditional approach to automating these diagnostic procedures employs a two-step classification process in which the classifier analyses every instance or object individually prior to a final classification step. This two-step process is prone to two major problems: firstly, the more instances to be analysed, the greater the computational complexity; secondly, the accuracy of the final classification step is highly dependent on the results of the individual instance analysis in the initial step. Therefore, this thesis explores a different approach in which the data are classified as a group in one step. Under this paradigm, not only can problems conventionally addressed by two-step classification be solved more efficiently, but the process is also more faithful to the one-step holistic approach that medical experts normally apply. It is shown that prior work on automated classification is not directly concerned with classifying a group of data in one step. The premise of this thesis is that, given the prior knowledge that a group of instances or objects in a sample belongs to the same, but unknown, class, classification of the group is possible in a single step. This approach is referred to as group-based classification (GBC).
Initially, a GBC technique is developed within a hypothesis-testing framework by reducing a multidimensional classification problem to one dimension using an appropriate statistical summary. The one-dimensional data are then classified using a statistical hypothesis test, specifically an F-test, as a measure of group similarity. On both synthetic and real data sets, the proposed GBC technique outperforms existing two-step classifiers. In fact, in the empirical study, the GBC technique achieves an error rate of zero when the size of the data is large enough. Next, another set of GBC techniques is developed by extending the naive Bayes classifier and the nearest neighbour classifier (and variants) to demonstrate both one- and two-step GBC techniques. The results for the synthetic and real data sets clearly demonstrate that one-step GBC techniques can reduce the error rate in comparison to two-step classifiers. Indeed, the one-step GBC is more effective than the two-step GBC in all data sets tested. The application of GBC to classifying malignancy-associated changes (MACs) data for cervical cancer screening is also demonstrated. The performance of the GBC techniques developed earlier is evaluated against other existing classifiers in terms of accuracy and area under the receiver operating characteristic curve (AUC). An analysis of variance (ANOVA) is then used to test the significance of any differences between the cross-validated estimates of the accuracy and the AUC. The GBC techniques show favourable accuracy and a statistically significant improvement in the AUC compared to other classifiers.
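The distinction between one-step and two-step group classification for a naive Bayes classifier can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes Gaussian class-conditional densities, and all function names (`fit_gaussian_nb`, `classify_group_one_step`, `classify_group_two_step`) are hypothetical. In the one-step variant, the log-likelihoods of all instances in a group are summed before a single decision is made; in the two-step variant, each instance is classified individually and the group label is taken by majority vote.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class feature means, variances, and log-priors (illustrative)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps the variances strictly positive.
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, variances, log_priors

def instance_log_likelihoods(X, means, variances):
    """Return an (n_instances, n_classes) array of log p(x_i | class)."""
    diff = X[:, None, :] - means[None, :, :]
    per_feature = np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    return -0.5 * per_feature.sum(axis=2)

def classify_group_one_step(group, model):
    """One step: sum log-likelihoods over the whole group, then decide once."""
    classes, means, variances, log_priors = model
    scores = log_priors + instance_log_likelihoods(group, means, variances).sum(axis=0)
    return classes[np.argmax(scores)]

def classify_group_two_step(group, model):
    """Two steps: classify each instance, then take a majority vote."""
    classes, means, variances, log_priors = model
    per_instance = np.argmax(
        log_priors + instance_log_likelihoods(group, means, variances), axis=1
    )
    votes = np.bincount(per_instance, minlength=len(classes))
    return classes[np.argmax(votes)]
```

In this sketch, the one-step rule relies on the assumption stated in the thesis premise, namely that every instance in the group shares the same unknown class, which is what justifies pooling the evidence before deciding.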