“Accent” is the pattern of speech pronunciation by which one can identify a person’s linguistic, social or cultural background. It is an important source of inter-speaker variability and a particular problem for automated speech recognition. The aim of the study was to investigate a new computational approach to accent classification which did not require phonemic segmentation or the identification of phonemes as input, and which could therefore be used as a simple, effective accent classifier.
Through a series of structured experiments, this study investigated the effectiveness of Support Vector Machines (SVMs) for speech accent classification using time-based units rather than linguistically informed ones, and compared their accuracy with that of other machine learning methods and with the ability of humans to classify speech according to accent. A corpus of read speech was collected in two accents of English (Arabic and “Indian”) and used as the main data source for the experiments. Mel-frequency cepstral coefficients (MFCCs) were extracted from the speech samples and combined into larger units of 10 to 150 ms duration, which then formed the input data for the various machine learning systems. SVMs were found to classify the samples with up to 97.5% accuracy, with very high precision and recall, using samples of between 1 and 4 seconds of speech. This compared favourably with a human listener study in which subjects distinguished between the two accent groups with an average accuracy of 92.5% in approximately 8 seconds. Repeating the SVM experiments on a different corpus resulted in a best classification accuracy of 84.6%. Experiments using a decision tree learner and a rule-based classifier on the original corpus gave a best accuracy of 95%, but results over the range of conditions were much more variable than those obtained with the SVM. Rule extraction was performed in order to help explain the results and better inform the design of the system.
The new approach was therefore shown to be effective for accent classification, and a plan was developed for its role within various larger speech-related contexts.
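The step of combining frame-level coefficients into larger time-based units can be illustrated with a minimal sketch. It is not the study's implementation: the function name `stack_frames`, the 10 ms frame step, the use of simple concatenation, and the non-overlapping layout of units are all assumptions made for illustration, and the input frames are placeholder numbers rather than real MFCCs.

```python
from typing import List, Sequence


def stack_frames(
    frames: Sequence[Sequence[float]],
    frame_step_ms: int,
    unit_ms: int,
) -> List[List[float]]:
    """Concatenate consecutive frame vectors into fixed-duration units.

    Hypothetical helper: the abstract does not say how frames were combined,
    so plain concatenation of non-overlapping groups is assumed here.
    """
    if frame_step_ms <= 0:
        raise ValueError("frame_step_ms must be positive")
    if unit_ms < frame_step_ms or unit_ms % frame_step_ms != 0:
        raise ValueError("unit_ms must be a positive multiple of frame_step_ms")

    frames_per_unit = unit_ms // frame_step_ms
    units: List[List[float]] = []
    # Step through the frames one whole unit at a time; a trailing group
    # that is too short to fill a unit is dropped.
    for start in range(0, len(frames) - frames_per_unit + 1, frames_per_unit):
        unit: List[float] = []
        for frame in frames[start:start + frames_per_unit]:
            unit.extend(frame)
        units.append(unit)
    return units


# Seven placeholder frames with two coefficients each (assumed 10 ms apart).
example_frames = [[float(i), float(i) + 0.5] for i in range(7)]

# Group them into 30 ms units: three frames per unit, so the seventh is dropped.
example_units = stack_frames(example_frames, frame_step_ms=10, unit_ms=30)
```

Each resulting unit is a single fixed-length vector, which is the form of input a classifier such as an SVM expects; no phoneme boundaries are needed to build it, only the clock time of each frame.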