Cyberbullying is widely regarded as a serious social menace that affects millions of people around the world, especially teenagers and adolescents. The purpose of this research is to investigate data-mining methods, to identify indicators of cyberbullying within written content, and to propose and verify new models that effectively and efficiently detect emerging cyberbullying activities in social networks.
Because cyberbullying is a relatively new phenomenon, it has received little attention from researchers from the perspective of automatic detection. Current studies on cyberbullying detection are largely static in nature and can be summarised into three categories: (i) manual labelling; (ii) keyword-based or rule-based approaches; and (iii) survey-based approaches. However, due to the scarcity of labelled training data, most existing approaches remain impractical in real-world situations. In this dissertation, we aim to (i) identify the open problems and challenges in cyberbullying detection in streaming text; (ii) develop effective solutions to address those problems; and (iii) evaluate the proposed approaches on real-world streaming datasets.
We propose a hybrid approach for cyberbullying detection that also identifies the users involved. The first phase aims to accurately detect cyberbullying messages in social networks. We explore a probabilistic feature-selection method, Probabilistic Latent Semantic Analysis (PLSA), to leverage latent cyberbullying-specific features. In particular, groups of word features generated under the predefined topic (cyberbullying) are selected. The second phase analyses the social network to identify predators and victims by measuring the number of cyberbullying messages each user sends and/or receives. A graph model represents the interactions among users, and a ranking algorithm detects the most influential predators and the most offended victims. The experimental results indicate that the proposed method categorises the most influential 'predators' and the most offended 'victims' effectively. Further, the proposed graph model provides insight into users' activity as predators or victims, from which conclusions are drawn as to whether an identified case should be classified as cyber-aggression, cyberbullying, or online harassment.
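As an illustration of the second phase, the ranking step can be sketched as a weighted in-/out-degree computation over the interaction graph. This is a minimal sketch: the user names and the degree-based score are assumptions for illustration, and the actual ranking algorithm may use a more sophisticated influence measure.

```python
from collections import defaultdict

def rank_users(messages):
    """Rank users by detected cyberbullying messages sent (predator score)
    and received (victim score) in a directed interaction graph.

    `messages` is a list of (sender, receiver) pairs, one per detected
    cyberbullying message; repeated pairs add edge weight."""
    sent = defaultdict(int)      # weighted out-degree per user
    received = defaultdict(int)  # weighted in-degree per user
    for sender, receiver in messages:
        sent[sender] += 1
        received[receiver] += 1
    predators = sorted(sent, key=sent.get, reverse=True)
    victims = sorted(received, key=received.get, reverse=True)
    return predators, victims

# Hypothetical detected messages: u1 sends three, u2 receives three.
msgs = [("u1", "u2"), ("u1", "u3"), ("u4", "u2"), ("u1", "u2")]
predators, victims = rank_users(msgs)
print(predators[0], victims[0])  # u1 u2
```

A degree-based rank is the simplest choice; an eigenvector-style score (e.g. PageRank over the weighted graph) would additionally account for who the messages involve.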
Further, we investigate the problem from a semi-supervised learning perspective. Current studies on cyberbullying detection mainly assume that the streaming text can be fully labelled. We propose a session-based framework for the automatic detection of cyberbullying from substantial amounts of unlabelled streaming text. Given that streaming data from social networks arrives at the server system in large volumes, we incorporate an ensemble of one-class classifiers into the session-based framework. The proposed framework addresses the real-world scenario in which only a small set of positive instances is available for initial training. Our main contribution is the automatic detection of cyberbullying in a real-world situation where labelled data is not readily available. We also study the automatic extraction and enlargement of training datasets from the given unlabelled streaming text with respect to a set of given keywords. Experimental results show that the proposed approach is effective for the automatic detection of cyberbullying on social networks.
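The idea of an ensemble of one-class classifiers trained only on positive instances can be sketched as follows. The toy centroid classifier, the cosine threshold, and the majority vote are illustrative assumptions, not the exact framework proposed in the dissertation.

```python
import random
from collections import Counter

def tokens(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class OneClassCentroid:
    """Toy one-class classifier: flags a message as cyberbullying when it
    is close enough to the centroid of the positive training instances."""
    def __init__(self, positives, threshold=0.2):
        self.centroid = Counter()
        for text in positives:
            self.centroid.update(tokens(text))
        self.threshold = threshold

    def predict(self, text):
        return cosine(tokens(text), self.centroid) >= self.threshold

def ensemble_predict(classifiers, text):
    """Majority vote over the one-class ensemble."""
    votes = sum(c.predict(text) for c in classifiers)
    return votes > len(classifiers) / 2

# Hypothetical small positive set; no negative instances are needed.
positives = ["you are stupid", "nobody likes you loser", "stupid loser go away"]
random.seed(0)
ensemble = [OneClassCentroid(random.sample(positives, 2)) for _ in range(5)]
print(ensemble_predict(ensemble, "you stupid loser"))   # True
print(ensemble_predict(ensemble, "great game tonight")) # False
```

Training each member on a different bootstrap of the small positive set gives the ensemble some diversity; confidently classified streaming instances could then be fed back to enlarge the training set, as the framework describes.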
Next, we move one step further and investigate how to automatically and precisely define cyberbullying within written content for detection purposes. Using a questionnaire survey as the analysis tool, we study the subtle differences between the motives of the sender of messages related to cyberbullying and how those messages are perceived by the recipient. Three categories of cyberbullying are introduced: 'direct', 'indirect', and 'misinterpreted' forms. We then validate the key indicators associated with these three forms of cyberbullying, based on users' online experiences and the frequency with which they received and/or sent cyberbullying messages.
Finally, we propose a cyberbullying detection model that handles substantial amounts of unlabelled streaming data when only a very small set of positive and negative instances is available for training. An augmented training method, based on a confidence voting function, is proposed to extract and enlarge the training set. We also propose cyberbullying detection models that generate enriched feature sets, such as linguistic features, user information, and keyword searches. Further, because streaming text generated by social networks is highly uncertain and unstable, an evolving fuzzy SVM approach is proposed. To handle the complex and multidimensional data streams generated in social networks, a higher membership value is assigned to input instances that have a higher influence on the decision surface. Evaluation across different experimental scenarios shows the superiority of the proposed approaches over all baseline methods.
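The fuzzy-SVM weighting idea can be illustrated with the classical centre-distance membership scheme, in which each training instance receives a weight in (0, 1] that discounts outliers. This is a sketch over assumed data; note that the dissertation instead assigns higher membership to instances with higher influence on the decision surface, so the scheme below is analogous rather than identical.

```python
def fuzzy_memberships(points, delta=1e-6):
    """Classical fuzzy-SVM weighting: membership s_i = 1 - d_i / (r + delta),
    where d_i is the distance of point i from the class centre and r is the
    class radius. Points near the centre get high membership; outliers get
    membership close to zero, reducing their pull on the decision surface."""
    dim = len(points[0])
    centre = [sum(p[i] for p in points) / len(points) for i in range(dim)]

    def dist(p):
        return sum((p[i] - centre[i]) ** 2 for i in range(dim)) ** 0.5

    radius = max(dist(p) for p in points)
    return [1 - dist(p) / (radius + delta) for p in points]

# Hypothetical 2-D instances; the last point is an outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
ms = fuzzy_memberships(pts)
print(ms)  # the outlier receives the lowest membership
```

In a fuzzy SVM these memberships scale each instance's slack penalty, so noisy or uncertain streaming instances contribute less to the fitted decision surface.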