CE807 - Assignment 1 - Interim Practical Text Analytics and Report
This paper presents a critical discussion of text classification and hate speech detection. Text
classification has been a widely researched topic in natural language processing, with
applications ranging from document categorization to sentiment analysis. In recent years, the
detection of hate speech has become a pressing issue, given the increase in online hate speech
and its harmful impact. The paper provides an overview of the techniques used for text
classification, including supervised and unsupervised learning methods. It then focuses on the
specific methods used for hate speech detection, including rule-based approaches and machine
learning techniques. The paper also presents a detailed analysis of some of the most prominent
datasets used for hate speech detection and discusses the ethical considerations and challenges in
this field. Finally, the paper highlights the need for further research and outlines possible
future directions, with the aim of contributing to ongoing work in this area.
4. Summary
Text classification is an important NLP task with numerous real-world applications.
Several basic methods are in common use, each with its own advantages and
disadvantages. Rule-based classification is easy to interpret but requires a high degree of domain
expertise, while Naive Bayes is computationally efficient but may suffer from data sparsity.
Support vector machines (SVMs) can handle high-dimensional data and have a strong theoretical
foundation, but they can be
computationally expensive. Deep learning models can achieve state-of-the-art performance but
require large amounts of labeled data and can be computationally expensive to train. The choice
of method depends on the specific problem and available resources.
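
The trade-off between the first two methods can be sketched in a few lines of Python. The following is a minimal illustration, not a reference implementation: the training sentences, the keyword list, and the names `rule_based_predict` and `NaiveBayes` are all hypothetical and were invented for this example. It contrasts a keyword rule with a multinomial Naive Bayes classifier that uses add-one (Laplace) smoothing, a common remedy for the data sparsity problem mentioned above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (text, label) pairs invented for illustration only.
TRAIN = [
    ("i hate you people", "offensive"),
    ("you are all idiots", "offensive"),
    ("get lost you idiots", "offensive"),
    ("have a lovely day", "neutral"),
    ("thank you for the help", "neutral"),
    ("what a great match today", "neutral"),
]

# Hand-written keyword list: easy to inspect, but someone has to curate it.
OFFENSIVE_KEYWORDS = {"hate", "idiots"}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def rule_based_predict(text):
    """Flag a text as offensive if it contains any listed keyword."""
    return "offensive" if OFFENSIVE_KEYWORDS & set(tokenize(text)) else "neutral"


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.class_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        for text, label in examples:
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                # Adding 1 keeps unseen words from zeroing out the probability.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
for sample in ["you idiots", "lovely match today", "you morons"]:
    print(sample, "->", rule_based_predict(sample), "/", model.predict(sample))
```

On this toy data the two classifiers disagree on "you morons". The rule misses it because "morons" is not in the keyword list. Naive Bayes flags it only because "you" happens to occur more often in the offensive examples. Both failures reflect the limitations noted above: rules depend on curated domain knowledge, and Naive Bayes estimates become brittle when training data are sparse.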
Various techniques have been proposed for text classification, including supervised,
unsupervised, semi-supervised, rule-based, and deep learning methods. The effectiveness of
these methods depends on the quality and quantity of labeled data, the available domain knowledge, and
the complexity of the text classification task.
Several fields have contributed to text classification methods for detecting offensive language
and hate speech, with machine learning and deep learning among the most popular and effective
approaches. More advanced techniques such as dynamic dataset generation, feature engineering,
and title-based indexing can further improve classifier performance.