CE807 - Assignment 1 - Interim Practical Text Analytics and Report

Abstract

This paper presents a critical discussion on text classification and hate speech detection. Text
classification has been a widely researched topic in natural language processing, with
applications ranging from document categorization to sentiment analysis. In recent years, the
detection of hate speech has become a pressing issue, given the increase in online hate speech
and its harmful impact. The paper provides an overview of the techniques used for text
classification, including supervised and unsupervised learning methods. It then focuses on the
specific methods used for hate speech detection, including rule-based approaches and machine
learning techniques. The paper also presents a detailed analysis of some of the most prominent
datasets used for hate speech detection and discusses the ethical considerations and challenges in
this field. Finally, the paper highlights the need for further research and outlines possible
future directions, with the aim of contributing to ongoing work on text classification and hate
speech detection.

1. Review of Generic Text Classification Methods (Task 1)


Text classification is a vital natural language processing (NLP) task, aimed at assigning
predefined categories to text documents automatically. It is an essential component of several
NLP applications such as sentiment analysis, spam filtering, news classification, and hate speech
detection. In this critical discussion, we will review and compare various methods of text
classification, specifically for hate speech detection.
Traditional text classification techniques are based on machine learning algorithms, which may
be supervised, unsupervised, or semi-supervised. Stamatatos et al. [1] proposed an automatic
text categorization method that uses supervised machine learning algorithms to classify text
documents in terms of genre and author. The proposed method achieved high accuracy,
demonstrating the effectiveness of supervised learning techniques in text classification.
However, supervised learning algorithms require labeled training data, which can be time-
consuming and expensive to collect and label.
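As a concrete illustration of this supervised setting, the sketch below trains a simple text classifier with scikit-learn; the tiny labelled examples and label names are invented placeholders rather than data from any of the cited studies.

```python
# Minimal supervised text classification sketch (scikit-learn).
# The toy texts and labels below are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "I really enjoyed this film",
    "What a waste of time",
    "Brilliant acting and a great story",
    "Terrible plot and poor dialogue",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features weighted by TF-IDF, fed into a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

print(clf.predict(["the story was great", "what a terrible film"]))
```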
To address this limitation, unsupervised learning algorithms have been employed in text
classification. Moreno and Redondo [2] reviewed various unsupervised learning techniques used
for text classification, including clustering algorithms and latent semantic analysis. They
highlighted the effectiveness of these methods in automatically identifying and grouping similar
text documents. However, unsupervised learning algorithms generally do not match the precision
and accuracy of supervised learning algorithms.
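A minimal sketch of this unsupervised route, assuming TF-IDF features reduced by truncated SVD (a standard way of computing latent semantic analysis) and then grouped with k-means; the documents and the choice of two clusters are arbitrary placeholders.

```python
# Unsupervised grouping of documents: LSA (truncated SVD) + k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = [
    "stock markets fell sharply on monday",
    "the central bank raised interest rates",
    "the striker scored twice in the final",
    "the team won the championship match",
]

lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),   # latent semantic analysis
)
reduced = lsa.fit_transform(docs)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(clusters)  # e.g. [0 0 1 1] -- finance vs. sport, found without any labels
```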
Another approach to text classification is semi-supervised learning, which combines supervised
and unsupervised learning techniques. Casamayor et al. [3] proposed a semi-supervised learning
approach for the identification of non-functional requirements in textual specifications. They
used a combination of supervised and unsupervised techniques to classify the requirements and
achieved better results than either of the two techniques alone. However, semi-supervised
learning algorithms also require labeled data for training, and their effectiveness depends on the
quality and quantity of labeled data.
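The semi-supervised idea can be sketched with scikit-learn's self-training wrapper, which pseudo-labels unlabelled examples (marked with -1) using a base classifier; the texts and the confidence threshold below are illustrative placeholders.

```python
# Semi-supervised text classification via self-training (scikit-learn).
# Unlabelled examples carry the label -1 and are pseudo-labelled during training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = [
    "free prize, click this link now",      # labelled spam
    "meeting moved to 3pm tomorrow",        # labelled ham
    "claim your reward by clicking here",   # unlabelled
    "agenda attached for tomorrow's call",  # unlabelled
]
labels = [1, 0, -1, -1]  # 1 = spam, 0 = ham, -1 = unlabelled

X = TfidfVectorizer().fit_transform(texts)
base = LogisticRegression(max_iter=1000)
semi = SelfTrainingClassifier(base, threshold=0.6)
semi.fit(X, labels)

print(semi.predict(X[2:]))  # predictions for the originally unlabelled texts
```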
In recent years, deep learning algorithms, especially neural networks and pre-trained
transformers, have shown impressive results in text classification, while carefully engineered
feature-based classifiers remain strong baselines for hate speech detection. Malmasi and
Zampieri [4] studied hate speech detection in social media using character and word n-gram
features with linear classifiers, achieving competitive accuracy in distinguishing hate speech
from merely offensive language. Similarly, Waseem and Hovy [5] examined hate speech detection
on Twitter and showed that character n-grams, combined with extra-linguistic features such as
user gender and location, are predictive features for this task.
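In the spirit of the feature-based baselines above, the following sketch combines character n-gram features with a linear SVM; the toy examples and labels are placeholders, and the exact feature sets used in [4, 5] differ.

```python
# Character n-gram features with a linear SVM -- a common strong baseline
# for offensive/hate speech detection. Toy examples are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

texts = [
    "you are all wonderful people",
    "I hope you have a great day",
    "you people are disgusting and worthless",
    "get out of our country",
]
labels = [0, 0, 1, 1]  # 0 = not offensive, 1 = offensive (placeholder labels)

baseline = Pipeline([
    # character n-grams (2-4) are robust to misspellings and deliberate obfuscation
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("svm", LinearSVC()),
])
baseline.fit(texts, labels)
print(baseline.predict(["have a wonderful day", "you are worthless"]))
```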
Warner and Hirschberg [6] proposed a rule-based approach for detecting hate speech on the
World Wide Web. They developed a set of rules based on linguistic features, such as profanity,
slurs, and derogatory terms, to identify hate speech. The proposed approach achieved reasonable
results, but it requires a thorough understanding of the language and domain knowledge.
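A toy version of the rule-based idea: a small hand-written lexicon is matched against each message with a regular expression. The word list is a deliberately mild placeholder and far simpler than the linguistically informed rules in [6].

```python
# Toy rule-based detector: flag a text if it matches any lexicon entry.
# The lexicon below is a deliberately mild placeholder.
import re

LEXICON = ["idiot", "moron", "scum"]          # placeholder term list
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b",
                     re.IGNORECASE)

def is_flagged(text: str) -> bool:
    """Return True if the text contains any term from the lexicon."""
    return PATTERN.search(text) is not None

print(is_flagged("what an idiot"))    # True
print(is_flagged("have a nice day"))  # False
```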
Recently, transfer learning has gained popularity in NLP tasks, including text classification.
Transfer learning enables the use of pre-trained models to solve a new task with limited labeled
data. Caselli et al. [8] proposed HateBERT, a model based on the pre-trained BERT
(Bidirectional Encoder Representations from Transformers) model, which is specifically
designed for abusive language detection in English. HateBERT outperformed the general-purpose
BERT model on several abusive language and hate speech detection benchmarks.
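The transfer-learning recipe can be sketched with the Hugging Face transformers library: load a pre-trained checkpoint, attach a fresh classification head, and fine-tune it on labelled data. The sketch assumes the released HateBERT checkpoint is available on the Hugging Face Hub as GroNLP/hateBERT; any BERT-style checkpoint such as bert-base-uncased could be substituted, and the two example texts are placeholders.

```python
# Transfer learning sketch: fine-tune a pre-trained BERT-style model for
# binary offensive-language classification (Hugging Face transformers + PyTorch).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "GroNLP/hateBERT" is the released HateBERT checkpoint; swap in
# "bert-base-uncased" if it is not available in your environment.
checkpoint = "GroNLP/hateBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["have a lovely day", "you are all worthless"]   # placeholder examples
labels = torch.tensor([0, 1])                            # 0 = not offensive, 1 = offensive

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One illustrative optimisation step; a real run iterates over a labelled dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))
```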
Finally, Vidgen et al. [7] proposed a dynamic dataset generation method for online hate
detection. Over several rounds, human annotators write adversarial examples designed to fool the
current model, and these examples are added to the training data, which helps improve the
performance and robustness of hate speech detection models.
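The skeleton below mimics that human-and-model-in-the-loop process; every helper and example string is a hypothetical stand-in, since the real procedure relies on trained annotators rather than code.

```python
# Skeleton of a human-and-model-in-the-loop data collection loop.
# All helpers are hypothetical stand-ins for the work done by annotators.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_model(texts, labels):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(texts, labels)

def collect_adversarial_examples(model, round_id):
    # Stand-in for annotators writing content that the current model misclassifies.
    candidates = [(f"tricky example {round_id}a", 1), (f"benign example {round_id}b", 0)]
    return [(t, y) for t, y in candidates if model.predict([t])[0] != y]

texts, labels = ["you are awful people", "what a lovely morning"], [1, 0]
for round_id in range(1, 4):                    # e.g. three collection rounds
    model = train_model(texts, labels)
    new = collect_adversarial_examples(model, round_id)
    texts += [t for t, _ in new]
    labels += [y for _, y in new]
print(len(texts), "examples after dynamic collection")
```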
1.1 Critical Discussion (Task 1)
Text classification is a critical task in natural language processing (NLP) and is used in many
applications, including sentiment analysis, spam filtering, news classification, and hate speech
detection. The traditional approach to text classification involves machine learning algorithms,
such as supervised, unsupervised, and semi-supervised methods. Stamatatos et al. [1] proposed a
supervised learning method that achieved high accuracy in text categorization but requires
labeled training data, which can be expensive and time-consuming to collect. Unsupervised
learning techniques, including clustering algorithms and latent semantic analysis, have been used
for text classification, but their precision and accuracy are lower than supervised learning
algorithms [2]. Semi-supervised learning combines supervised and unsupervised techniques,
achieving better results than either alone [3]. Feature-based classifiers built on character and
word n-grams have proved strong for hate speech detection [4, 5], while deep learning models,
particularly pre-trained transformers, have recently shown impressive results in text
classification. Rule-based approaches, such as Warner and
Hirschberg's [6] linguistic feature-based approach, require domain knowledge but can achieve
reasonable results. Transfer learning, such as the HateBERT model [8], enables the use of pre-
trained models to solve a new task with limited labeled data. Finally, Vidgen et al. [7] proposed a
dynamic dataset generation method to improve the performance of hate speech detection models.
The effectiveness of these techniques depends on the quality and quantity of labeled data, the
domain knowledge, and the complexity of the text classification task.

2. Review of Offensive Language Detection Methods (Task 2)


Text classification for offensive language and hate speech detection has become increasingly
important with the rise of social media and online communication. Various fields of research
have contributed to the development of text classification methods, including computational
linguistics [1], text analytics [2], and machine learning [4]. Supervised machine learning is a
popular approach that involves training a classifier on a labeled dataset of offensive or hate
speech examples, and then using the classifier to identify such language in new, unlabeled texts
[4]. Semi-supervised learning combines both labeled and unlabeled data to train a classifier, and
is useful when there is a limited amount of labeled data available [3]. Deep learning involves
training neural networks with multiple layers to identify patterns and features in the data, and has
been shown to be effective in detecting offensive language and hate speech [8]. Dynamic dataset
generation involves generating new training data dynamically to improve the performance of the
classifier [7]. Feature engineering involves identifying and selecting specific features in the text
that are indicative of offensive or hate speech, such as certain words or phrases, and is often used
in conjunction with machine learning approaches [5]. Title-based semantic subject indexing
involves using the titles of documents to index and classify them, rather than the full text, and
has been shown to be effective in achieving competitive performance compared to full-text
classification [10].
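As a small illustration of feature engineering in this setting, the sketch below concatenates word n-gram and character n-gram features before training a classifier; the examples and labels are placeholders and the feature set is a simplification of those used in the cited papers.

```python
# Feature engineering sketch: combine word and character n-gram features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

texts = [
    "you people make me sick",
    "thanks, that was really helpful",
    "nobody wants your kind here",
    "great to see everyone today",
]
labels = [1, 0, 1, 0]  # 1 = offensive, 0 = not offensive (placeholder labels)

features = FeatureUnion([
    ("words", TfidfVectorizer(ngram_range=(1, 2))),                       # word uni/bigrams
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),   # char n-grams
])

clf = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
clf.fit(texts, labels)
print(clf.predict(["nobody wants you here", "thanks for the help"]))
```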
2.1 Critical Discussion (Task 2)
The study discusses the importance of text classification for detecting offensive language and
hate speech in social media and online communication. It highlights various fields that have
contributed to the development of text classification methods, including computational
linguistics, text analytics, and machine learning. The review above covers six methods used for
text classification: supervised learning, semi-supervised learning, deep learning, dynamic
dataset generation, feature engineering, and title-based semantic subject indexing. The cited
papers provide further detail on the use of machine learning and deep learning for hate speech
detection, on combining feature engineering with machine learning, and on human-and-model-in-the-
loop data collection for dynamic dataset generation [7].

3. OLID Dataset Characterization (Task 3)


Who made & collected the data? Are they alright with you using it?
The SemEval 2019 dataset for the "OffensEval: Identifying and Categorizing Offensive
Language in Social Media" shared task was made and collected by Zampieri et al. [4]. The
authors have made the dataset publicly available for research purposes, and the usage of the
dataset is permitted as long as the authors and the original source are properly cited and
acknowledged [4].
What is in the data? Is it what you need for your work?
The data in the SemEval 2019 (OLID) dataset consists of Twitter posts annotated with a
hierarchical three-level scheme: whether a post is offensive (subtask A), whether the offence is
targeted (subtask B), and the target of the offence, i.e. an individual, a group, or another
entity (subtask C) [4]. The dataset contains roughly 14,100 tweets in total, of which about
13,240 are used for training and 860 for testing [4].
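To make this description concrete, the snippet below shows one way to load the OLID training split with pandas, assuming the tab-separated distribution with columns id, tweet, subtask_a, subtask_b and subtask_c; the file path should be adjusted to a local copy.

```python
# Loading the OLID training split (tab-separated). File name and column names
# follow the public OLID distribution; adjust the path to your own copy.
import pandas as pd

df = pd.read_csv("olid-training-v1.0.tsv", sep="\t")

texts = df["tweet"].tolist()
labels_a = df["subtask_a"].tolist()   # OFF / NOT (offensive or not)

print(len(texts), "training tweets")
print(df["subtask_a"].value_counts())
```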
Where was this data produced? Where is it now?
The data was collected from Twitter and annotated as part of a shared task challenge designed to
advance the field of offensive language detection in social media [4]. The authors aimed to
encourage the development of better computational models for the detection and categorization of
offensive language, with potential applications in content moderation and online safety [4]. The
dataset remains publicly available for research use.
Why was it produced? Can you trust it?
The OLID dataset was produced for the SemEval 2019 Task 6 (OffensEval) competition, which aimed
to advance the state of the art in detecting and categorizing offensive and abusive language in
social media. The dataset was specifically designed to enable the development and evaluation of
computational models for offensive language identification.
The dataset was produced by a team of researchers led by Marcos Zampieri at the University of
Wolverhampton, UK, and was made publicly available for use in research and development of
hate speech detection systems.
The dataset was created by collecting tweets that were manually annotated by human annotators
following a shared set of guidelines, which supports its quality and its use for research and
development purposes. The annotators labelled whether each tweet was offensive, whether the
offence was targeted, and the target of the offence, such as an individual, a group, or another
entity.
When was it produced? What’s happened to it since?
The dataset was produced in 2019 and is now publicly available for use in research [4]. Since its
release, the dataset has been widely used by researchers to evaluate and compare different
offensive language detection models [4]. The dataset has also been used as a benchmark for the
development of new models, including those using deep learning techniques [8]. Overall, the
SemEval 2019 dataset is a well-established and trusted resource for research on offensive
language detection in social media.

4. Summary
Text classification is an important NLP task with numerous real-world applications.
There are several basic methods used in text classification, each with its own advantages and
disadvantages. Rule-based classification is easy to interpret but requires a high degree of domain
expertise, while Naive Bayes is computationally efficient but may suffer from data sparsity.
SVM can handle high-dimensional data and has a strong theoretical foundation, but it can be
computationally expensive. Deep learning models can achieve state-of-the-art performance but
require large amounts of labeled data and can be computationally expensive to train. The choice
of method depends on the specific problem and available resources.
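To illustrate the Naive Bayes versus SVM trade-off mentioned above, the sketch below cross-validates both classifiers on identical TF-IDF features; the repeated toy texts are placeholders, and a real comparison would use a proper labelled corpus.

```python
# Quick comparison of Naive Bayes and a linear SVM on identical TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "win a free phone now", "cheap loans available today", "claim your prize",
    "lunch at noon tomorrow?", "minutes from yesterday's meeting", "see you at the gym",
] * 5   # repeat the toy examples so 3-fold cross-validation has enough data
labels = [1, 1, 1, 0, 0, 0] * 5

for name, model in [("NaiveBayes", MultinomialNB()), ("LinearSVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(name, scores.mean())
```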
Various techniques have been proposed for text classification, including supervised,
unsupervised, semi-supervised, rule-based, and deep learning methods. The effectiveness of
these methods depends on the quality and quantity of labeled data, the domain knowledge, and
the complexity of the text classification task.
Various fields have contributed to the development of text classification methods for detecting
offensive language and hate speech, with machine learning and deep learning being among the
most popular and effective approaches. Dynamic dataset generation, feature engineering, and
title-based indexing are also advanced methods that can improve the performance of classifiers.
5. References

[1]. Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.
[2]. Moreno, A., & Redondo, T. (2016). Text analytics: The convergence of big data and artificial intelligence. IJIMAI, 3(6), 57-64.
[3]. Casamayor, A., Godoy, D., & Campo, M. (2010). Identification of non-functional requirements in textual specifications: A semi-supervised learning approach. Information and Software Technology, 52(4), 436-445.
[4]. Malmasi, S., & Zampieri, M. (2017). Detecting hate speech in social media. arXiv preprint arXiv:1712.06427.
[5]. Waseem, Z., & Hovy, D. (2016, June). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop (pp. 88-93).
[6]. Warner, W., & Hirschberg, J. (2012, June). Detecting hate speech on the World Wide Web. In Proceedings of the Second Workshop on Language in Social Media (pp. 19-26).
[7]. Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2020). Learning from the worst: Dynamically generated datasets to improve online hate detection. arXiv preprint arXiv:2012.15761.
[8]. Caselli, T., Basile, V., Mitrović, J., & Granitzer, M. (2020). HateBERT: Retraining BERT for abusive language detection in English. arXiv preprint arXiv:2010.12472.
[9]. Galke, L., Mai, F., Schelten, A., Brunsch, D., & Scherp, A. (2017, December). Using titles vs. full-text as source for automated semantic document annotation. In Proceedings of the Knowledge Capture Conference (pp. 1-4).
[10]. Mai, F., Galke, L., & Scherp, A. (2018, May). Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (pp. 169-178).
