CE807 - Assignment 1 - Interim Practical Text Analytics and Report

Abstract

This paper presents a critical discussion on text classification and hate speech detection. Text
classification has been a widely researched topic in natural language processing, with
applications ranging from document categorization to sentiment analysis. In recent years, the
detection of hate speech has become a pressing issue, given the increase in online hate speech
and its harmful impact. The paper provides an overview of the techniques used for text
classification, including supervised and unsupervised learning methods. It then focuses on the
specific methods used for hate speech detection, including rule-based approaches and machine
learning techniques. The paper also presents a detailed analysis of some of the most prominent
datasets used for hate speech detection and discusses the ethical considerations and challenges in
this field. Finally, the paper highlights the need for further research and outlines possible
future directions, with the aim of contributing to ongoing work on text classification and hate
speech detection.

1. Review of Generic Text Classification Methods (Task 1)


Text classification is a vital natural language processing (NLP) task, aimed at assigning
predefined categories to text documents automatically. It is an essential component of several
NLP applications such as sentiment analysis, spam filtering, news classification, and hate speech
detection. In this critical discussion, we will review and compare various methods of text
classification, specifically for hate speech detection.
Traditional text classification techniques are based on machine learning algorithms, which may
be supervised, unsupervised, or semi-supervised. Stamatatos et al. [1] proposed an automatic
text categorization method that uses supervised machine learning algorithms to classify text
documents in terms of genre and author. The proposed method achieved high accuracy,
demonstrating the effectiveness of supervised learning techniques in text classification.
However, supervised learning algorithms require labeled training data, which can be time-
consuming and expensive to collect and label.
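As a concrete illustration of this supervised setting, the sketch below trains a simple text classifier with scikit-learn; the tiny labelled examples and label names are invented placeholders rather than data from any of the cited studies.

```python
# Minimal supervised text classification sketch (scikit-learn).
# The toy texts and labels below are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "I really enjoyed this film",
    "What a waste of time",
    "Brilliant acting and a great story",
    "Terrible plot and poor dialogue",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features weighted by TF-IDF, fed into a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

print(clf.predict(["the story was great", "what a terrible film"]))
```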
To address this limitation, unsupervised learning algorithms have been employed in text
classification. Moreno and Redondo [2] reviewed various unsupervised learning techniques used
for text classification, including clustering algorithms and latent semantic analysis. They
highlighted the effectiveness of these methods in automatically identifying and grouping similar
text documents. However, unsupervised learning algorithms generally do not match the precision
and accuracy of supervised learning algorithms.
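A minimal sketch of this unsupervised route, assuming TF-IDF features reduced by truncated SVD (a standard way of computing latent semantic analysis) and then grouped with k-means; the documents and the choice of two clusters are arbitrary placeholders.

```python
# Unsupervised grouping of documents: LSA (truncated SVD) + k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = [
    "stock markets fell sharply on monday",
    "the central bank raised interest rates",
    "the striker scored twice in the final",
    "the team won the championship match",
]

lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),   # latent semantic analysis
)
reduced = lsa.fit_transform(docs)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(clusters)  # e.g. [0 0 1 1] -- finance vs. sport, found without any labels
```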
Another approach to text classification is semi-supervised learning, which combines supervised
and unsupervised learning techniques. Casamayor et al. [3] proposed a semi-supervised learning
approach for the identification of non-functional requirements in textual specifications. They
used a combination of supervised and unsupervised techniques to classify the requirements and
achieved better results than either of the two techniques alone. However, semi-supervised
learning algorithms also require labeled data for training, and their effectiveness depends on the
quality and quantity of labeled data.
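The semi-supervised idea can be sketched with scikit-learn's self-training wrapper, which pseudo-labels unlabelled examples (marked with -1) using a base classifier; the texts and the confidence threshold below are illustrative placeholders.

```python
# Semi-supervised text classification via self-training (scikit-learn).
# Unlabelled examples carry the label -1 and are pseudo-labelled during training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = [
    "free prize, click this link now",      # labelled spam
    "meeting moved to 3pm tomorrow",        # labelled ham
    "claim your reward by clicking here",   # unlabelled
    "agenda attached for tomorrow's call",  # unlabelled
]
labels = [1, 0, -1, -1]  # 1 = spam, 0 = ham, -1 = unlabelled

X = TfidfVectorizer().fit_transform(texts)
base = LogisticRegression(max_iter=1000)
semi = SelfTrainingClassifier(base, threshold=0.6)
semi.fit(X, labels)

print(semi.predict(X[2:]))  # predictions for the originally unlabelled texts
```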
In recent years, deep learning algorithms, especially neural networks and pre-trained
transformers, have shown impressive results in text classification, while carefully engineered
feature-based classifiers remain strong baselines for hate speech detection. Malmasi and
Zampieri [4] studied hate speech detection in social media using character and word n-gram
features with linear classifiers, achieving competitive accuracy in distinguishing hate speech
from merely offensive language. Similarly, Waseem and Hovy [5] examined hate speech detection
on Twitter and showed that character n-grams, combined with extra-linguistic features such as
user gender and location, are predictive features for this task.
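In the spirit of the feature-based baselines above, the following sketch combines character n-gram features with a linear SVM; the toy examples and labels are placeholders, and the exact feature sets used in [4, 5] differ.

```python
# Character n-gram features with a linear SVM -- a common strong baseline
# for offensive/hate speech detection. Toy examples are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

texts = [
    "you are all wonderful people",
    "I hope you have a great day",
    "you people are disgusting and worthless",
    "get out of our country",
]
labels = [0, 0, 1, 1]  # 0 = not offensive, 1 = offensive (placeholder labels)

baseline = Pipeline([
    # character n-grams (2-4) are robust to misspellings and deliberate obfuscation
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("svm", LinearSVC()),
])
baseline.fit(texts, labels)
print(baseline.predict(["have a wonderful day", "you are worthless"]))
```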
Warner and Hirschberg [6] proposed a rule-based approach for detecting hate speech on the
World Wide Web. They developed a set of rules based on linguistic features, such as profanity,
slurs, and derogatory terms, to identify hate speech. The proposed approach achieved reasonable
results, but it requires a thorough understanding of the language and domain knowledge.
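A toy version of the rule-based idea: a small hand-written lexicon is matched against each message with a regular expression. The word list is a deliberately mild placeholder and far simpler than the linguistically informed rules in [6].

```python
# Toy rule-based detector: flag a text if it matches any lexicon entry.
# The lexicon below is a deliberately mild placeholder.
import re

LEXICON = ["idiot", "moron", "scum"]          # placeholder term list
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b",
                     re.IGNORECASE)

def is_flagged(text: str) -> bool:
    """Return True if the text contains any term from the lexicon."""
    return PATTERN.search(text) is not None

print(is_flagged("what an idiot"))    # True
print(is_flagged("have a nice day"))  # False
```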
Recently, transfer learning has gained popularity in NLP tasks, including text classification.
Transfer learning enables the use of pre-trained models to solve a new task with limited labeled
data. Caselli et al. [8] proposed HateBERT, a model based on the pre-trained BERT
(Bidirectional Encoder Representations from Transformers) model, which is specifically
designed for abusive language detection in English. HateBERT outperformed the general-purpose
BERT model on several abusive language and hate speech detection benchmarks.
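The transfer-learning recipe can be sketched with the Hugging Face transformers library: load a pre-trained checkpoint, attach a fresh classification head, and fine-tune it on labelled data. The sketch assumes the released HateBERT checkpoint is available on the Hugging Face Hub as GroNLP/hateBERT; any BERT-style checkpoint such as bert-base-uncased could be substituted, and the two example texts are placeholders.

```python
# Transfer learning sketch: fine-tune a pre-trained BERT-style model for
# binary offensive-language classification (Hugging Face transformers + PyTorch).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "GroNLP/hateBERT" is the released HateBERT checkpoint; swap in
# "bert-base-uncased" if it is not available in your environment.
checkpoint = "GroNLP/hateBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["have a lovely day", "you are all worthless"]   # placeholder examples
labels = torch.tensor([0, 1])                            # 0 = not offensive, 1 = offensive

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One illustrative optimisation step; a real run iterates over a labelled dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))
```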
Finally, Vidgen et al. [7] proposed a dynamic dataset generation method for online hate
detection. Over several rounds, human annotators write adversarial examples designed to fool the
current model, and these examples are added to the training data, which helps improve the
performance and robustness of hate speech detection models.
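The skeleton below mimics that human-and-model-in-the-loop process; every helper and example string is a hypothetical stand-in, since the real procedure relies on trained annotators rather than code.

```python
# Skeleton of a human-and-model-in-the-loop data collection loop.
# All helpers are hypothetical stand-ins for the work done by annotators.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_model(texts, labels):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(texts, labels)

def collect_adversarial_examples(model, round_id):
    # Stand-in for annotators writing content that the current model misclassifies.
    candidates = [(f"tricky example {round_id}a", 1), (f"benign example {round_id}b", 0)]
    return [(t, y) for t, y in candidates if model.predict([t])[0] != y]

texts, labels = ["you are awful people", "what a lovely morning"], [1, 0]
for round_id in range(1, 4):                    # e.g. three collection rounds
    model = train_model(texts, labels)
    new = collect_adversarial_examples(model, round_id)
    texts += [t for t, _ in new]
    labels += [y for _, y in new]
print(len(texts), "examples after dynamic collection")
```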
1.1 Critical Discussion (Task 1)
Text classification is a critical task in natural language processing (NLP) and is used in many
applications, including sentiment analysis, spam filtering, news classification, and hate speech
detection. The traditional approach to text classification involves machine learning algorithms,
such as supervised, unsupervised, and semi-supervised methods. Stamatatos et al. [1] proposed a
supervised learning method that achieved high accuracy in text categorization but requires
labeled training data, which can be expensive and time-consuming to collect. Unsupervised
learning techniques, including clustering algorithms and latent semantic analysis, have been used
for text classification, but their precision and accuracy are lower than supervised learning
algorithms [2]. Semi-supervised learning combines supervised and unsupervised techniques,
achieving better results than either alone [3]. Feature-based classifiers built on character and
word n-grams have proved strong for hate speech detection [4, 5], while deep learning models,
particularly pre-trained transformers, have recently shown impressive results in text
classification. Rule-based approaches, such as Warner and
Hirschberg's [6] linguistic feature-based approach, require domain knowledge but can achieve
reasonable results. Transfer learning, such as the HateBERT model [8], enables the use of pre-
trained models to solve a new task with limited labeled data. Finally, Vidgen et al. [7] proposed a
dynamic dataset generation method to improve the performance of hate speech detection models.
The effectiveness of these techniques depends on the quality and quantity of labeled data, the
domain knowledge, and the complexity of the text classification task.

2. Review of Offensive Language Detection Methods (Task 2)


Text classification for offensive language and hate speech detection has become increasingly
important with the rise of social media and online communication. Various fields of research
have contributed to the development of text classification methods, including computational
linguistics [1], text analytics [2], and machine learning [4]. Supervised machine learning is a
popular approach that involves training a classifier on a labeled dataset of offensive or hate
speech examples, and then using the classifier to identify such language in new, unlabeled texts
[4]. Semi-supervised learning combines both labeled and unlabeled data to train a classifier, and
is useful when there is a limited amount of labeled data available [3]. Deep learning involves
training neural networks with multiple layers to identify patterns and features in the data, and has
been shown to be effective in detecting offensive language and hate speech [8]. Dynamic dataset
generation involves generating new training data dynamically to improve the performance of the
classifier [7]. Feature engineering involves identifying and selecting specific features in the text
that are indicative of offensive or hate speech, such as certain words or phrases, and is often used
in conjunction with machine learning approaches [5]. Title-based semantic subject indexing
involves using the titles of documents to index and classify them, rather than the full text, and
has been shown to be effective in achieving competitive performance compared to full-text
classification [10].
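As a small illustration of feature engineering in this setting, the sketch below concatenates word n-gram and character n-gram features before training a classifier; the examples and labels are placeholders and the feature set is a simplification of those used in the cited papers.

```python
# Feature engineering sketch: combine word and character n-gram features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

texts = [
    "you people make me sick",
    "thanks, that was really helpful",
    "nobody wants your kind here",
    "great to see everyone today",
]
labels = [1, 0, 1, 0]  # 1 = offensive, 0 = not offensive (placeholder labels)

features = FeatureUnion([
    ("words", TfidfVectorizer(ngram_range=(1, 2))),                       # word uni/bigrams
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),   # char n-grams
])

clf = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
clf.fit(texts, labels)
print(clf.predict(["nobody wants you here", "thanks for the help"]))
```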
2.1 Critical Discussion (Task 2)
The study discusses the importance of text classification for detecting offensive language and
hate speech in social media and online communication. It highlights various fields that have
contributed to the development of text classification methods, including computational
linguistics, text analytics, and machine learning. The review above covers six methods used for
text classification: supervised learning, semi-supervised learning, deep learning, dynamic
dataset generation, feature engineering, and title-based semantic subject indexing. The cited
papers provide further detail on the use of machine learning and deep learning for hate speech
detection, on combining feature engineering with machine learning, and on human-and-model-in-the-
loop data collection for dynamic dataset generation [7].

3. OLID Dataset Characterization (Task 3)


Who made & collected the data? Are they alright with you using it?
The SemEval 2019 dataset for the "OffensEval: Identifying and Categorizing Offensive
Language in Social Media" shared task was made and collected by Zampieri et al. [4]. The
authors have made the dataset publicly available for research purposes, and the usage of the
dataset is permitted as long as the authors and the original source are properly cited and
acknowledged [4].
What is in the data? Is it what you need for your work?
The data in the SemEval 2019 (OLID) dataset consists of Twitter posts annotated with a
hierarchical three-level scheme: whether a post is offensive (subtask A), whether the offence is
targeted (subtask B), and the target of the offence, i.e. an individual, a group, or another
entity (subtask C) [4]. The dataset contains roughly 14,100 tweets in total, of which about
13,240 are used for training and 860 for testing [4].
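To make this description concrete, the snippet below shows one way to load the OLID training split with pandas, assuming the tab-separated distribution with columns id, tweet, subtask_a, subtask_b and subtask_c; the file path should be adjusted to a local copy.

```python
# Loading the OLID training split (tab-separated). File name and column names
# follow the public OLID distribution; adjust the path to your own copy.
import pandas as pd

df = pd.read_csv("olid-training-v1.0.tsv", sep="\t")

texts = df["tweet"].tolist()
labels_a = df["subtask_a"].tolist()   # OFF / NOT (offensive or not)

print(len(texts), "training tweets")
print(df["subtask_a"].value_counts())
```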
Where was this data produced? Where is it now?
The data was collected from Twitter and annotated as part of a shared task challenge designed to
advance the field of offensive language detection in social media [4]. The authors aimed to
encourage the development of better computational models for the detection and categorization of
offensive language, with potential applications in content moderation and online safety [4]. The
dataset remains publicly available for research use.
Why was it produced? Can you trust it?
The OLID dataset was produced for the SemEval 2019 Task 6 (OffensEval) competition, which aimed
to advance the state of the art in detecting and categorizing offensive and abusive language in
social media. The dataset was specifically designed to enable the development and evaluation of
computational models for offensive language identification.
The dataset was produced by a team of researchers led by Marcos Zampieri at the University of
Wolverhampton, UK, and was made publicly available for use in research and development of
hate speech detection systems.
The dataset was created by collecting tweets that were manually annotated by human annotators
following a shared set of guidelines, which supports its quality and its use for research and
development purposes. The annotators labelled whether each tweet was offensive, whether the
offence was targeted, and the target of the offence, such as an individual, a group, or another
entity.
When was it produced? What’s happened to it since?
The dataset was produced in 2019 and is now publicly available for use in research [4]. Since its
release, the dataset has been widely used by researchers to evaluate and compare different
offensive language detection models [4]. The dataset has also been used as a benchmark for the
development of new models, including those using deep learning techniques [8]. Overall, the
SemEval 2019 dataset is a well-established and trusted resource for research on offensive
language detection in social media.

4. Summary
Text classification is an important NLP task with numerous real-world applications.
There are several basic methods used in text classification, each with its own advantages and
disadvantages. Rule-based classification is easy to interpret but requires a high degree of domain
expertise, while Naive Bayes is computationally efficient but may suffer from data sparsity.
SVM can handle high-dimensional data and has a strong theoretical foundation, but it can be
computationally expensive. Deep learning models can achieve state-of-the-art performance but
require large amounts of labeled data and can be computationally expensive to train. The choice
of method depends on the specific problem and available resources.
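To illustrate the Naive Bayes versus SVM trade-off mentioned above, the sketch below cross-validates both classifiers on identical TF-IDF features; the repeated toy texts are placeholders, and a real comparison would use a proper labelled corpus.

```python
# Quick comparison of Naive Bayes and a linear SVM on identical TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "win a free phone now", "cheap loans available today", "claim your prize",
    "lunch at noon tomorrow?", "minutes from yesterday's meeting", "see you at the gym",
] * 5   # repeat the toy examples so 3-fold cross-validation has enough data
labels = [1, 1, 1, 0, 0, 0] * 5

for name, model in [("NaiveBayes", MultinomialNB()), ("LinearSVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(name, scores.mean())
```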
Various techniques have been proposed for text classification, including supervised,
unsupervised, semi-supervised, rule-based, and deep learning methods. The effectiveness of
these methods depends on the quality and quantity of labeled data, the domain knowledge, and
the complexity of the text classification task.
Various fields have contributed to the development of text classification methods for detecting
offensive language and hate speech, with machine learning and deep learning being among the
most popular and effective approaches. Dynamic dataset generation, feature engineering, and
title-based indexing are also advanced methods that can improve the performance of classifiers.
5. References

[1]. Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.
[2]. Moreno, A., & Redondo, T. (2016). Text analytics: The convergence of big data and artificial intelligence. IJIMAI, 3(6), 57-64.
[3]. Casamayor, A., Godoy, D., & Campo, M. (2010). Identification of non-functional requirements in textual specifications: A semi-supervised learning approach. Information and Software Technology, 52(4), 436-445.
[4]. Malmasi, S., & Zampieri, M. (2017). Detecting hate speech in social media. arXiv preprint arXiv:1712.06427.
[5]. Waseem, Z., & Hovy, D. (2016, June). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop (pp. 88-93).
[6]. Warner, W., & Hirschberg, J. (2012, June). Detecting hate speech on the World Wide Web. In Proceedings of the Second Workshop on Language in Social Media (pp. 19-26).
[7]. Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2020). Learning from the worst: Dynamically generated datasets to improve online hate detection. arXiv preprint arXiv:2012.15761.
[8]. Caselli, T., Basile, V., Mitrović, J., & Granitzer, M. (2020). HateBERT: Retraining BERT for abusive language detection in English. arXiv preprint arXiv:2010.12472.
[9]. Galke, L., Mai, F., Schelten, A., Brunsch, D., & Scherp, A. (2017, December). Using titles vs. full-text as source for automated semantic document annotation. In Proceedings of the Knowledge Capture Conference (pp. 1-4).
[10]. Mai, F., Galke, L., & Scherp, A. (2018, May). Using deep learning for title-based semantic subject indexing to reach competitive performance to full-text. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (pp. 169-178).
