Project Report
CLUSTER BASED ML
A PROJECT REPORT
CS452 – PROJECT II
By
Batch No: 07
DECEMBER 2022
R.V.R. & J.C. COLLEGE OF ENGINEERING
(Autonomous)
CERTIFICATE
ACKNOWLEDGEMENT
We are glad to express our special thanks to Dr. B. Vara Prasad Rao, our project
guide and in-charge, who inspired us to select this topic and offered valuable advice
in preparing this project work.
We are very much thankful to Dr. K. Ravindra, Principal of R.V.R. & J.C.
College of Engineering, Guntur, for providing a supportive environment.
ABSTRACT
Using Internet technologies, people often share their thoughts, feedback, news and
information with others. Owing to social media sites, the speed and ease of communication
have improved. Users post their views on public sentiment websites such as Facebook,
Instagram, Twitter, blogs, WhatsApp, Snapchat and LinkedIn; millions of tweets and
thousands of posts and messages are published each day. Twitter, now one of the most
widely popular of these social media sites, offers an easy and quick way to evaluate
consumers' opinions on a product or service.

One approach to capturing customers' impressions or opinions of a product is to
build a sentiment analysis system. Twitter is a microblogging platform where users submit
feedback to a community of followers in the form of short posts, or tweets. A tweet may be
classified as positive, negative or neutral depending on the viewpoint it expresses. In this
work we investigate the sentiment of Twitter messages using a clustering approach based
on a machine learning (ML) algorithm. Our experiments are carried out on training and test
collections drawn from a dataset of about one lakh tweets, and show that our system can
determine whether a tweet is positive or negative.
TABLE OF CONTENTS
CHAPTER DESCRIPTION PAGE
NO
TITLE i
CERTIFICATE ii
ACKNOWLEDGEMENT iii
ABSTRACT iv
LIST OF FIGURES vii
LIST OF TABLES viii
LIST OF ABBREVIATIONS ix
1 INTRODUCTION 1
1.1 Background 1
1.2 Problem Statement 1
1.3 Significance of Work 2
1.4 Objectives 2
2 LITERATURE SURVEY 3
2.1 Review of the Project 3
2.2 Limitations of the existing system 6
3 SYSTEM ANALYSIS 7
3.1 Requirement Specification 7
4.3 Dataset 25
5 IMPLEMENTATION 26
5.1 Algorithms 26
6 TESTING 28
6.1 Objectives of Testing 28
7 RESULTS 32
8 CONCLUSION 34
9 REFERENCES 35
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1.INTRODUCTION
1.1.Background
With the advent of smartphones such as the iPhone, today's generation has become very
active on online social networking websites like Twitter, Facebook and Instagram. These
websites have become an increasingly powerful tool for people of different cultures, religions,
backgrounds and interests to share views and thoughts. As their popularity grows, online
activity gradually increases, and so does the spread of hateful activity, which in turn exploits
the social infrastructure. Although these OSN websites provide the public an open space to
discuss their thoughts and opinions on different backgrounds, events, cultures and beliefs via
comments, blog posts and exchanged messages, the discussion is open and no one is physically
present, so it is nearly impossible to control the emotional tone of the content.
In addition, given differing conditions, customs and beliefs, many people use aggressive,
violent and hateful speech when arguing with others who do not share the same background.
Moreover, with the fast development of OSNs, exchanges of thoughts about a controversial
topic or a big event can turn into conflicts, which in turn divide people into two camps, one
supporting it and one opposing it.
These hateful activities create tension among groups of individuals on social networks and
the Internet, which in turn affects the business economy, and online conflicts sometimes turn
into real-world conflicts. Considering this, many websites such as Twitter, YouTube and
Facebook place restrictions on hateful or offensive language, yet even with such restrictions
the content is very difficult to control. Developing effective methods of detecting hateful
speech is therefore one of the foremost tasks for online social networking websites, and in
this context OSN companies invest hundreds of millions of euros in it every year.
1.2.Problem Statement
Hate speech, or incitement to hatred, is, in general terms, any act of communication that
demeans or incites hatred against a person or group based on characteristics such as race,
gender, ethnicity, nationality, religion, sexual orientation or another discriminating attribute.
When a homosexual person is offended because of his sexual orientation, all homosexuals
are offended, just as when a black man is offended for the simple reason that he is black, all
black people are offended. Hate speech goes beyond the limits of common sense, since it
aims to promote violence, discrimination or prejudice against a group or cluster of people.
This is the reason hate speech is viewed as one of the major issues that many nations and
organizations have been confronting. With the spread of the web and the development of
online social networks, this issue has become more serious, since communication between
individuals is not direct, and individuals' speech becomes more violent when they feel
physically safe.
1.4.Objectives
In recent years a large number of studies have pursued automation, but few work on
real-time Twitter tweets. Our approach performs detection on real-time Twitter tweets using
priority-based PoS dictionary tagging, sentiment-sensitive priority dictionary tagging, and
directly stated feature-extraction pair generation. Tweets are then divided into three
clusters: positive, negative, and neutral. The negative cluster is further divided into "hate",
"offensive" and "clean". We can then easily remove hateful tweets posted on any OSN
website such as Twitter by removing the offensive clusters.
2.LITERATURE SURVEY
Hajime Watanabe et al. (2018) perform hate speech detection based on collected hateful
and offensive expressions and classify tweets into three classes: "clean", "hateful", and
"offensive". This classification is done based on pattern writing and feature extraction, and
feature extraction is performed in four different ways. First, features are extracted by
classifying tweets as positive, negative, or neutral; the positive and negative scores of words
are taken from SentiStrength, a tool that assigns sentiment scores to the words and sentences
from which it is collected and composed. Second, semantic features such as punctuation are
extracted; they believed that punctuation marks are helpful in detecting offensive or hateful
speech. Third, unigram features are extracted from the training set, PoS tagging of nouns,
verbs, adjectives and adverbs is performed, and the results are stored in three different lists,
one for each PoS tag. If a word is present in a tweet then its corresponding feature value is
set to 'true'; otherwise it is set to 'false'. During the extraction process they extracted a total
of 1373 words, so 1373 unigram features are defined. Fourth, in pattern extraction they
divide tweet words into two categories: "sentimental words (SW)", tagged with the PoS of
noun, verb, adjective or adverb, and "non-sentimental words (NSW)", tagged with any other
PoS. Patterns are extracted from tweets by replacing words belonging to either "SW" or
"NSW" with their corresponding simplified PoS tags. Finally, they perform parameter
optimization by trying different values for each parameter. Using these optimized
parameters, the accuracy, recall, precision and F1-score of the classification are calculated
using different classifiers. Overall, 2010 tweets are considered in the experiments, giving an
accuracy of 78.4% for detecting whether a tweet is hateful and an accuracy of 87.4% when
detecting offensive tweets using binary classification.
Warner and Julia (2012) adapted hate speech detection with word-sense disambiguation.
They used a template-based strategy to extract features. This approach uses a set of 1000
paragraphs and labels each word in a paragraph with its literal, its PoS tag and its Brown
cluster. Each template is centered around a single unigram feature and PoS tag. For every
paragraph all possible templates were generated, and a count of each template was
maintained based on its positive and negative occurrences. The log-odds are calculated from
the ratio of positive to negative occurrences; a template that is recognized as neither positive
nor negative is discarded, since the log-odds calculation is ratio based. A total of 4379
features are processed. These extracted features are then fed to an SVM classifier. This work
achieves a classification accuracy of 94% using binary classification and an F1-score of
64%; the recall of the system is low.
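The template scoring described here can be sketched as a log-odds computation. The add-one smoothing below is our assumption, since the survey does not give the paper's exact formula:

```python
import math

def template_log_odds(pos_count, neg_count):
    # Log of the ratio of a template's positive to negative occurrences,
    # with add-one smoothing so one-sided templates stay finite.
    return math.log((pos_count + 1) / (neg_count + 1))

print(template_log_odds(5, 5))       # 0.0 -- template carries no signal
print(template_log_odds(30, 2) > 0)  # True -- predominantly positive template
```

Templates whose smoothed log-odds sit near zero carry no class signal and would be the ones discarded.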
Paula and Nunes (2018) describe the current scenario, approaches, algorithms and a
unifying definition of hate speech and its detection. They analyze hate speech on different
OSN websites and in different contexts, and based on this analysis they provide a unifying
definition of hate speech. Additionally, they present several examples and classification
rules, along with the arguments for or against each rule. They conclude that hate speech
should be compared across related areas including discrimination, cyberbullying, toxicity,
abusive language, flaming, radicalization and extremism. They specify that feature
extraction for hate speech can be done with general text-mining algorithms, including
PoS-based approaches, sentiment analysis, n-gram approaches, and deep learning, or with
hate-speech-specific features that address a particular group, language or stereotype.
Finally, they showcase the opportunities and challenges in this area: the lack of tools and
platforms for automatic hate speech detection, insufficient comparative studies of existing
approaches, and the lack of work on languages other than English.
Gitari et al. (2015) summarize the problem of hate speech into the three fields of
nationality, race and religion. Their work focuses on creating a classifier for detecting hate
speech on social media platforms. The hate speech classifier and subjectivity analysis are
developed using a rule-based learning approach. Objective sentences are separated from
subjective sentences using a sentence-level sentiment analysis approach. A lexicon of hate
speech is built by extracting subjective and semantic word features, and the lexicon is
augmented with noun patterns using bootstrapping. They annotate each sentence with one
of three levels: "non-hateful (NH)", "weakly hateful (WH)" and "strongly hateful (SH)".
Testing the application over 500 labeled paragraphs, using grammatical patterns and
semantic features, the classifier achieves an F1-score of 65.12%.
Olney et al. (2009) describe a generalization of latent semantic analysis (LSA) using local
context and n-grams. LSA is a vector-space technique used here to represent the meaning of
words. Usually, LSA is carried out in two steps: first a word-document matrix is formed,
followed by a singular value decomposition of the matrix; the document matrix is formed
according to the word dimension and the degree of approximation of the specified
documents. In this paper they use LSA to form a content matrix characterized by features
rather than by the words of a document matrix.
Le and Mikolov (2014) address the detection of hate speech in online user comments.
This is a two-step approach: in the first step, joint modeling of words and comments is
carried out using paragraph2vec; in the second step, a continuous bag-of-words (CBOW)
neural language model is used to represent the words and comments in a joint space. The
embeddings are then used to train a binary classifier that separates comments into clean
and hateful ones. The performance evaluation is carried out on a large-scale dataset of user
comments collected from the Yahoo Finance website, consisting of 56,280 hate and 895,456
clean comments. This work achieves 94% classification accuracy and a 63.75% F1-score.
Chikashi et al. (2016) target abusive language detection in user-generated online content.
The approach performs the classification task using n-gram, lexicon, syntactic, pretrained,
linguistic, "comment2vec" and "word2vec" features. Based on these features, user content
is classified into two classes: "clean" and "abusive". This approach uses the same dataset
as Le and Mikolov (2014): the Yahoo Finance data consisting of 56,280 comments labeled
"abusive" and 895,456 comments labeled "clean". The accuracy reaches 90%, and the
recall, precision and F-score are 79%, 77% and 78% respectively.
Kwok et al. (2013) detect hateful sentences on Twitter directed against black people. In
this work unigram feature extraction is employed, and an average accuracy of 76% is
achieved for binary classification. The classifier is not trained on bigram features, so in
some instances it gives erroneous results, with an error rate of 24%. Hate speech detection
directed at a particular gender, racial group or ethnicity is one of the few areas where
research is still needed: because this work collects unigram features related to one
particular group, its built-in dictionary of unigram features cannot be reused to detect hate
speech towards other groups with similar efficiency.
Burnap and Williams (2015) used typed dependency relationships between words, together
with a bag-of-words (BoW) feature-extraction approach, to distinguish hate speech
expressions from clean ones.
2.2.Limitations of the existing systems
➢ Unintelligent schemes for data preprocessing (mean/mode substitution)
➢ Missing values replaced with simple averages
3.SYSTEM ANALYSIS
3.1.Requirement Specification
3.1.1.FUNCTIONAL REQUIREMENTS
Functional requirements define the internal workings of the software: that is, the
technical details, data manipulation and processing and other specific functionality that show
how the use cases are to be satisfied. They are supported by non-functional requirements,
which impose constraints on the design or implementation.
3.1.1.1.Hardware Requirements
System : Intel Core i5-2600
RAM : 4 GB or more
3.1.1.2.Software Requirements
O/S : Windows 7/8/10
Platform : Google Colab
Language : Python
3.1.2.NON-FUNCTIONAL REQUIREMENTS
Non-functional requirements are requirements which specify criteria that can be used
to judge the operation of a system, rather than specific behaviors. This should be contrasted
with functional requirements that specify specific behavior or functions. Typical non-
functional requirements are reliability, scalability, and cost. Non-functional requirements are
often called the "ilities" of a system. Other terms for non-functional requirements are
"constraints", "quality attributes" and "quality of service requirements".
Reliability: If any exceptions occur during the execution of the software it should be caught
and thereby prevent the system from crashing.
Scalability: The system should be developed in such a way that new modules and
functionalities can be added, thereby facilitating system evolution.
Cost: The cost should be low because the software packages used are freely available.
3.2.UML Diagrams
UML is an acronym that stands for Unified Modeling Language. Simply put, UML is
a modern approach to modeling and documenting software. In fact, it is one of the most
popular business process modeling techniques.
It is based on diagrammatic representations of software components. As the old proverb
says: “a picture is worth a thousand words”. By using visual representations, we are able to
better understand possible flaws or errors in software or business processes.
The elements are like components that can be associated in different ways to make
a complete UML picture, which is known as a diagram. Thus, it is very important to
understand the different diagrams in order to apply this knowledge to real-life systems.
Any complex system is best understood by making some kind of diagram or picture.
These diagrams have a better impact on our understanding. If we look around, we will
realize that diagrams are not a new concept; they are used widely, in different forms, in
different industries.
The various UML diagrams are:
1. Use case diagram
2. Activity diagram
3. Sequence diagram
4. Collaboration diagram
5. Object diagram
6. State chart diagram
7. Class diagram
8. Component diagram
9. Deployment diagram
3.2.1.Use Case View
3.2.1.1.Identification of Actors
A use case diagram is a graph of actors, a set of use cases enclosed by a system
boundary, communication (participation) associations between the actors and the use
cases, and generalizations among use cases. The use case model defines the outside
(actors) and inside (use cases) of the system's behavior. Actors are not part of the system;
they represent anyone or anything that interacts with (provides input to or receives output
from) the system. Use case diagrams can be used during analysis to capture the system
requirements and to understand how the system should work. During the design phase,
you can use use case diagrams to specify the behavior of the system as implemented. A
use case is a sequence of transactions performed by a system that yields a measurable
result of value to a particular actor. The use cases are all the ways the system may be used.
3.2.2.Activity Diagram
An activity diagram is a variation of a special case of a state machine, in which the states
are activities representing the performance of operations and the transitions are triggered
by the completion of those operations. The purpose of an activity diagram is to provide a
view of flows and of what is going on inside a use case or among several classes. Activity
diagrams contain activities, transitions between the activities, decision points, and
synchronization bars. An activity represents the performance of some behavior in the
workflow. In UML, activities are represented as rectangles with rounded edges, transitions
are drawn as directed arrows, decision points are shown as diamonds, and synchronization
bars are drawn as thick horizontal or vertical bars. The activity icon appears as a rectangle
with rounded ends, with a name and a compartment for actions.
3.2.3.Sequence Diagram
A sequence diagram is an interaction diagram that shows how processes operate with
one another and in what order. It is a construct of a Message Sequence Chart. A sequence
diagram shows object interactions arranged in time sequence. It depicts the objects and
classes involved in the scenario and the sequence of messages exchanged between the
objects needed to carry out the functionality of the scenario. Sequence diagrams are
typically associated with use case realizations in the Logical View of the system under
development. Sequence diagrams are sometimes called event diagrams.
Fig 3.4: Class Diagram for the System
A state chart diagram cannot be created for every class in the system; it is only for those
class objects with significant behavior. State transition:
A state transition indicates that an object in the source state will perform certain
specified actions and enter the destination state when a specified event occurs or when certain
conditions are satisfied. A state transition is a relationship between two states, two
activities, or between an activity and a state.
We can show one or more state transitions from a state as long as each transition is
unique. Transitions originating from a state cannot have the same event, unless there are
conditions on the event.
3.2.6.Component Diagram
Fig 3.6: Component Diagram
3.2.7.Deployment Diagram
4.SYSTEM DESIGN
4.2.Proposed System
With a given set of tweet comments, our work clusters the comments into three
clusters: positive, negative, and neutral. The negative cluster is further divided into "hate",
"offensive" and "clean". A machine learning approach is used for classification and to
extract features from labeled and unlabeled datasets.
The data processing agent manages the data set used for training the prediction model
and storing prediction results. It receives data from all the other agents and stores it in a
database for further processing.
The main tasks of the agent are as follows:
1) Preprocessing: This comprises estimating missing data entries and removal of outliers.
2) Storage: This comprises saving received data securely for further analysis.
3) Communication: This comprises transmitting the data to the relevant agent.
The data processing agent is responsible for estimating missing values and removing outliers.
We represent each position of the weighted sequence as a vector that contains all the
symbols of the alphabet and their corresponding probabilities; if a character does not appear
at a specific position then its probability is zero. As a result, the weighted matrix
representation is used for weighted sequences. At the first stage, input sentences are sent
to the punctuation removal module. To remove punctuation we use a regular expression
that removes irrelevant punctuation marks from the sentence. We use this module at the
beginning of the approach because we are working with real-time Twitter sentences, which
contain hashtags (#), mentions (@), repeated exclamation marks, links and other
punctuation marks.
E.g., given a sentence like “!!!!“@selfequeenbri: cause I’m tired of you big bitches
coming for us skinny girls!!””, after punctuation removal the output sentence would be
“8220selfequeenbri cause Im tired of you big bitches coming for us skinny girls8221”.
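The report's exact regular expression is not given; a minimal sketch with Python's `re` module might look like the following (`clean_tweet` is an illustrative name, and unlike the report's output this sketch does not preserve HTML entity codes such as 8220/8221):

```python
import re

def clean_tweet(text):
    """Strip links, then all punctuation (including @ and # markers)
    from a raw tweet, and collapse runs of whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # drop links first
    text = re.sub(r"[^\w\s]", " ", text)       # drop remaining punctuation
    return " ".join(text.split())              # collapse whitespace

raw = '!!!! "@selfequeenbri: cause I\'m tired of you big bitches coming for us skinny girls!!"'
print(clean_tweet(raw))
# selfequeenbri cause I m tired of you big bitches coming for us skinny girls
```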
4.2.3.Feature extraction
(a) First phase: In this phase we perform sentence tokenizing, which divides a user's
review into separate sentences. This sentence splitting is done using
PunktSentenceTokenizer from the nltk.tokenize.punkt module, which is normally used
where we have huge data as input; it tokenizes sentences at punctuation marks.
(b) Second phase: In this phase, each sentence is subdivided into unigram words, for
which we use TreebankWordTokenizer.
4.2.4.Priority based part of speech dictionary tagging (PBPoSDT)
Here we use priority-based dictionary tagging, which works on the following priority rules.
First we perform part-of-speech (PoS) tagging, in which we assign a tag to each word, such
as N, V, ADJ, ADV, P, CON, DT, NN, JJ, PRO, NP or INT, as mentioned in Eq. (4).
The tagging performs a sequential search from left to right, comparing each word with
the dictionary words to tag that word by category. We have divisions for positive, negative,
incrementive, decrementive, hate and offensive words. If any tag occurs, it is added to the
previous tagging: each word Wi is compared with each dictionary Di, where
Di ∈ {D1, D2, …, Dn}.
E.g., in our case the word "bitches" comes under the hate division, so its tagging would
be ('bitches', 'bitches', ['Hate', 'NNS']). The word "tired" is considered a negative word, so
negative tagging is added: ('tired', 'tired', ['negative', 'VBD']). Similarly, "skinny" comes
under negative polarity, so its tagging would be ('skinny', 'skinny', ['negative', 'VBP']).
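The dictionary lookup described above can be sketched as follows. The word lists here are hypothetical placeholders, since the report's full dictionaries are not given:

```python
# Hypothetical priority-ordered dictionaries; the entries are illustrative only.
DICTIONARIES = {
    "Hate": {"bitches"},
    "offensive": {"idiot"},
    "negative": {"tired", "skinny", "worry"},
    "positive": {"love", "amazing"},
}

def priority_tag(word, pos_tag):
    """Scan the dictionaries in priority order; the first one containing
    the word contributes its sentiment label alongside the PoS tag."""
    for label, words in DICTIONARIES.items():
        if word.lower() in words:
            return (word, word, [label, pos_tag])
    return (word, word, [pos_tag])

print(priority_tag("bitches", "NNS"))  # ('bitches', 'bitches', ['Hate', 'NNS'])
print(priority_tag("tired", "VBD"))    # ('tired', 'tired', ['negative', 'VBD'])
```

Because Python dictionaries preserve insertion order, iterating over `DICTIONARIES` naturally applies the priority rules top to bottom.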
Here we use a machine learning approach for clustering based on sentiment. The approach
is performed using sentence weight calculation and weight score prediction.
4.2.8.Sentence weight calculation
After this tagging we calculate the weight of the entire sentence. To calculate the weight,
each token from the above step is assigned a weight: a positive token is assigned +1, a
negative-polarity token −1, and offensive and hate tokens −2. Normally, the cumulative
weight of all the word tokens is taken as the total weight of the sentence, but in many cases
this gives wrong results: words like "very", "not" or "little" in a sentence change its weight.
Words like "very", "amazingly" and "unbelievable" give additional weight to a sentence, so
for this type of incrementive word we double the existing weight of the sentence. Words
like "barely" and "little" reduce the impact of a sentence, so for them we halve the existing
weight. A word like "not" inverts the meaning of a sentence, so for this type of word we
negate the existing weight of the sentence.
In our case the sentence has the words "bitches", "tired" and "skinny", so the weight of the
sentence is [bitches → hate → (−2)] + [tired → negative → (−1)] + [skinny → negative →
(−1)] = −2 − 1 − 1 = −4.
With the weight calculated, we can divide sentences into clusters based on their weight. We
have mainly positive, neutral and negative clusters, but the negative cluster can be
subdivided into hate and offensive clusters. In our case the total score obtained is −4, due
to a hate word, so the sentence belongs to the hate cluster.
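The weighting scheme can be sketched as below. The modifier word lists are the examples from the text, and applying the modifiers after summing the token weights is our reading of "the existing weight of a sentence":

```python
# Weights from the scheme above: positive +1, negative -1, hate/offensive -2.
TAG_WEIGHT = {"positive": 1, "negative": -1, "Hate": -2, "offensive": -2}
INCREMENTIVE = {"very", "amazingly", "unbelievable"}
DECREMENTIVE = {"barely", "little"}
NEGATION = {"not"}

def sentence_weight(tagged_words):
    """tagged_words: (word, sentiment_tag) pairs, tag None for plain words.
    Token weights are summed first; modifier words then double, halve or
    negate the sentence weight."""
    weight = sum(TAG_WEIGHT.get(tag, 0) for _, tag in tagged_words)
    for word, _ in tagged_words:
        if word in INCREMENTIVE:
            weight *= 2
        elif word in DECREMENTIVE:
            weight /= 2
        elif word in NEGATION:
            weight = -weight
    return weight

# The report's running example: bitches (hate), tired and skinny (negative)
tags = [("bitches", "Hate"), ("tired", "negative"), ("skinny", "negative")]
print(sentence_weight(tags))  # -4
```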
Figure 4.2.1: Step-wise Results
Step 2: A sentence that directly states an opinion about a particular feature can be detected
through noun–adjective pairs.
• If (NN ∈ ∑P(Wi) or NNP ∈ ∑P(Wi) or NNS ∈ ∑P(Wi) or NNPS ∈ ∑P(Wi)),
each P(Wi) is added to the NNDetect (NND) array.
• If (JJ ∈ ∑P(Wi) or JJP ∈ ∑P(Wi) or JJS ∈ ∑P(Wi) or JJPS ∈ ∑P(Wi)),
each P(Wi) is added to the ADJDetect (ADJD) array.
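The noun–adjective pair detection can be sketched as follows. Pairing each adjective with the nearest preceding noun is a simplifying assumption; the report only specifies the two tag-detection arrays:

```python
NOUN_TAGS = {"NN", "NNP", "NNS", "NNPS"}
ADJ_TAGS = {"JJ", "JJP", "JJS", "JJPS"}  # adjective tags as listed above

def direct_stated_pairs(tagged):
    """Collect (noun, adjective) pairs from (word, PoS-tag) tuples,
    pairing each adjective with the nearest preceding noun."""
    pairs, last_noun = [], None
    for word, pos in tagged:
        if pos in NOUN_TAGS:
            last_noun = word
        elif pos in ADJ_TAGS and last_noun is not None:
            pairs.append((last_noun, word))
    return pairs

print(direct_stated_pairs([("battery", "NN"), ("is", "VBZ"), ("terrible", "JJ")]))
# [('battery', 'terrible')]
```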
Figure 4.2.4: Performance measurements for the direct stated priority-based pair
generation method
For this type of user summary, we have to calculate the weight of each feature–opinion
pair. This is done by grouping the feature–opinion pairs of each feature into one category
and summing over each separate category to calculate its weight. After the frequencies are
calculated, the weighted summary of each feature is represented: a word cloud is generated
for each unique feature. From this approach we can identify users' opinions on a particular
category.
Step 3: For each feature Fi, sort (FCi1, FCi2, …, FCin) for every opinion.
Step 4: Generate a word cloud for every feature based on its weights; a higher-weight word
is shown in a larger font.
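The per-feature weighting behind Steps 3–4 can be sketched with a frequency count per feature; the word-cloud rendering itself is omitted, and `feature_summary` and the sample pairs are illustrative:

```python
from collections import Counter

def feature_summary(pairs):
    """Group feature-opinion pairs per feature and count each opinion's
    frequency; the counts drive the word-cloud font sizes."""
    summary = {}
    for feature, opinion in pairs:
        summary.setdefault(feature, Counter())[opinion] += 1
    return summary

pairs = [("battery", "terrible"), ("battery", "weak"),
         ("screen", "bright"), ("battery", "terrible")]
print(feature_summary(pairs))
```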
4.3.Dataset
The dataset we collected is the Twitter sentiment analysis dataset, which is available online
with free access. It comprises around 1,00,000 tweets. The dataset link is
https://www.kaggle.com/youben/twitter-sentiment-analysis/data.
Sample rows from the dataset (ID, sentiment label, tweet text):
5  0  i think mi bf is cheating on me!!! T_T
6  0  or i just worry too much?
7  1  Juuuuuuuuuuuuuuuuussssst Chillin!!
5.IMPLEMENTATION
5.1.Algorithms
Priority Based Weighted Cluster Prediction is used to assign the given data to positive,
negative and neutral clusters based on the sentiment score calculated from that data. If the
calculated sentiment score is less than 0 the data belongs to the negative cluster, if it is
greater than 0 it belongs to the positive cluster, and if it is equal to 0 it belongs to the neutral
cluster.
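This rule amounts to a sign test on the score; a minimal sketch (`assign_cluster` is an illustrative name):

```python
def assign_cluster(score):
    """Map a sentence's sentiment score to a cluster by its sign,
    per the Priority Based Weighted Cluster Prediction rule above."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

print([assign_cluster(s) for s in (-4, 0, 2)])
# ['negative', 'neutral', 'positive']
```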
5.2.Experimental Setup
Machine Learning
Machine-learning-based text classifiers are a kind of supervised machine learning, where
the classifier needs to be trained on labeled training data before it can be applied to the
actual classification task. The training data is usually an extracted portion of the original
data, hand-labeled manually. After suitable training, the classifier can be used on the actual
test data. Naive Bayes is a statistical classifier, whereas the Support Vector Machine is a
kind of vector-space classifier. The statistical text classification scheme of Naive Bayes
(NB) can be adapted for the sentiment classification problem, since it can be visualized as
a two-class text classification problem with positive and negative classes. The Support
Vector Machine (SVM) is a vector-space-model-based classifier which requires that text
documents be transformed into feature vectors before they are used for classification;
usually the documents are transformed into multidimensional vectors. The entire
classification problem then becomes classifying every text document, represented as a
vector, into a particular class. SVM is a type of large-margin classifier: the goal is to find a
decision boundary between the two classes that is maximally far from any document in the
training data.
This approach needs:
⚫ a good classifier, such as Naive Bayes
⚫ a training set for each class
There are various training sets available on the Internet, such as the Movie Reviews dataset,
Twitter datasets, etc. The classes can be Positive and Negative, and for both classes we need
training datasets.
The Naive Bayes classifier is the simplest and most commonly used classifier. A Naive
Bayes classification model computes the posterior probability of a class based on the
distribution of the words in the document. The model works with BoW feature extraction,
which ignores the position of words in the document, and uses Bayes' theorem to predict
the probability that a given feature set belongs to a particular label.
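A dependency-free sketch of such a classifier is shown below: bag-of-words Naive Bayes with add-one (Laplace) smoothing. The class name and the tiny training sentences are invented for illustration; the project's actual training data is the Twitter dataset described earlier:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Minimal bag-of-words Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)                      # class frequencies
        self.counts = {c: Counter() for c in self.prior}  # word counts per class
        self.vocab = set()
        for doc, c in zip(docs, labels):
            words = doc.lower().split()
            self.counts[c].update(words)
            self.vocab.update(words)

    def predict(self, doc):
        n_docs = sum(self.prior.values())
        scores = {}
        for c in self.prior:
            total = sum(self.counts[c].values())
            score = math.log(self.prior[c] / n_docs)      # log prior
            for w in doc.lower().split():
                # log likelihood with add-one smoothing over the vocabulary
                score += math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

nb = NaiveBayesText()
nb.fit(["i love this great phone", "amazing happy good",
        "i hate this terrible phone", "awful bad sad"],
       ["positive", "positive", "negative", "negative"])
print(nb.predict("great amazing phone"))  # positive
```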
Python
NLTK
NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as
WordNet, along with a suite of text-processing libraries for classification, tokenization,
stemming, tagging, parsing and semantic reasoning, wrappers for industrial-strength NLP
libraries, and an active discussion forum.
NLTK has been called "a wonderful tool for teaching, and working in, computational
linguistics using Python" and "an amazing library to play with natural language". NLTK is
suitable for linguists, engineers, students, educators, researchers, and industry users alike.
Natural Language Processing with Python provides a practical introduction to
programming for language processing. Written by the creators of NLTK, it guides the
reader through the fundamentals of writing Python programs, working with corpora,
categorizing text, analyzing linguistic structure, and more.
6.TESTING
6.1.Objective of Testing
Testing is a fault detection technique that tries to create failure and erroneous states in a
planned way. This allows the developer to detect failures in the system before it is released to
the customer.
Note that this definition of testing implies that a successful test is a test that identifies
faults. We will use this definition throughout this phase. Another often-used definition of
testing is that it demonstrates that faults are not present.
Testing can be done in two ways:
• Top-down approach
• Bottom-up approach
Top-down approach –
This type of testing starts from the upper-level modules. Since the detailed activities
usually performed in the lower-level routines are not provided, stubs are written.
Bottom-up approach –
Testing is performed starting from the smallest and lowest-level modules and
proceeding one at a time. For each module in bottom-up testing, a short program executes the
module and provides the needed data, so that the module is asked to perform the way it will
when embedded within the larger system. In this project, the bottom-up approach is used: the
lower-level modules are tested first, followed by the higher-level modules that use them.
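The "short program that executes the module" described above is usually called a test driver. As a hedged sketch, a bottom-up driver for a hypothetical low-level tweet-cleaning module might look like this (the function clean_tweet and its rules are invented for illustration):

```python
import re

def clean_tweet(text):
    """Low-level module under test: strip URLs and @mentions, normalize whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"@\w+", "", text)           # remove mentions
    return " ".join(text.split())

# Bottom-up driver: feed the module the data it would receive in the full system
cases = [
    ("@user check http://t.co/x great phone", "check great phone"),
    ("no links here", "no links here"),
]
for raw, expected in cases:
    got = clean_tweet(raw)
    assert got == expected, f"{raw!r} -> {got!r}"
print("all driver checks passed")
```

Once such a low-level module passes its driver, the modules that call it can be tested next, working upward.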
6.2.Testing Methodologies
• Unit testing
• Integration testing
• User Acceptance testing
• Output testing
• Validation testing
Unit testing: Unit testing focuses verification effort on the smallest unit of software design,
that is, the module. Unit testing exercises specific paths in a module’s control structure to ensure
complete coverage and maximum error detection. This test focuses on each module
individually, ensuring that it functions properly as a unit; hence the name unit testing.
During this testing, each module is tested individually and the module interfaces are verified
for consistency with the design specification. All important processing paths are tested for the
expected results.
Integration testing: Integration testing addresses the issues associated with the dual problems
of verification and program construction. After the software has been integrated, a set of high-
order tests is conducted. The main objective of this testing process is to take unit-tested
modules and build a program structure that has been dictated by design.
User Acceptance testing: User acceptance of a system is the key factor for the success of any
system. The system under consideration is tested for user acceptance by constantly keeping in
touch with the prospective system users at the time of development and making changes
wherever required. The system developed provides a friendly user interface that can easily be
understood even by a person who is new to the system.
Output testing: After performing the validation testing, the next step is output testing of the
proposed system, since no system can be useful if it does not produce the required output in
the specified format. The outputs generated or displayed by the system under consideration are
tested by asking the users about the format they require. Hence the output format is
considered in two ways: one on screen and the other in printed format.
Validation testing: Validation checks are performed on the following fields:
Text field: The text field can contain only a number of characters less than or equal to its
size. The text fields are alphanumeric in some tables and alphabetic in other tables. An incorrect
entry always flashes an error message.
Numeric field: The numeric field can contain only the numbers 0 to 9. An entry of any other
character flashes an error message. The individual modules are checked for accuracy and for what
they have to perform. Each module is subjected to a test run along with sample data. The individually
tested modules are integrated into a single system. Testing involves executing the program with real
data; the existence of any program defect is inferred from the output. The testing should be planned
so that all the requirements are individually tested. A successful test is one that brings out the defects
for inappropriate data and produces output revealing the errors in the system.
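The two field checks above can be sketched with regular expressions; the field sizes, function names and error messages below are illustrative assumptions, not taken from the project:

```python
import re

def validate_text_field(value, size, alphabetic_only=False):
    """Text field: at most `size` characters, alphabetic or alphanumeric
    depending on the table (both variants described in the report)."""
    if len(value) > size:
        return "error: too long"
    pattern = r"^[A-Za-z]*$" if alphabetic_only else r"^[A-Za-z0-9]*$"
    return "ok" if re.match(pattern, value) else "error: invalid characters"

def validate_numeric_field(value):
    """Numeric field: digits 0-9 only; any other character flashes an error message."""
    return "ok" if re.fullmatch(r"[0-9]+", value) else "error: numbers only"

print(validate_text_field("tweet123", 20))   # ok
print(validate_text_field("hello!", 20))     # error: invalid characters
print(validate_numeric_field("2022"))        # ok
print(validate_numeric_field("20a2"))        # error: numbers only
```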
Preparation of test data: The above testing is done by taking various kinds of test data.
Preparation of test data plays a vital role in system testing. After preparing the test data,
the system under study is tested using that test data. While testing the system with the test data,
errors are again uncovered and corrected by using the above testing steps, and the corrections are
noted for future use.
Using live test data: Live test data are those that are actually extracted from organization
files. After a system is partially constructed, programmers or analysts often ask users to key
in a set of data from their normal activities. Then, the systems person uses this data as a way to
partially test the system. In other instances, programmers or analysts extract a set of live data
from the files and have them entered themselves. It is difficult to obtain live data in sufficient
amounts to conduct extensive testing. And, although it is realistic data that will show how
the system will perform for the typical processing requirement, assuming that the live data
entered are in fact typical, such data generally will not test all the combinations or formats that can
enter the system. This bias toward typical values then does not provide a true systems test
and in fact ignores the cases most likely to cause system failure.
Using artificial test data: Artificial test data are created solely for test purposes, since they
can be generated to test all combinations of formats and values. In other words, the artificial
data, which can quickly be prepared by a data-generating utility program in the information
systems department, make possible the testing of all logic and control paths through the
program. The most effective test programs use artificial test data generated by persons other
than those who wrote the programs. Often, an independent team of testers formulates a testing
plan, using the system specifications. The developed system has satisfied all the requirements
specified in the software requirement specification and was accepted.
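A data-generating utility of the kind described can be sketched as follows; the field formats (sentiment label, tweet length class, character set) and sample values are hypothetical, chosen only to show how every combination of formats and values can be enumerated:

```python
import itertools

# Hypothetical field formats to cover
labels = ["positive", "negative", "neutral"]
lengths = ["empty", "short", "max"]
charsets = ["alpha", "alnum", "symbols"]

# Representative text for each (length, charset) combination
SAMPLES = {
    ("empty", "alpha"): "", ("empty", "alnum"): "", ("empty", "symbols"): "",
    ("short", "alpha"): "good", ("short", "alnum"): "good1", ("short", "symbols"): "good!",
    ("max", "alpha"): "a" * 280, ("max", "alnum"): "a1" * 140, ("max", "symbols"): "a!" * 140,
}

def generate_artificial_tests():
    """Enumerate every label x length x charset combination, unlike live data."""
    for label, length, charset in itertools.product(labels, lengths, charsets):
        yield {"label": label, "text": SAMPLES[(length, charset)]}

cases = list(generate_artificial_tests())
print(len(cases))   # 27 combinations
```

Because the generator covers the full Cartesian product, it exercises the atypical cases that live data tends to miss.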
6.3.Test Cases
Test Case Id: 2
Test Scenario: Estimating Positive Tweet
Test Case: Estimating Dataset with Positive Sentiment
Pre Condition: Availability of features
Test Steps: Import the dataset
Expected Result: No Error
Post Condition: Positive Tweet Estimated
Actual Result: Imputation of Positive Tweet
Test Status (P/F): P

Test Case Id: 3
Test Scenario: Estimating Positive Tweet
Test Case: Estimating Dataset with Negative Sentiment
Pre Condition: Availability of features
Test Steps: Import the dataset
Expected Result: No Error
Post Condition: Negative Tweet Estimated
Actual Result: Imputation of Positive Tweet
Test Status (P/F): P
7.RESULTS
From the dataset, we have taken 10 different sets of randomized comments for testing
purposes. For performance measurement, we have taken precision as the ratio of correctly
identified outcomes to the total generated outcomes, and recall as the ratio of correctly identified
outcomes to the total inputs. To calculate the overall performance, we have measured the F-score,
defined as twice the product of precision and recall divided by their sum. In the first stage, we have
calculated the performance of cluster prediction based on priority. Refer to Table 1 for the result
analysis of weighted cluster prediction, which shows the performance measurements for the hate,
clean and offensive clusters. The graph shows that we have more than 90% accuracy in all
parameters, which is better than the existing approach. As mentioned in Table 1, the base paper,
which used the same dataset, has an overall accuracy of 0.907 (90%), whereas our approach has
an accuracy of 0.97 (97%).
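The three measurements described above reduce to the standard formulas precision = TP / (TP + FP), recall = TP / (TP + FN) and F-score = 2PR / (P + R). A small sketch with hypothetical counts (not the project's real numbers):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision: correct / total generated outcomes;
    recall: correct / total inputs;
    F-score: 2*P*R / (P + R), as described in the report."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one cluster
p, r, f = precision_recall_f1(tp=90, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.947 0.9 0.923
```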
In the second phase, we have performed the result analysis of the direct stated priority based
pair generation method, in which we have achieved an overall accuracy of 97.746%. In this
approach, we have calculated the accuracies based on the total number of comments in each stage
together with the extracted and correctly obtained pairs of those sentences.
Table 7.3: Performance measurements of the direct stated priority based pair generation
method
8.CONCLUSION
Hate speech is a form of thought, speech and social positioning that encourages violence against
different groups in society. It can be verbalized or written, and its intention is to discriminate against
people because of their differences, be they race, color, ethnicity, religion, sexual orientation,
disabilities, etc. Hate speech is based on hatred of the different and on all the prejudices that result
from that feeling. This hate speech has nowadays become very common on online social networking
websites like Facebook, Twitter and Instagram. People feel safe and secure on these platforms while
exchanging their thoughts over the discussion forums provided by such websites. For any
controversial topic, the sharing of thoughts among people can become very aggressive. Based on a
survey report carried out in Europe, it was found that many times hate crimes are provoked by such
activities. So the detection of hate and offensive speech on OSN websites was found to be a demand
of research. Many earlier studies, described in the related work section, were carried out in this
research area; in most cases they use a labeled dataset and divide it into different hateful speech
categories. In this context, we use a sentiment-sensitive, dictionary-based machine learning approach
for hate speech detection on real-time Twitter tweets, in which the real-time Twitter data are
preprocessed using weighted pattern matching.
For extracting semantic features, a punctuation-removal regular expression is used. A two-step
tokenization phase extracts sentence and unigram features from tweets. Priority-based dictionary
tagging is performed to calculate the weight of each sentence, and based on these weights the tweets
are clustered into positive, negative, hate/offensive and clean speech.
The new feature added in our work is Direct Stated Feature Extraction Pair Generation, which
provides the stated opinion for a particular feature based on Noun-Adjective pairs. Our approach
presents a tag cloud for different scenarios of each feature pair generation. Our proposed approach
has been tested on one lakh tweets; evaluation and validation over the tested data give an average
accuracy of 97% using binary clusterization, an F-score of 94% and a recall of 93%. We also provide
a comparative study of our work against the existing approach. In future work, it may be possible to
predict or detect hate speech in tweets that are only opinion based, in which the object for which the
opinion is given is absent.
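The weighting-and-clustering step summarized above can be caricatured in a few lines; the tiny priority dictionary and the thresholds below are invented for illustration and are not the project's actual sentiment-sensitive dictionary:

```python
# Hypothetical priority dictionary: word -> weight (positive > 0, negative < 0)
PRIORITY = {"love": 2, "good": 1, "bad": -1, "hate": -3}

def sentence_weight(tweet):
    """Sum the dictionary weights of each token in the tweet."""
    return sum(PRIORITY.get(w, 0) for w in tweet.lower().split())

def cluster(tweet):
    """Assign a cluster from the sentence weight (invented thresholds)."""
    w = sentence_weight(tweet)
    if w <= -3:
        return "hate/offensive"
    if w < 0:
        return "negative"
    if w > 0:
        return "positive"
    return "clean"

print(cluster("i love this good phone"))   # positive
print(cluster("hate hate speech"))         # hate/offensive
```

In the project itself, the dictionary weights come from priority-based tagging over the preprocessed tweets rather than a hard-coded table.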
9.REFERENCES
Awais M, Hassan S-U, Ahmed A (2019) Leveraging big data for politics: predicting general
election of Pakistan using a novel rigged model. J Ambient Intell Humaniz Comput 1–9
Björn G, Sikdar UK (2017) Using convolutional neural networks to classify hate speech. In:
Proceedings of the first workshop on abusive language online. Association for Computational
Linguistics, pp 85–90. https://doi.org/10.18653/v1/W17-3013
Chouchani N, Abed M (2020) Enhance sentiment analysis on social networks with social influence
analytics. J Ambient Intell Humaniz Comput 11(1):139–149
Derczynski L, Ritter A, Clark S, Bontcheva K (2013) Twitter part-of-speech tagging for all:
overcoming sparse and noisy data. In: Proc. int. conf. RANLP, pp 198–206
Edara DC, Vanukuri LP, Sistla V, Kolli VKK (2019) Sentiment analysis and text categorization of
cancer medical records with LSTM. J Ambient Intell Humaniz Comput 1–17
Hajime W, Bouazizi M, Ohtsuki T (2018) Hate speech on Twitter: a pragmatic approach to collect
hateful and offensive expressions and perform hate speech detection
Hodeghatta UR (2013) Sentiment analysis of Hollywood movies on Twitter. In: Proc. IEEE/ACM
ASONAM, pp 1401–1404
Homoceanu S, Loster M, Lof C, Balke W-T (2011) Will I like it? Providing product overviews
based on opinion excerpts. In: Proc. IEEE CEC, pp 26–33
King RD, Sutton GM (2013) High times for hate crime: explaining the temporal clustering of hate-
motivated offending. In: Criminology, pp 871–894
Kwok I, Wang Y (2013) Locate the hate—detecting tweets against blacks. In: Proc. AAAI’13, pp
1621–1622
Natasha L (2017) Facebook, Google, Twitter commit to hate speech action in Germany
Njagi DG, Zuping Z, Damien H, Long J (2015) A lexicon-based approach for hate speech
detection
Olney AM (2009) Generalizing latent semantic analysis. In: IEEE international conference on
semantic computing. IEEE
Paula F, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput
Surv 51(4):30. https://doi.org/10.1145/3232676
Peter JB (2002) A haven for hate: the foreign and domestic implications of protecting internet
hate speech under the first amendment. S Calif Law Rev 75(6):1493
Sangeetha K, Prabha D (2020) Sentiment analysis of student feedback using multi-head attention
fusion model of word and context embedding for LSTM. J Ambient Intell Humaniz Comput 1–10
Singh NK, Tomar DS, Sangaiah AK (2018) Sentiment analysis: a review and comparative analysis
over social media. J Ambient Intell Humaniz Comput 1–21
Soler JM, Cuartero F, Roblizo M (2012) Twitter as a tool for predicting elections results. In: Proc.
IEEE/ACM ASONAM, pp 1194–1200
Warner W, Julia H (2012) Detecting hate speech on the World Wide Web. In: Proceedings of the
workshop on language in social media. Association for Computational Linguistics (ACL), pp 19–
26
Ying C, Zhou Y, Zhu S, Xu H (2012) Detecting offensive language in social media to protect
adolescent online safety. In: Proceedings of the 2012 ASE/IEEE international conference on social
computing and ASE/IEEE international conference on privacy, security, risk and trust,
SOCIALCOM-PASSAT ’12. Washington, DC. IEEE Computer Society, pp 71–80.
https://doi.org/10.1109/SocialComPASSAT.2012.55
Zeerak W, Hovy D (2016) Hateful symbols or hateful people? Predictive features for hate speech
detection on Twitter. In: Proceedings of the NAACL student research workshop.
Association for Computational Linguistics, pp 88–93. https://doi.org/10.18653/v1/N16-2013
Zhao Z, Resnick P, Mei Q (2015) Enquiring minds: early detection of rumors in social media from
enquiry posts. In: Proc. int. conf. World Wide Web, pp 1395–1405.
https://www.kaggle.com/youben/twitter-sentiment-analysis/dat