
FAKE NEWS DETECTION USING

MACHINE LEARNING
A MINI PROJECT REPORT

Submitted by,
Deepthi R Gowda (1RL22CS026)
Divya P (1RL22CS028)
Gayathri S (1RL22CS036)
Keerthi D N (1RL22CS053)

Under the guidance of,


Dr. Murthy D H R
Associate Professor, Dept. of CSE, RLJIT.

in partial fulfillment for the award of the

degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING


At

R L JALAPPA INSTITUTE OF TECHNOLOGY


DODDABALLAPURA BENGALURU
DECEMBER 2024
Department of CSE

CERTIFICATE

This is to certify that the Mini Project report "FAKE NEWS DETECTION USING
MACHINE LEARNING" being submitted by "Deepthi R Gowda (1RL22CS026), Divya P
(1RL22CS028), Gayathri S (1RL22CS036), Keerthi D N (1RL22CS053)" in partial fulfilment
of the requirements for the award of the degree of Bachelor of Engineering in Computer
Science and Engineering of the Visvesvaraya Technological University, Belagavi, during
the year 2024-25 is a bonafide work carried out under my supervision.

Dr/Mr/Ms. <SUPERVISOR NAME>
DESIGNATION

Dr. Murthy D H R
MINI PROJECT COORDINATOR
Associate Professor, Dept. of CSE, RLJIT.

Dr. P VIJAYAKARTHIK,
Principal.

R L JALAPPA INSTITUTE OF TECHNOLOGY

Department of CSE
DECLARATION

We hereby declare that the work being presented in the project report entitled
"FAKE NEWS DETECTION USING MACHINE LEARNING", in partial fulfilment for the
award of the Degree of Bachelor of Engineering in Computer Science and
Engineering, is a record of our own investigations carried out under the guidance
of SUPERVISOR NAME, DESIGNATION, R L JALAPPA INSTITUTE OF TECHNOLOGY,
DODDABALLAPURA, BENGALURU RURAL.

We have not submitted the matter presented in this report anywhere for the
award of any other Degree.

Deepthi R Gowda (1RL22CS026)
Divya P (1RL22CS028)
Gayathri S (1RL22CS036)
Keerthi D N (1RL22CS053)

TABLE OF CONTENTS

LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF GRAPHS
LIST OF TABLES
ABSTRACT

1. INTRODUCTION
   1.1 Introduction
       1.1.1 Natural Language Processing
       1.1.2 Fake News Detection
   1.2 Problem Statement
   1.3 Objective
   1.4 Methodology
       1.4.1 Dataset
       1.4.2 Flowchart
       1.4.3 Algorithm

2. LITERATURE SURVEY

3. SYSTEM DEVELOPMENT
   3.1 System Configuration
   3.2 Data Pre-processing
   3.3 Design of Project
   3.4 Sample Code

4. RESULTS AND EXPERIMENTAL ANALYSIS
   4.1 Models Applied and Their Results

5. CONCLUSIONS
   5.1 Conclusions
   5.2 Future Scope

APPENDICES

REFERENCES

LIST OF ABBREVIATIONS

SHORT FORM MEANINGS

TFIDF = Term Frequency - Inverse Document Frequency

SVM = Support Vector Machine

DT = Decision Tree

GBC = Gradient Boosting Classifier

LR = Logistic Regression

RFC = Random Forest Classifier

CV = Count Vectorizer

FIG = Figure

LIST OF FIGURES

Fig. 1: Deep Learning vs Machine Learning vs Artificial Intelligence


Fig. 2: Comparison of Fake and Real news
Fig. 3: Flowchart
Fig. 4: Fake.csv and True.csv
Fig. 5: Design of the Project
Fig. 6: Importing Libraries
Fig. 7: Mounting Google Drive
Fig. 8: Fake.csv
Fig. 9: True.csv
Fig. 10: Comparing Fake and True Dataset
Fig. 11: Describing Fake and True Dataset
Fig. 12: Inserting a column “Outcome”
Fig. 13: Removing last 10 rows from both dataset for manual testing
Fig. 14: Merging the manual data frame
Fig. 15: Manual testing dataset
Fig. 16: Merging the main fake and true data frame
Fig. 17: Whitespace Tokenizer
Fig. 18: Checking the columns
Fig. 19: Removing “title”, “subject” and “date” columns
Fig. 20: Randomly Shuffling the data frame
Fig. 21: Count Vectorizer
Fig. 22: Pre-processing task of words
Fig. 23: Train-Test Split
Fig. 24: Importing for Confusion Matrix
Fig. 25: Logistic Regression
Fig. 26: Support Vector Machine
Fig. 27: Decision Tree Classifier
Fig. 28: Gradient Boosting Classifier
Fig. 29: Random Forest Classifier
Fig. 30: Testing
Fig. 31: Output
Fig. 32: Support Vector Machine (SVM)
Fig. 33: Confusion Matrix from Support Vector Machine
Fig. 34: Logistic Regression
Fig. 35: Confusion matrix from Logistic Regression
Fig. 36: Decision Tree
Fig. 37: Confusion Matrix from Decision Tree Classification
Fig. 38: Confusion Matrix from Gradient Boosting Classifier
Fig. 39: Confusion Matrix from Random Forest Classifier
Fig. 40: Web Browser Output

LIST OF GRAPHS

Graph 1: Frequent words in Fake news

Graph 2: Frequent words in Real News

Graph 3: Frequent bigrams

Graph 4: Frequency of subject of the news

Graph 5: Frequency of subject of the news

Graph 6: Fake and Real News

Graph 7: Frequency of words in fake news

Graph 8: Comparison of the accuracies of different models

LIST OF TABLES

Table 1: Classification Report of SVM

Table 2: Classification Report of LR

Table 3: Classification Report from DT

Table 4: Classification Report of GBC

Table 5: Classification Report of RFC

ABSTRACT

Fake news has become one of the major problems in today's society. It has a high
potential to change opinions and distort facts, and it can be a dangerous weapon
for influencing society.

The proposed project uses NLP techniques for detecting 'fake news', that is,
misleading news stories which come from non-reputable sources. By building models
based on supervised machine learning classifiers, fake news can be detected.
The data science community has responded by taking action against the problem.
Determining accurately whether a news item is real or fake is difficult for a
human reader. The proposed project therefore trains models on datasets vectorized
with the count vectorizer method, and their accuracy is tested using machine
learning algorithms.

In this research, we concentrate on how to spot fake news from internet news
sources. Our contribution is twofold. First, to determine what proportion of
circulating news is phoney, we use multiple datasets of real and fake news; we
provide a thorough description of their selection, justification, and approval,
as well as some exploratory analyses of the observable linguistic differences
between false and legitimate news material. Second, to create precise fake news
identifiers, we conduct a set of learning experiments and provide close
examinations of both automatic and manual detection of bogus news. Python is
used to spot fake news posted on social media.

CHAPTER-1
INTRODUCTION

1.1) Introduction

Machine learning (ML) is the study of the statistical models and methods that
computers use to perform certain tasks without explicit instructions, relying
instead on patterns and inference. It is viewed as a part of artificial
intelligence. Without explicit instructions, machine learning algorithms
construct a mathematical model from sample data, or "training data", in order
to make predictions or judgements. Machine learning has a lot in common with
computational statistics, which focuses on computer-aided prediction, and it
benefits from the ideas, practices, and fields of application that come from
the study of mathematical optimisation.

"Deep learning" refers to the number of transformations that the data goes
through. The credit assignment path (CAP) depth is especially significant for
deep learning systems. The CAP is the series of transformations from input to
output; it describes the possible causal connections between input and outcome.
For a feed-forward neural network, the depth of the CAPs is equal to the depth
of the network plus one, given that the output layer is also parameterized.
In recurrent neural networks, where a signal can pass through a layer more than
once, the CAP depth may be unlimited.

Fake news, to put it simply, is information that is untrue, regardless of
whether it appears accurate; it contains verifiably erroneous information. Many
significant companies, and even government agencies, are working to address
issues related to false news. However, given that millions of articles are
produced or purged every minute, approaches that rely on manual human detection
are neither dependable nor humanly feasible. A machine learning algorithm that
produces a trustworthy automated index score or rating for the authenticity of
various publications, and that can assess whether a news item is true or
misleading, may provide a solution to this problem.

Fig. 1: Deep Learning vs Machine Learning vs Artificial Intelligence

1.1.1) Natural Language Processing (NLP)

Natural language processing (NLP) is a branch of computer science, linguistics,
information engineering, and artificial intelligence that studies how computers
interact with human (natural) languages. Its major goal is to enable computer
programs to analyse and process large volumes of natural language data
efficiently.

1.1.2) Fake News Detection

With the rising use of social media platforms, fake news has become a severe
problem in recent years. Detecting fake news is a difficult problem that
necessitates several computational techniques, such as data mining, machine
learning, and natural language processing. This section discusses the current
state of fake news detection, along with its challenges and potential solutions.
Finally, it considers how cutting-edge technologies like blockchain and
artificial intelligence may be used in the future to improve the efficiency and
precision of fake news detection.

As a result, there is a greater need than ever for accurate and reliable
techniques to distinguish fake news. The field of fake news detection has
rapidly evolved as researchers and engineers develop techniques and tactics to
identify and combat misleading information. These methods range from manual
fact-checking by trained professionals to automated processes built on machine
learning systems that examine and classify news content.

Researching and building fake news detection is important, but it is also a
challenging and complex problem. Recognising fake news requires knowledge of
linguistic nuance, social and cultural context, and the complex network dynamics
of online communication. Despite these challenges, effective methods for
spotting false news have been established, and the area continues to develop as
new tools and technologies are created.

1.2) Problem Statement

Reading the news has both benefits and drawbacks. On the one hand, news is
actively sought and consumed because it is easily available, inexpensive, and
quickly spread. On the other hand, the same properties allow "fake news", that
is, harmful news with blatantly inaccurate material, to be widely disseminated.

As a result, research into the detection of bogus news has recently made
significant strides. Identifying fake news on the basis of content alone is
challenging and nontrivial, since it is purposefully designed to lead people to
accept incorrect information.

1.3) Objective

Our project's primary goal is to determine the veracity of news, that is,
whether it is real or phoney, by developing a machine learning model that allows
us to recognise bogus information.

Identifying fake news only on the basis of its content can be difficult and
challenging, since it is intentionally produced to influence readers to believe
false information.

By applying a range of methods and models, machine learning makes it easier to
detect bogus news. Additionally, to examine the relationships between words, we
apply deep-learning-based NLP; stop words can be eliminated using this method
as well.

1.4) Methodology

1.4.1) Dataset

Two datasets are available, and we use a combination of the two. Together, the
CSV files contain 44,898 news stories, which is a sizeable quantity: the true
dataset comprises 21,417 articles, while the fake dataset has 23,481. This data
collection is accessible at:

The dataset contains the following attributes:

▪ Id: special ID of the news article;
▪ Title: the headline of the article;
▪ Text: the body of the article;
▪ Subject: it describes the topic of the news;
▪ Date: it provides the news's publication date;
▪ Label: the conclusion on whether the information might not be trustworthy:
  0: Untrustworthy or False News
  1: Reliable or Accurate News

First of all, the dataset is quite balanced, as shown above: it contains 21,417
accurate news items and 23,481 false news pieces. This is a beneficial feature
of the dataset, as it helps the models make objective judgements.
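A minimal sketch of this loading and labelling step with pandas, assuming the two files are named Fake.csv and True.csv as in Fig. 4 (paths and the "Outcome" column name follow the figure captions and are illustrative):

```python
import pandas as pd

# Load the two CSV files (names follow Fig. 4; adjust paths as needed)
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

# Label the articles: 0 = fake/untrustworthy, 1 = true/reliable
fake["Outcome"] = 0
true["Outcome"] = 1

# Hold back the last 10 rows of each frame for manual testing (as in Fig. 13)
manual_testing = pd.concat([fake.tail(10), true.tail(10)])
fake = fake.iloc[:-10]
true = true.iloc[:-10]

# Merge into one frame and shuffle so fake and true rows are interleaved
df = pd.concat([fake, true]).sample(frac=1, random_state=42).reset_index(drop=True)
print(df["Outcome"].value_counts())  # roughly 23,471 fake vs 21,407 true
```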

Fig. 2: Comparison of Fake and Real news

At this stage the dataset still contains stop words, and the most common words
in it are "the", "to", "of", "and", etc.

The top 20 terms in the sample, before stop words were eliminated, were as
follows:

Fake.csv

Graph 1: Frequent words in Fake news


True.csv

Graph 2: Frequent words in Real News

After removal, the most frequent terms are "said", "mr", "trump", "new",
"people", and "year", which can provide the models with important information.

We also examined the bigrams in the dataset to gain a better understanding of
the news story subjects. Before stop words are removed, the topics of the news
stories are not at all clear; removing stop words makes it much simpler to
comprehend the themes of the news reports.

The graph below displays the top 20 bigrams from the dataset before stop words
are removed. As one can see, frequently used phrases like "of the", "in the",
and "to the" do not help one comprehend the content of a story.

Graph 3: Frequent bigrams

To visualize the data, we plotted the frequencies of the subjects of the news:

Graph 4: Frequency of subject of the news

1.4.2) Flowchart:

Fig. 3: Flowchart

1.4.3) Algorithm for The Proposed System
Step 1: Pre-processing
▪ Load the dataset of news items with their labels, whether they are true or
false;
▪ Clean the text by eliminating punctuation and stopwords;
▪ Divide the dataset into training and testing sets.
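A minimal sketch of Step 1, assuming the merged data frame df from Section 1.4.1 (column names are illustrative; NLTK's stopword list requires a one-time nltk.download("stopwords")):

```python
import re
import string

from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # drop punctuation
    return " ".join(w for w in text.split() if w not in stop_words)  # drop stopwords

df["text"] = df["text"].apply(clean)

# Divide the dataset into training and testing sets (75/25 split here)
x_train, x_test, y_train, y_test = train_test_split(
    df["text"], df["Outcome"], test_size=0.25, random_state=42
)
```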

Step 2: Count Vectorization


▪ Count Vectorizer from the Sklearn toolkit may be used to transform text
data into numerical data.
▪ Produce a document-term matrix showing the frequency of each word in
each document.
▪ Fit the Count Vectorizer on the training set, then transform that data.
▪ Transform the testing set with the fitted vectorizer.
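A sketch of this step with scikit-learn's CountVectorizer, continuing from the split above:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
xv_train = cv.fit_transform(x_train)  # fit on the training set, then transform it
xv_test = cv.transform(x_test)        # reuse the learned vocabulary on the test set
print(xv_train.shape)                 # (number of training documents, vocabulary size)
```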

Step 3: TFIDF Vectorization


▪ Utilise the Tfidf Vectorizer in the Sklearn package to turn the text data
into numerical data.
▪ Fit the Tfidf Vectorizer on the training set and transform that data.
▪ Create a document-term matrix that reflects the significance of each word
in each document.
▪ Transform the testing set with the fitted vectorizer.
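The TF-IDF variant is analogous; only the vectorizer class changes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
xt_train = tfidf.fit_transform(x_train)  # fit the Tfidf Vectorizer on training data
xt_test = tfidf.transform(x_test)        # transform the testing set only
```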
Step 4: Training the Models
▪ Utilise the data that has been modified by Count Vectorizer and Tfidf
Vectorizer to train a variety of models, including Naive Bayes, Logistic
Regression, Support Vector Machines (SVM), Random Forest, etc.
▪ Fit the models using the training set.
▪ Use the testing set to predict the news article labels.

▪ Determine each model's accuracy score using the actual and predicted
labels.
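A sketch of Step 4 using the TF-IDF features from Step 3; the model choices mirror the abbreviations list, and the hyperparameters shown are illustrative:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(),
    "GBC": GradientBoostingClassifier(),
    "RFC": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(xt_train, y_train)                # fit on the training set
    preds = model.predict(xt_test)              # predict labels for the testing set
    print(name, accuracy_score(y_test, preds))  # compare actual vs predicted labels
```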

Step 5: Confusion Matrix


▪ The confusion matrix displays the amount of true positives, true negatives,
false positives, and false negatives for each model, allowing you to assess
each one's performance.
▪ Measurements like precision, recall, and F1-score may be calculated from
the confusion matrix.
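A sketch of Step 5 for one of the fitted models (here the SVM from the sketch in Step 4):

```python
from sklearn.metrics import classification_report, confusion_matrix

preds = models["SVM"].predict(xt_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
print(classification_report(y_test, preds))  # precision, recall, F1-score per class
```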

Step 6: Accuracy
▪ Determine each model's accuracy by comparing its predicted labels to its
actual labels.
▪ The accuracy measures the proportion of news stories that were accurately
identified as being true or false.
▪ Evaluate the accuracy of various models to find which one is most effective
at spotting fake news.

Step 7: Representing the Output in Web Browser using Streamlit


▪ Use the Streamlit Python module to build an interactive web application for
showcasing the outcomes of false news detection models.
▪ Create a user interface that clearly displays the confusion matrices,
accuracy of each model, and other performance indicators.
▪ Provide tools that allow users to submit their own content for categorization
and display the key terms and phrases used to categorise news items, among
other capabilities.
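A minimal sketch of such a Streamlit app, assuming the fitted vectorizer and the best model were saved earlier with joblib.dump (the file names here are illustrative):

```python
# app.py -- run with: streamlit run app.py
import joblib
import streamlit as st

vectorizer = joblib.load("tfidf.joblib")  # fitted TF-IDF vectorizer (assumed saved)
model = joblib.load("model.joblib")       # fitted classifier (assumed saved)

st.title("Fake News Detection")
news = st.text_area("Paste a news article to classify")

if st.button("Classify") and news:
    features = vectorizer.transform([news])
    label = model.predict(features)[0]
    st.write("Reliable / Real News" if label == 1 else "Untrustworthy / Fake News")
```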

CHAPTER-2
LITERATURE SURVEY

The work of S. A. Ahmed, A. Abidin, M. A. Maarof, and R. A. Rashid [1] is a
survey and does not contain any experiments or findings. Instead, the study
offers a thorough analysis of the many fake news detection techniques put
forward in the literature, their advantages and disadvantages, and the datasets
employed for testing. The authors examine and contrast the methodologies used
by various studies in terms of feature selection, feature extraction,
classification algorithms, and assessment measures. They also highlight the
difficulties in the area of false news identification and potential avenues for
further study. The article draws on a number of datasets, including those from
BuzzFeed, LIAR, FakeNewsNet, and PolitiFact.

S. Asghar, S. Mahmood, and H. Kamran, "Fake news detection using machine
learning: A survey" [2] also addresses a number of datasets that have been used
in studies on the identification of fake news, including the LIAR dataset, the
Fake News Challenge dataset, and the BuzzFeed News dataset. According to the
authors, ensemble-learning-based algorithms achieved the best results on the
LIAR dataset, with accuracy rates of up to 78%. On the BuzzFeed News dataset,
on the other hand, deep-learning-based methods perform better, achieving an
accuracy of up to 91%.

J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble
learning with context and attention mechanism" [3] employ two datasets for
their experiments: the Celebrity dataset and the LIAR dataset. To capture both
local and global aspects of news items, the proposed model combines
convolutional neural networks (CNNs) with recurrent neural networks (RNNs).
The experimental findings demonstrate that the suggested model outperforms
numerous baseline models and reaches an accuracy of up to 73.7%, achieving
state-of-the-art performance on both the LIAR and Celebrity datasets.

M. F. Hossain, M. M. Islam, M. A. H. Khan, and J. J. Jung, "Fake news detection
using hybrid machine learning algorithms" [4] use the LIAR dataset, a gold
standard for research on fake news identification. It consists of statements
labelled as true or false, with additional labels for the degree of falsehood.
The suggested hybrid method combines the Support Vector Machine (SVM),
Multinomial Naive Bayes (MNB), and Random Forest (RF) machine learning
techniques. To choose the most pertinent characteristics for each algorithm,
the authors employ a feature selection technique known as Chi-Square. They then
integrate the results of the three algorithms and arrive at a final prediction
using a weighted voting system. According to the experimental findings, the
suggested hybrid technique works better than each individual algorithm and a
number of baseline models, obtaining an accuracy of up to 72.28%.

S. S. Ghosh, A. Mukherjee, and N. Ganguly, "A multi-perspective approach to
fake news detection" [5] use the FakeNewsNet dataset together with the BuzzFeed
News dataset. The authors extract content-based aspects from the news articles
using word embeddings and term frequency-inverse document frequency (TF-IDF),
and then utilise a support vector machine (SVM) classifier to determine the
veracity of the news stories based on these attributes. The experimental
findings demonstrate that the proposed multi-perspective strategy outperforms
numerous baseline models and achieves state-of-the-art performance on the
FakeNewsNet and BuzzFeed News datasets, attaining an accuracy of up to 94.7%.

CHAPTER-3
SYSTEM DEVELOPMENT

3.1) System Configuration

The project runs on standard hardware. We used an Intel i5 CPU with 8 GB of
RAM, a 2 GB Nvidia graphics processor, and 2 cores running at 1.7 GHz and
2.1 GHz, respectively. Training takes around 10-15 minutes; the test phase that
follows produces predictions and accuracy figures quickly.

3.2) Data Pre-processing

Data Missing Imputation

Missing values in datasets can be a difficulty for some machine learning
techniques. Therefore, any missing values in each column of the input data must
be found and replaced before we model the prediction problem. Missing-data
imputation is used for this: a space (' ') replaces the null value for each
attribute, instead of removing tuples containing null values.
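A one-line sketch of this imputation in pandas:

```python
print(df.isnull().sum())  # count missing values in each column
df = df.fillna(" ")       # replace nulls with a space instead of dropping tuples
```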

Removal of Stop Words

Stop words like "if", "the", "is", "a", and "an", among others, shouldn't be
given much weight by a machine learning model: they are common English
expressions and don't add to the novelty or believability of any story. Because
they occur so often, their presence in the dataset may distort the model's
predictions.
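A sketch of stop-word removal with NLTK (requires a one-time nltk.download("stopwords")):

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))  # includes "if", "the", "is", "a", "an"

def remove_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

df["text"] = df["text"].apply(remove_stopwords)
```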

Removal of Special Characters

The use of special characters in a sentence has no bearing on whether a piece
of news is accurate or not, so we eliminate all punctuation from the dataset.
Regular expressions are used to do this, and a custom function was written to
remove special characters, links, extra spaces, underscores, etc.
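A sketch of such a cleaning function using regular expressions:

```python
import re

def remove_noise(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip links first
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # punctuation, digits, underscores
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
```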

Lemmatization

The word "play" serves as the root for other words, including "playing" and
"plays". Replacing each inflected form (other tenses and participles) with its
root word makes a more meaningful analysis of term frequency possible. We
therefore substitute the root word for every term that derives from a single
root.
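A sketch using NLTK's WordNet lemmatizer (requires a one-time nltk.download("wordnet")):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text: str) -> str:
    # pos="v" maps verb forms such as "playing" and "plays" back to "play"
    return " ".join(lemmatizer.lemmatize(w, pos="v") for w in text.split())

print(lemmatize("playing plays played"))  # -> "play play play"
```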

Count Vectorization
For machine learning algorithms to accept the preprocessed text as input, it
must next be encoded as integers or floating-point values. This process is
called feature extraction (or vectorization).

Each text is represented as a vector with the same number of dimensions as our
vocabulary. If a vocabulary word is present in the text, we add one to the
corresponding dimension of the vector, adding one more for each additional
occurrence of that term and leaving zeros in the spots for words we didn't see
even once.
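A toy example showing the counting described above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news spreads fast", "real news spreads facts"]
cv = CountVectorizer()
matrix = cv.fit_transform(docs)

print(cv.get_feature_names_out())
# ['facts' 'fake' 'fast' 'news' 'real' 'spreads']
print(matrix.toarray())
# [[0 1 1 1 0 1]
#  [1 0 0 1 1 1]]
```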

TF-IDF Transformation

To create a matrix with TF-IDF values for each feature, we transform the
count-vectorized matrix. TF-IDF combines Term Frequency (TF), which is
identical to what we previously saw in the Count Vectorizer, with Inverse
Document Frequency (IDF), which down-weights words that appear in many
documents.

Word frequency alone might not be informative, because some frequent words
prove to be unimportant. Thus, we employ TF-IDF to maintain a balance between a
word's significance and its frequency within the text. The acronym TF-IDF
stands for term frequency-inverse document frequency.
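A sketch of this transformation step, assuming the count matrices xv_train and xv_test from the previous subsection:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight raw counts so frequent-but-uninformative words carry less weight
transformer = TfidfTransformer()
xv_train_tfidf = transformer.fit_transform(xv_train)
xv_test_tfidf = transformer.transform(xv_test)
```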

Fake.csv and True.csv

Fig. 4: Fake.csv and True.csv

3.3) Design of Project

Dataset: The first step is to collect or obtain a dataset of news articles, labeled as
"fake" or "real". This dataset will be used to train and evaluate the performance of
different fake news detection models.

Preprocessing: The dataset must now be cleaned up by eliminating any extraneous


or irrelevant data, including stop words, punctuation, and digits. Additionally, the
text may need to be normalised by making all characters lowercase and eliminating
any special characters or symbols.

Count Vectorizer (BOW): The Bag-of-Words (BOW) format can be used to


transform textual data into numerical characteristics after preprocessing the text.
This entails building a matrix where each row represents a news item and each
column represents a distinct term from the dataset. The value in each cell indicates
how often the term appears in the corresponding article.

Train-Test Split: Once we have the BOW matrix, we can split the data into training
and testing sets. The training set will be used to train the fake news detection model,
while the testing set will be used to evaluate the model's performance on new,
unseen data.

Text-to-vectors (TF-IDF): In addition to BOW, we can also express the textual


data using the Term Frequency-Inverse Document Frequency (TF-IDF)
representation. The frequency of the terms in each article as well as their frequency
throughout the whole dataset is taken into consideration in this representation. This
helps to downplay terms that are prevalent across the whole dataset and to
emphasise words that are exclusive to a certain article.

Models: After obtaining the numerical features from the text data, several machine
learning methods such as logistic regression, decision trees, or neural networks can
be employed to train a fake news detection model. The objective of the model is to
learn a function that can accurately classify news stories as either "real" or "fake"
based on the derived attributes from the text.

Accuracy and Confusion Matrix: It's crucial to assess the false news detection
model's performance on the testing set after we've trained it. By assessing its
accuracy, precision, recall, and F1 score, we may do this. To see how many true
positives, true negatives, false positives, and false negatives the model produces,
we may also develop a confusion matrix.

Testing: After assessing the model's performance, we may use it to classify
new, previously unseen news articles as "real" or "fake". This entails applying
the same preprocessing and feature extraction operations to the fresh data that
we used during training. After that, we can apply the trained model to the
cleaned-up data to produce a classification label.

Result: The Streamlit Python library is used to present the result in a web
browser, where the user inputs a news item and the algorithm reports whether it
is "Real" or "Fake".

Fig 5: Design of the Project

3.4) Sample Code

Fig. 6: Importing Libraries

Fig. 7: Mounting Google Drive

Fig. 8: Fake.csv

Fig. 9: True.csv

Fig. 10: Comparing Fake and True Dataset

Fig. 11: Describing Fake and True Dataset

Pre-processing of Dataset

Fig. 12: Inserting a column “Outcome”

Fig. 13: Removing last 10 rows from both dataset for manual testing

Fig. 14: Merging the manual data frame

Fig. 15: Manual_testing dataset

Fig. 16: Merging the main fake and true dataframe

Graph 5: Frequency of subject of the news

Graph 6: Fake and Real News

Fig. 17: WhitespaceTokenizer

Graph 7: Frequency of words in fake news

Fig. 18: Checking the columns

Fig. 19: Removing “title”, “subject” and “date” columns

Fig. 20: Randomly Shuffling the data frame

Fig. 21: Count Vectorizer

Fig. 22: Pre-processing task of words

Train-Test Split

Fig. 23: Train-Test Split

Fig. 24: Importing for Confusion Matrix

Models

Fig. 25: Logistic Regression

Fig. 26: Support Vector Machine

Fig. 27: Decision Tree Classifier

Fig. 28: Gradient Boosting Classifier

Fig. 29: Random Forest Classifier

Graph 8: Comparison of the accuracies of different models

Testing

Fig. 30: Testing

Sample Input

Fig. 31: Output

CHAPTER-4
RESULTS AND EXPERIMENTAL ANALYSIS

4.1) Models Applied and Their Results

Support Vector Machine (SVM)

▪ Support Vector Machine (SVM) is one of the most widely used supervised
learning techniques, applicable to both classification and regression
problems. It is mostly used, however, for classification problems in
machine learning.
▪ SVM chooses the extreme points and vectors that help build the hyperplane.
These extreme cases are called support vectors, and they are the foundation
of the SVM approach. Take a look at the image below, where a decision
boundary or hyperplane is used to classify two separate categories:

Fig. 32: Support Vector Machine (SVM) [6]
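A sketch of fitting an SVM on the TF-IDF features from Chapter 3; LinearSVC is one reasonable choice for high-dimensional sparse text data, though the exact settings used in the screenshots are not shown here:

```python
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

svm = LinearSVC()  # a linear kernel suits high-dimensional TF-IDF features
svm.fit(xt_train, y_train)
print(accuracy_score(y_test, svm.predict(xt_test)))  # ~0.993 reported in Chapter 5
```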

Below are the Results from applying Support Vector Machine model:

Table 1: Classification Report of SVM

Confusion Matrix:

Fig. 33: Confusion Matrix from Support Vector Machine

Logistic Regression

▪ Logistic regression is a frequently used approach for binary classification
problems, where the goal is to predict one of two outcomes. Through a
sigmoid function, it converts the output of a linear regression into a
probability value between 0 and 1, which can then be thresholded to produce
a classification.
▪ This simple yet reliable algorithm trains well on big datasets and has
applications in many areas, including credit scoring, spam filtering, and
medical diagnosis. However, because it depends on certain assumptions, such
as the linearity and independence of the features, it may not work well with
highly correlated or nonlinear data.

Fig. 34: Logistic Regression [7]
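A sketch showing the probability-plus-threshold behaviour described above, on the same TF-IDF features:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(xt_train, y_train)

# The sigmoid output is a probability in [0, 1]; 0.5 is the usual decision threshold
probs = lr.predict_proba(xt_test)[:, 1]
preds = (probs >= 0.5).astype(int)
```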

Below are the Results from applying Logistic Regression model:

Table 2: Classification Report of LR

Confusion Matrix:

Fig. 35: Confusion matrix from Logistic Regression

Decision Tree Classification

▪ Decision tree classification is a popular machine learning approach for
both binary and multi-class classification tasks. The input data are
recursively divided into subgroups based on the most informative feature.
▪ Decision trees can handle both categorical and numerical data and are
simple to understand and use. Additionally, they are resistant to noise and
missing data and are capable of capturing intricate non-linear correlations
between features.

Fig. 36: Decision Tree [8]
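A sketch of a decision tree on the same features; export_text prints the learned split rules, which illustrates how interpretable this model is:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(random_state=42)
dt.fit(xt_train, y_train)

# Print the first few levels of learned split rules
print(export_text(dt, max_depth=2))
```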

Below are the Results from applying Decision Tree Classification model:

Table 3: Classification Report from Decision Tree

Confusion Matrix:

Fig 37: Confusion Matrix from Decision Tree Classification

Gradient Boosting Classifier

▪ Gradient Boosting Classifier is a powerful algorithm for both classification


and regression problems. It works by combining multiple weak models, such
as decision trees, to create a strong ensemble model.
▪ One of the advantages of Gradient Boosting Classifier is that it can handle
complex non-linear relationships between features and the target variable.
Additionally, it has a built-in mechanism for handling missing data and can
automatically select important features for better accuracy. However, it can
be computationally expensive and prone to overfitting if not tuned properly.
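A sketch with illustrative hyperparameters; a modest learning rate and shallow trees are the usual levers for limiting overfitting:

```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbc.fit(xt_train, y_train)
print(gbc.score(xt_test, y_test))  # mean accuracy on the testing set
```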

Below are the Results from applying Gradient boosting classifier model:

Table 4: Classification Report of GBC

Confusion Matrix:

Fig. 38: Confusion Matrix from Gradient Boosting Classifier

Random Forest Classifier

▪ As the name implies, a Random Forest consists of numerous independent
decision trees that work together as an ensemble. Each tree in the Random
Forest produces a class prediction, and the class that receives the most
votes becomes the prediction of our model.
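A sketch of the ensemble vote; n_estimators controls how many trees take part (the value here is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 independent trees each vote; the majority class becomes the prediction
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(xt_train, y_train)
print(rfc.score(xt_test, y_test))  # ~0.987 reported in Chapter 5
```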

Below are the Results from applying Random Forest Classifier model:

Table 5: Classification Report of RFC

Confusion Matrix:

Fig. 39: Confusion Matrix from Random Forest Classifier

Sample Input:

Fig. 40: Web Browser Output

CHAPTER-5
CONCLUSIONS

5.1) Conclusions

Considering the accuracy scores we established for the various models, all of
them do a good job of identifying false news items. The Decision Tree and
Gradient Boosting classifiers notably achieved a very high accuracy of 99.5%,
with the SVM close behind at 99.31%, while the Random Forest Classifier
performed slightly lower, at 98.7%.

All things considered, these results suggest that a range of classifiers may be
used with comparable success rates and that machine learning techniques can be
extremely effective in spotting bogus news. It is important to keep in mind
that accuracy is only one measure: the models should also be evaluated using
metrics such as precision, recall, and F1-score, in addition to factors like
interpretability, scalability, and processing requirements. Investigating
different feature extraction and selection methods, classifier types, and
ensemble approaches may also be useful to see whether even better results can
be produced.

We utilised the real and fake datasets, which contain 21,417 and 23,481 entries,
respectively. We converted the text into a numerical model using the TF-IDF
Vectorizer and obtained the following results:

▪ Support Vector Machine: 99.31% accuracy
▪ Decision Tree: 99.5% accuracy
▪ Gradient Boosting Classifier: 99.5% accuracy
▪ Random Forest Classifier: 98.7% accuracy

5.2) Future Scope

The field of false news detection offers abundant possibilities for future
research and advancement. Future efforts to identify bogus news may go in the
following directions:

Including more varied and subtle features: Current methods for detecting false
news mostly rely on simple text-based features such as TF-IDF vectors or
bag-of-words. Future research could concentrate on more complex and diverse
aspects, such as sentiment analysis, network analysis, or multimedia analysis
(for instance, identifying false images or videos).

Creating more interpretable models: Existing methods for spotting fake news
often rely on complex machine learning algorithms that can be difficult to
comprehend. In the future, it would be beneficial to develop more intelligible
models that provide more insight into how the model makes its decisions.

Combining information from other sources: In addition to social media, news
articles, and videos, fake news is regularly spread through other media
channels and platforms. Developing methods that can incorporate data from
several sources may be crucial to improving false news identification in the
future.

Adapting to shifting strategies: Fake news detection technologies will have to
develop alongside the tactics used by those who create and spread fake news.
This may require the detection methods to be regularly reviewed and improved.

APPENDICES

The following phrases and words are frequently used in legitimate news
and can be used to spot fake news:

▪ Sources
▪ Evidence
▪ Experts
▪ Research
▪ Statistics
▪ Facts
▪ Data
▪ Quotes
▪ Corroborate
▪ Verification
▪ Objective
▪ Impartial
▪ Reliable
▪ Credible
▪ Transparency
▪ Context
▪ Timeliness
▪ Accuracy
▪ Impartial reporting
▪ Several perspectives

The following commonly used words and phrases may indicate the presence
of fake news:

▪ Allegedly
▪ Supposedly
▪ Claims
▪ "Fake news" or "hoax"
▪ Conspiracy
▪ Unverified
▪ Sensational
▪ Emotional
▪ Outrageous
▪ Shocking
▪ Clickbait
▪ Exaggerated
▪ Biased
▪ Partisan
▪ Misleading
▪ Inaccurate
▪ Unsubstantiated
▪ Rumors
▪ Speculation
▪ Opinions presented as facts.

REFERENCES

[1] S. A. Ahmed, A. Abidin, M. A. Maarof, and R. A. Rashid, "Fake news
detection: A survey," IEEE Access, vol. 9, pp. 113051-113071, 2021. doi:
10.1109/ACCESS.2021.3104178

[2] S. Asghar, S. Mahmood, and H. Kamran, "Fake news detection using machine
learning: A survey," IEEE Access, vol. 9, pp. 57613-57639, 2021. doi:
10.1109/ACCESS.2021.3075392

[3] J. H. Kim, S. H. Lee, and H. J. Kim, "Fake news detection using ensemble
learning with context and attention mechanism," IEEE Access, vol. 9, pp.
27569-27579, 2021. doi: 10.1109/ACCESS.2021.3057736

[4] M. F. Hossain, M. M. Islam, M. A. H. Khan, and J. J. Jung, "Fake news
detection using hybrid machine learning algorithms," IEEE Access, vol. 8,
pp. 233350-233364, 2020. doi: 10.1109/ACCESS.2020.3041149

[5] S. S. Ghosh, A. Mukherjee, and N. Ganguly, "A multi-perspective approach to
fake news detection," IEEE Intelligent Systems, vol. 35, no. 5, pp. 31-39,
2020. doi: 10.1109/MIS.2020.3012

[6] https://www.google.co.in/imgres?imgurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fstatic.javatpoint.com%2Ftutorial%2Fmachine-learning%2Fimages%2Flogistic-regression-inmachinelearning.png&tbnid=LuaHnfur76i8eM&vet=12ahUKEwjFoPGSruDAhVNnNgFHUjLCl8QMygCegUIARDjAQ..i&imgrefurl=https%3A%2F%2Fwww.javatpoint.com%2Flogisticregressioinmachinelearning&docid=makIlDmuc8naWM&w=500&h=300&itg=1&q=logistic%20regression&ved=2ahUKEwjFoPGSruD-AhVNnNgFHUjLCl8QMygCegUIARDjAQ

[8] https://www.google.co.in/imgres?imgurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fstatic.javatpoint.com%2Ftutorial%2Fmachine-learning%2Fimages%2Flogistic-regression-inmachinelearning.png&tbnid=LuaHnfur76i8eM&vet=12ahUKEwjFoPGSruDAhVNnNgFHUjLCl8QMygCegUIARDjAQ..i&imgrefurl=https%3A%2F%2Fwww.javatpoint.com%2Flogisticregressioinmachinelearning&docid=makIlDmuc8naWM&w=500&h=300&itg=1&q=logistic%20regression&ved=2ahUKEwjFoPGSruD-AhVNnNgFHUjLCl8QMygCegUIARDjAQ

[9] https://www.google.co.in/url?sa=i&url=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fwww.geeksforgeeks.org%2Fdecision-tree%2F&psig=AOvVaw0sYuRq-TZe0WWhW-9YQUnl&ust=1683450911500000&source=images&cd=vfe&ved=0CBEQjRxqFwoTCLDwi7qt4P4CFQAAAAAdAAAAABAE
