D13_Project Report

The project report titled 'Fake News Detection Using Machine Learning' outlines a study conducted by students at Siksha 'O' Anusandhan University, focusing on the challenges of identifying fake news in the digital age. It explores various machine learning algorithms, particularly emphasizing the Decision Tree algorithm for its accuracy in detection. The report includes acknowledgments, individual contributions, and a thorough examination of existing systems, methodologies, and datasets used for the project.

FAKE NEWS DETECTION USING

MACHINE LEARNING
A Project Report

Submitted by:

Ashutosh Kumar (2041011113)


Aditi Rath (2041018064)
Ashutosh Sarangi (2041019145)
Indrajit Das (2041004164)

in partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


Faculty of Engineering and Technology, Institute of Technical Education and Research
SIKSHA ‘O’ ANUSANDHAN (DEEMED TO BE) UNIVERSITY
Bhubaneswar, Odisha, India
(June 2024)
CERTIFICATE

This is to certify that the project report titled “FAKE NEWS DETECTION USING
MACHINE LEARNING” being submitted by Ashutosh Kumar, Aditi Rath, Ashutosh
Sarangi, Indrajit Das of section ‘D’ to the Institute of Technical Education and Research,
Siksha ‘O’ Anusandhan (Deemed to be) University, Bhubaneswar for the partial fulfillment
for the degree of Bachelor of Technology in Computer Science and Engineering is a record of
original bona fide work carried out by them under my/our supervision and guidance. The project
work, in my/our opinion, has reached the requisite standard fulfilling the requirements for the
degree of Bachelor of Technology.

The results contained in this project work have not been submitted in part or full to any other
University or Institute for the award of any degree or diploma.

(Name and signature of the Project Supervisor)


Department of Computer Science and Engineering
Faculty of Engineering and Technology;
Institute of Technical Education and Research;
Siksha ‘O’ Anusandhan (Deemed to be) University

ACKNOWLEDGMENT

We would like to thank Dr. Prativa Das, our project supervisor, from the bottom of our hearts
for all of her help and support during our group project. Her knowledge, perceptions, and
priceless comments have been extremely helpful in forming our project and guaranteeing its
triumphant conclusion. Her patience, attention, and dedication to our education and
development are deeply appreciated.

We express our gratitude to Siksha ‘O’ Anusandhan (Deemed to be) University for providing
the facilities, resources, and infrastructure that were critical to our joint project's success. Our
capacity to conduct research, develop, and collaborate is a result of the institution's
commitment to fostering an atmosphere that promotes academic success and inquiry.

We also acknowledge and express our gratitude to the other group members for their efforts,
cooperation, and project-related contributions. Their varied backgrounds, viewpoints, and
commitment have greatly aided in our project's overall success. Our capacity to collaborate as
a team and overcome challenges has enabled us to achieve our objectives.

Finally, we would like to thank everyone and every organization that has contributed to our
collaborative effort through conversations, criticism, or other forms of aid. Your help and
encouragement have been critical to our growth and achievement of our purpose. We have
developed tremendously as a consequence of your support and encouragement, and our
project has been finished successfully.

Place: Signature of Students

Date:

DECLARATION

We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented, fabricated, or falsified any idea, fact, or source in our
submission. We understand that any violation of the above will be cause for disciplinary action
by the University and can also evoke penal action from the sources which have not been
properly cited or from whom proper permission has not been taken when needed.

Signature of Students with Registration Numbers


Date: ___________

2041018064

2041018064

2041011113

2041004164

REPORT APPROVAL

This project report titled “FAKE NEWS DETECTION USING MACHINE LEARNING”
submitted by Ashutosh Kumar (2041011113), Aditi Rath (2041018064), Ashutosh
Sarangi (2041019145), Indrajit Das (2041004164) is approved for the degree of Bachelor of
Technology in Computer Science and Engineering.

Examiner(s)

________________________
________________________
________________________

Supervisor

________________________

Project Coordinator
________________________

PREFACE

Fake news, a term denoting misinformation or disinformation spread via various media
channels, has become a pervasive issue in today's digital era. The importance of detecting and
combating fake news cannot be overstated, as it can distort public perception, influence
elections, and incite social unrest. This report explores the critical need for effective fake news
detection mechanisms. The main issues include the rapid dissemination of false information,
the sophisticated nature of modern fake news, and the challenge of distinguishing it from
genuine news. To address these issues, various algorithms have been employed, such as
Natural Language Processing (NLP) techniques, machine learning models like Naive Bayes,
Support Vector Machines (SVM), Decision Tree, Logistic Regression and Random Forest
algorithms. Among these, the DT algorithm demonstrated the highest accuracy and robustness
in detecting fake news in our experiments, making it the preferred choice.

INDIVIDUAL CONTRIBUTIONS

Ashutosh Kumar: Introduction, Project Overview, Motivation and Uniqueness.

Aditi Rath: Literature Survey, Material and Methods, Model Diagram, Methods Used, Tools Used, Evaluation Measures Used, Result, Experimental Outcomes.

Ashutosh Sarangi: System Specifications, Parameters Used.

Indrajit Das: Existing System, Problem Outcomes.

TABLE OF CONTENTS
Title Page i
Certificate ii
Acknowledgment iii
Declaration iv
Report Approval v
Preface vi
Individual Contributions vii
Table of Contents viii
List of Figures ix
List of Tables x

1. INTRODUCTION 1
1.1 Introduction 1
1.2 Project Overview 1
1.3 Motivation(s) 2
1.4 Uniqueness of the Work 2
1.5 Report Layout 2

2. LITERATURE SURVEY 3
2.1 Existing System 3
2.2 Problem Identification 4

3. METHODS 4
3.1 Dataset(s) Description 5
3.2 Model Diagram 9
3.3 Methods 9
3.4 Libraries 12
3.5 Evaluation Measures 13

4. EXPERIMENTATION AND RESULTS 14


4.1 System Specification 15
4.2 Parameters Used 15
4.3 Results and Outcomes 17
4.4 Result Analysis and Validation 19
5. CONCLUSIONS 19
6. REFERENCES 20
7. REFLECTION OF THE TEAM MEMBERS ON THE PROJECT 21
8. SIMILARITY REPORT 23

LIST OF FIGURES

NO FIGURE NAME PAGE NO

1 Representing Fake Datasets 5
2 Representing True Datasets 6
3 Frequent words in fake news 6
4 Frequent words in true news 7
5 Article per subject 8
6 News percentage representing Pie-Chart 8
7 Model Diagram of Fake news Recognition 9
8 Confusion matrix 18
9 User Interface for prediction of fake news 19

LIST OF TABLES

NO TABLE NAME PAGE NO

1 Performance Metrics of the Classifiers 17

1. INTRODUCTION

1.1 Introduction

This introduction provides a concise overview of the project's requirements and
objectives. It outlines the issues or challenges the project aims to address, emphasizing
the initiative's motivations. The originality of the work is underscored, showcasing its
innovative aspects and contributions to the field. Additionally, the report layout section
offers a roadmap for the reader, briefly describing the organizational structure of the
report and guiding them through its various sections.

1.2 Project Overview

Due to the increased density of international information exchange, the average
individual nowadays struggles to distinguish between true and fake news. Users of online
social networks are quickly influenced by the deceptive language used in fake news,
which has already had a profound effect on offline society. To increase the reliability of
information in online social networks, it is crucial to swiftly identify fake news. This
study addresses the problems caused by the elusive characteristics of fake news and the
complex relationships between news items, producers, and subjects. Machine learning-
based methods for identifying false news can help mitigate the harmful effects of
misinformation by providing a more accurate and efficient way to verify the reliability of
news sources.

Numerous examples exist of supervised and unsupervised learning algorithms being used
to categorize text within current fake news corpora. However, most research focuses on
specific datasets or domains, with the political domain being particularly prominent.
Consequently, algorithms trained on a specific type of article do not perform optimally
when exposed to articles from different domains. Developing a general algorithm that
performs well across all news domains is challenging due to the varying textual structures
of articles from different domains.

In this research, we propose a machine learning ensemble strategy to address the issue of
fake news detection. Our study examines various textual characteristics that can
distinguish between authentic and fraudulent content. We train several different machine
learning algorithms using a variety of ensemble methods that are not well explored in the
existing literature. These methods enable the effective and efficient training of various
machine learning algorithms. Additionally, we conducted thorough tests on four real-
world datasets that are freely accessible to the public.

1.3 Motivation

The goal is to identify news articles or other materials that make false or deceptive
claims. Fake news detection systems are crucial in curbing the rapid spread of
misinformation through social media platforms and other communication channels by
educating consumers about the characteristics and indicators of fake news. These
technologies enable people to consume news and information safely. Ensemble learners
have proven effective in numerous applications, as these learning models tend to reduce
error rates by utilizing strategies like bagging and boosting.

1.4 Uniqueness of the Work

Various algorithms are used by different fake news detection systems to recognize and
categorize fake news, and the algorithm selected has a big influence on the system's
accuracy. Numerous data sources, such as social media sites, news websites, and fact-
checking databases, are accessible to these systems. These algorithms employ a variety of
features, including linguistic ones like syntax and grammar as well as semantic ones like
word choice and context, to detect fake news.

1.5 Report Layout

This report is organized into sections so that details are easy to find. Section 1
presents the introduction; Section 2 covers the literature survey; and Section 3 presents
our proposed model and methods. Section 4 reports the experimental results together with
the statistical measures used to evaluate them. Section 5 concludes the report with
potential future considerations.
2. LITERATURE SURVEY

This part examines the systems and solutions that are currently in use and are important
to the project. It also gives an overview of prior attempts and their flaws. This review
describes the difficulties with the current systems and provides a foundation for
identifying the problems that the project seeks to solve.

2.1 Existing System

In [1], the authors comprehensively compare high-performing models and their
characteristics for fake news detection using both machine learning and deep learning
algorithms. In [2], a transformer-based approach is proposed for fake news detection,
focusing on both news content and social contexts. This study employs Transformer-
based Encoder and Decoder models, achieving superior accuracy in a matter of minutes
using the LIAR and FakeNewsNet datasets. Patil et al. [3] investigate the effectiveness
of several machine learning algorithms, including Naive Bayes, SVM, and Passive
Aggressive Classifier, for fake news detection. Using a dataset containing both real and
fake news, the SVM model achieved an accuracy of 95.05%. In [4], Khanam et al.
explore various machine learning approaches on the LIAR dataset. They employ
algorithms such as Random Forest, SVM, Decision Tree, Naive Bayes, and XGBoost,
achieving an accuracy of over 75%. Ahmad et al. [5] focus on the use of ensemble
methods for fake news detection, utilizing the Kaggle and ISOT Fake News datasets.
Their study employs algorithms including Random Forest (RF), Linear SVM (LSVM),
K-Nearest Neighbors (KNN), and Logistic Regression (LR), with the RF algorithm
reaching a 99% accuracy rate. In [6], Goswami et al. evaluate multiple machine learning
techniques using the LIAR dataset. Their study, published on SSRN, employs methods
such as XGBoost, Random Forest (RF), AdaBoost, ExtraTrees, and Bagging, with the
Bagging Classifier and AdaBoost achieving an accuracy of 70%.

2.2 Problem Identification

Building a trustworthy fake news detection system faces many challenges. One of the
main challenges is that different studies use different databases; therefore, there is no
uniform dataset. It is challenging to compare system performance correctly because of
this lack of standardization.

Furthermore, the sheer volume and speed of online information present significant
hurdles. The rapid diffusion of information and the large volume of content produced and
shared make it challenging to stay up to date on the latest news and verify its accuracy
before it circulates widely.

Additionally, the deliberate production and dissemination of misleading information by
those with ulterior motives complicate the problem further. Bad actors may use deceptive
tactics, such as creating fake social media accounts or manipulating visual media, to alter
public opinion or advance their own objectives through the spread of false information.

In conclusion, the lack of standardized datasets, the vast amount and speed of information
available online, and the existence of deliberate disinformation efforts by people with
hidden agendas are the challenges in creating a system to identify fake news. Developing
trustworthy methods for spotting and stopping fake news requires addressing these
problems.

3. METHODS

The materials and methods section includes a brief description of the datasets used, as
well as a synopsis of their features. A schematic layout or model diagram is also included
to illustrate the system's or model's structure. A brief description of the project's
methodologies is provided, with a focus on the key algorithms or techniques employed.
The project's technology stack, including any tools or software utilized, is explained.
Furthermore, the evaluation metrics used to assess the efficacy of the project's
solution are examined.

3.1 Dataset Description

The datasets used for this investigation are freely available online and are open source.
They include news stories from various domains, both fake and genuine. Fake news
websites present unsupported statements, while authentic news articles provide accurate
accounts of real events. Many of the political statements in these articles can be manually
verified using fact-checking websites like politifact.com and snopes.com. We now
describe the datasets used in our work. We acquired the news
article-based datasets from Kaggle [6]. Each article is labeled as either “fake” or “true.”
The dataset includes the title, text, subject, and date of each article. The title is the
headline of the news piece; the text is the main content, detailing the news’s focus; the
subject indicates the nature of the news; and the date shows the publication date.

Figure 1. Representing Fake Datasets

Figure 1 shows that the fake news dataset has the shape (23481, 4), which indicates that
it has 23481 rows and 4 columns.

Figure 2. Representing True Datasets

Figure 2 shows that the true news dataset has the shape (21417, 4), which indicates that
it has 21417 rows and 4 columns.

Figure 3. Frequent words in fake news

Figure 4. Frequent words in true news

Figures 3 and 4 above show the most frequent words in the fake and true news datasets,
respectively. Frequent words can help reveal the recurring deceptive patterns that fake
news articles tend to feature. We first preprocessed the text by removing punctuation.
We then tokenized the text into individual words. The frequency of each term in the
dataset is counted, and the counts are grouped by the label of the news story, i.e.,
authentic or fraudulent. In this way, we identify the words that appear most often in
each class. Analyzing these frequently used terms makes it easier to understand the
common concepts, themes, or topics related to fake news.
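
The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the four toy articles, the `tokenize` helper, and the lowercasing step are assumptions made for the example.

```python
import string
from collections import Counter

# Hypothetical toy articles standing in for the real dataset rows.
articles = [
    ("Breaking: Aliens endorse the senator, sources say!", "fake"),
    ("Shocking! The senator hides aliens, sources claim.", "fake"),
    ("The senate passed the budget bill on Tuesday.", "true"),
    ("Officials said the budget vote followed a long debate.", "true"),
]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Count term frequencies separately for each label.
frequencies = {"fake": Counter(), "true": Counter()}
for text, label in articles:
    frequencies[label].update(tokenize(text))

# The most common terms per label are what a frequency plot would display.
top_fake = frequencies["fake"].most_common(3)
top_true = frequencies["true"].most_common(3)
print(top_fake)
print(top_true)
```

Keeping one counter per label makes it straightforward to compare which terms dominate each class.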

Figure 5. Article per subject

Figure 5 shows how many articles are available for each subject. The articles are divided
into the following categories: government news, Middle East news, normal news, US
news, leftover news, political news, and world news.

Figure 6. News percentage

Figure 6 shows the share of fake and true articles in the dataset: 23481 articles, or
52% of the total, are fake, while 21417 articles, or 48% of the total, are true.
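
As a quick check of these proportions, the percentages can be recomputed from the two row counts reported for Figures 1 and 2. The variable names below are illustrative only.

```python
# Row counts reported for the fake and true datasets (Figures 1 and 2).
fake_count = 23481
true_count = 21417
total = fake_count + true_count

# Share of each class, as a percentage of all articles.
fake_pct = 100 * fake_count / total
true_pct = 100 * true_count / total
print(f"total={total}, fake={fake_pct:.1f}%, true={true_pct:.1f}%")
```

The two classes are close to balanced, which is worth keeping in mind when reading the accuracy figures later in the report.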

3.2 Model Diagram

The process of developing a system to detect false news is illustrated in Figure 7. It
entails steps including data preparation, data splitting, decision tree classifier use, TF-
IDF feature extraction, and performance evaluation utilizing metrics and a confusion
matrix. The flow of these phases is depicted, and the essential components engaged in
each step of the process are highlighted visually.

Figure 7. Model Diagram

3.3 Methods

To achieve a successful implementation, various approaches are utilized. The initial stage
is data preprocessing, which involves converting raw data into a comprehensible
format suitable for training machine learning models. Several methods are used in this
transformation, including managing missing data, normalizing the data, feature scaling,
handling outliers, feature extraction, and data partitioning. Preprocessing ensures that
the data is suitable for analysis, leading to accurate results and enhancing the ML models'
performance. It also enables data scientists to make informed decisions and to reduce
errors and inconsistencies.

First, we concatenate our data and then remove non-essential columns to streamline
the dataset. We standardize the text by converting it to a single case, since
capitalization may differ between sources and can lead to duplicate or inconsistent values
if left unstandardized. We also remove punctuation, which usually carries little value for
text analysis tasks.

Features are measurable properties or aspects of the data that influence the model's
learning process and provide crucial information for accurate classifications or
predictions. We use TF-IDF, which measures the significance of terms within a document
relative to a collection of documents, aiding text analysis by highlighting important
terms. Next, we split our data into training and testing sets. Our workflow involves
defining and evaluating a decision tree classifier. The pipeline includes three steps: the
Count Vectorizer, which transforms text data into word count matrices; the TF-IDF
Transformer, which applies TF-IDF weights; and the DT-Classifier, which trains a decision
tree classifier on the TF-IDF weighted word counts.
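
A three-step pipeline of this kind might look roughly like the sketch below. It uses scikit-learn's `Pipeline`, `CountVectorizer`, `TfidfTransformer`, and `DecisionTreeClassifier`; the toy corpus, the step names (`vect`, `tfidf`, `clf`), and `random_state=0` are assumptions for illustration and not the project's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus; the real project trains on the Kaggle articles.
train_texts = [
    "aliens endorse senator in secret meeting",
    "shocking miracle cure hidden by doctors",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
]
train_labels = ["fake", "fake", "true", "true"]

model = Pipeline([
    ("vect", CountVectorizer()),      # text -> word count matrix
    ("tfidf", TfidfTransformer()),    # word counts -> TF-IDF weights
    ("clf", DecisionTreeClassifier(random_state=0)),  # tree on TF-IDF features
])
model.fit(train_texts, train_labels)

# Predicting on the training texts only confirms the pipeline runs end to end.
predictions = model.predict(train_texts).tolist()
print(predictions)
```

Wrapping the steps in a single pipeline means the same vectorization and weighting are applied to training and test data without extra bookkeeping.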

To evaluate fake news detection classifiers, we used the Decision Tree (DT) algorithm, a
supervised learning method that predicts outcomes by recursively splitting data into
subsets based on the most significant features.
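
To make the idea of choosing a split concrete, the sketch below scores candidate splits with Gini impurity and keeps the purest one. The report does not state which splitting criterion was used, so Gini is an assumption here, and the feature values, labels, and the fixed threshold of 0.5 are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical TF-IDF-like features: [weight of "shocking", weight of "senate"].
rows = [[0.9, 0.6], [0.7, 0.1], [0.0, 0.8], [0.1, 0.6]]
labels = ["fake", "fake", "true", "true"]

# Score a split at 0.5 on each feature and keep the one with the lowest impurity.
candidates = [(split_impurity(rows, labels, f, 0.5), f) for f in range(2)]
best_impurity, best_feature = min(candidates)
print(best_feature, best_impurity)
```

A full decision tree repeats this search recursively on each resulting subset, over many thresholds, until a stopping condition is met.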

With an assumption of predictor independence, Naive Bayes is a probabilistic machine
learning algorithm based on Bayes' Theorem. It is frequently applied to classification
problems, such as the identification of false news.

3.3.1 Support Vector Machine: - The SVM algorithm's goal is to locate a hyperplane
that separates, as far as possible, the data points of one class from those of another.
Such a separating hyperplane exists only for linearly separable problems; for most
real-world problems, the algorithm optimizes a soft margin, permitting a limited number of
misclassifications. The subset of training observations that determines the location of
the separating hyperplane is referred to as the support vectors.

3.3.2 Logistic Regression: - Logistic Regression (LR) is the appropriate regression
analysis to conduct when the dependent variable is dichotomous (binary). Like all
regression analyses, logistic regression is a predictive analysis. It describes data and
explains the relationship between one dependent binary variable and one or more
nominal, ordinal, interval, or ratio-level independent variables.

3.3.3 Random Forest: - Another popular machine learning algorithm for a variety of
tasks, including the identification of false news, is Random Forest (RF). This kind of
ensemble learning technique constructs several decision trees and combines them to get a
forecast that is more reliable and accurate. Using bootstrapping, a random subset of
features and a random subset of data are used to train each tree.
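
The two ingredients mentioned above, bootstrapped training samples and combining the trees' outputs, can be sketched as follows. This is a simplified illustration: the row indices and per-tree votes are hypothetical, and majority voting is only one way to combine trees (libraries may instead average predicted probabilities).

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # hypothetical row indices of a training set
sample = bootstrap_sample(rows, rng)  # each tree would train on its own sample

# Hypothetical per-tree predictions for a single article.
tree_votes = ["fake", "true", "fake", "fake", "true"]
forest_prediction = majority_vote(tree_votes)
print(sample, forest_prediction)
```

Because every tree sees a different resample of the data, their individual errors tend to differ, which is what makes the combined prediction more stable.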

The final stage is prediction. After training on historical data, the algorithm predicts
outcomes based on new information. We assess performance using metrics like F1 score,
recall, precision, and accuracy. A confusion matrix, which displays predicted versus
actual values, helps evaluate the classifier's performance. The model's structure is
paramount in this process.

3.4 Libraries

Key Python modules used in the project include matplotlib for data visualization, NumPy
for numerical computations, Pandas for data analysis and manipulation, and Scikit-Learn
for machine learning tasks. These technologies make it possible for efficient data
processing, analysis, visualization, and machine learning algorithm application to achieve
the project's objectives.

3.4.1 NumPy

NumPy, a core Python module, plays a vital role in fake news detection programs by
providing efficient numerical computations and array operations. Its array data structure
enhances data representation, allowing for the storage and manipulation of multi-
dimensional arrays.

3.4.2 Pandas

Pandas plays a crucial role in fake news detection by efficiently handling and transforming
datasets. It provides a high-level data structure called a data frame, which simplifies
organizing, exploring, and preprocessing the dataset. Overall, Pandas is a key component
of the code, streamlining dataset handling, preprocessing, and feature creation.

3.4.3 Matplotlib

Matplotlib is a powerful Python data visualization package that can help summarize the
results and enhance the study of false news detection models. When it comes to
identifying false news, Matplotlib can be utilized in a multitude of ways to provide an
understanding of the predictions and summarize the model's performance.

3.4.4 Scikit-Learn

Scikit-learn, a well-known Python machine-learning toolkit, is used extensively in the
fake news detection code. It is essential to several operations, including data
preparation, training, and evaluation, and the code draws on a range of its features. The
`train_test_split` function from Scikit-learn is first used to divide the dataset into
training and testing sets.

3.4.5 NLTK

Natural Language Toolkit is a potent Python library that is widely used for tasks related to
natural language processing (NLP), such as the identification of false news. Machine
learning models are constructed using the numerical representations of text data
following feature extraction. A variety of algorithms, including Support Vector Machines
(SVM), Naive Bayes, and even deep learning methods, can be used.

3.4.6 Word Cloud

A word cloud is a visual representation of text data in which the size of each word
reflects the term's relevance or frequency within the text corpus. Word clouds can be
quite useful for both exploratory data analysis and feature extraction when it comes to
machine learning (ML) techniques for false news identification.

3.5 Evaluation Measures

Several evaluation techniques can be used to assess the effectiveness of the classifier in
identifying fake news using machine learning. Commonly employed evaluation metrics
include:

1. Accuracy: This metric calculates the percentage of correct predictions made by
the classifier out of the total number of predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

2. Precision: Precision measures the proportion of true positive predictions relative
to the total number of positive predictions.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

3. Recall: Also known as sensitivity or the true positive rate, recall measures the
proportion of actual positive events that the classifier correctly identified.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

4. F1-score: This metric combines precision and recall into a single statistic, offering
a balanced assessment that accounts for both false positives and false negatives.
The F1-score is especially useful when dealing with imbalanced datasets.

$$\text{F1-Score} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \tag{4}$$

Where, TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.

5. Confusion Matrix: A confusion matrix is a tabular representation that shows the
true positive, true negative, false positive, and false negative predictions.

Using functions provided by libraries like Scikit-learn, these evaluation metrics can be
calculated. Combining these metrics gives a comprehensive understanding of the model's
effectiveness in detecting fake news, including its accuracy, precision, recall, and the
distribution of predictions, as shown in the confusion matrix.
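
Equations (1) to (4) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical and are not the project's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Equations (1)-(4) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # Equation (1)
    precision = tp / (tp + fp)                   # Equation (2)
    recall = tp / (tp + fn)                      # Equation (3)
    f1 = tp / (tp + 0.5 * (fp + fn))             # Equation (4)
    return accuracy, precision, recall, f1

# Hypothetical counts chosen only to exercise the formulas.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(accuracy, precision, recall, f1)
```

Equation (4) is algebraically the same as the harmonic mean of precision and recall, 2PR / (P + R), which is the form in which the F1-score is often quoted.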

4. EXPERIMENTATION AND RESULTS

The system specs utilized in the project are described in the results and output section.
Additionally, it describes the variables that were changed or kept under control during the
experiments or simulations. The part concludes by presenting the experimental findings,
together with any statistical metrics or visuals that demonstrate the system's performance
or efficacy.

4.1 System Specifications

To complete this project, the following hardware specifications are required:

• Processor: 11th-generation Intel i5

• RAM: 8GB

• Storage: 512GB SSD

• Graphics Card: NVIDIA GEFORCE GTX 1650

The operating system is Windows (64-bit), and VS Code is used for Python development.
This hardware setup and software configuration ensures the project can be designed and
executed effectively.

4.2 Parameters Used

To detect fake news using machine learning, we implemented several preprocessing and
feature extraction techniques, using Count Vectorizer and TF-IDF for feature
representation and a Decision Tree Classifier for classification. The preprocessing stage
standardized the textual data to make it suitable for machine learning algorithms. We
started by converting all text to lowercase using the `apply()` function from the pandas
library, which applies a supplied function to each element of a DataFrame column. Here, a
lambda function applied `lower()` to each element in the 'text' column, converting all
uppercase letters to lowercase. This step is critical because it standardizes the text
data, eliminating inconsistencies caused by differences in capitalization between sources.

Next, we removed common words, known as stop words, using the `stopwords` corpus. Stop
words such as "a," "the," "is," and "are" occur frequently in language but often carry
little meaning and can interfere with the analysis of the underlying text. The
`download()` function is used to download the stop-word corpus, and the
`stopwords.words()` call then returns a list of all the stop words in English. We applied
another lambda function to each element in the 'text' column of the DataFrame, splitting
each text element into individual words using the `split()` function and keeping only the
words that are not in the stop-word list.
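
The lowercasing and stop-word filtering steps can be sketched in plain Python as below. The small `STOP_WORDS` set, the `preprocess` helper, and the punctuation stripping are assumptions for illustration; the project itself applies these steps through pandas with a downloaded English stop-word list.

```python
import string

# Small hypothetical stop-word set; the project uses a downloaded English corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, split into words, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

tokens = preprocess("The Senate is voting on a Budget bill, officials said.")
print(tokens)
```

The same function could be passed to a DataFrame column's `apply()` to process every article in one call.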

Tokenization, a crucial step in text preprocessing, involves splitting text into tokens
(words). By converting all text to lowercase during vectorization, the Count Vectorizer
ensures that variations in capitalization do not result in separate tokens. Additionally, the
Count Vectorizer allows for the elimination of stop words, further refining the textual
data by excluding words that frequently occur but carry little meaningful information.
This refinement ensures that the text representation focuses on more significant terms that
contribute to distinguishing between real and fake news.

We used the Term Frequency-Inverse Document Frequency (TF-IDF) method for feature
extraction. This is a statistical method that measures the importance of a word in a
document relative to a collection of documents (corpus). It consists of two components:
Term Frequency (TF) and Inverse Document Frequency (IDF). Term Frequency (TF)
measures the frequency of a word in a document. It is obtained by counting the number
of times a word appears in a document, and then dividing this count by the total number
of words in the document.
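
The term frequency calculation described above can be written out in a few lines. The sketch below is only an illustration of the definition (count divided by total words); the function name `term_frequency` and the sample sentence are assumptions made for this example, and a library vectorizer may tokenize and normalize differently.

```python
from collections import Counter


def term_frequency(document):
    """Return {word: count / total words} for a whitespace-tokenized document."""
    words = document.lower().split()
    total = len(words)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample document: "the" appears 2 times out of 6 words.
tf = term_frequency("the cat sat on the mat")
print(tf["the"])
```

Because every word's count is divided by the same total, the term frequencies of a document always sum to 1.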

We used the `train_test_split` function from the scikit-learn library to split the dataset
into training and testing subsets for training and evaluating the models. A training set is
needed for the model to learn patterns and features in news articles, and a test set is
needed to evaluate its performance on unseen data.
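
The idea behind a train/test split can be illustrated with a simplified stand-in written in plain Python. This is not scikit-learn's implementation of `train_test_split`, which offers more options; the function name `holdout_split`, the 20% test fraction, the seed, and the toy data are all assumptions chosen for the example.

```python
import random


def holdout_split(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle paired samples and labels, then split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data: ten article texts with alternating labels.
articles = [f"article {i}" for i in range(10)]
labels = ["fake" if i % 2 else "real" for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(articles, labels)
print(len(x_train), len(x_test))
```

Shuffling before splitting avoids a test set made up only of the last rows of the dataset, and keeping samples and labels aligned by index ensures each article keeps its own label.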

We chose the Decision Tree Classifier because it has several advantages in the context of
detecting fake news. Decision trees are highly interpretable, allowing the model's
conclusions to be understood and justified easily. This interpretability is
critical in fake news detection since it is necessary to understand the reasons behind
classifying news as real or fake. Decision trees can handle both categorical and numerical
features, making them useful for assessing the various data types found in news items,
including text, headlines, and metadata.

The Decision Tree Classifier's overall simplicity, interpretability, versatility, and capacity
to recognize complex patterns make it a crucial tool in fake news detection, enabling
clear and understandable decision-making and contributing to the identification of
significant features that enhance the credibility of the classification results.

In summary, our approach to fake news detection involves comprehensive text
preprocessing and feature extraction, followed by model training and evaluation using
robust techniques.

4.3 Results and Outcomes

Table 1 summarizes the outcomes we obtained. Applying the decision tree classifier
improved our results: as shown in Table 1, we now have an F1 score of 99.61%, an
accuracy of 99.6%, a precision of 99.75%, and a recall of 99.52%.

Table 1. Performance Metrics of the Classifier

Classifier                Accuracy    Precision    Recall    F1 Score
Naive Bayes               93.65%      94%          94%       94%
Random Forest             98.75%      99%          99%       99%
Logistic Regression       98.78%      99%          99%       99%
Support Vector Machine    99.33%      99%          99%       99%
Decision Tree             99.67%      100%         100%      100%

Figure 8. Confusion matrix

We considered several criteria to assess the performance of the methods. The confusion
matrix is the foundation for most of them: it tabulates a model's predictions on the test
set against the actual labels. The most frequently used metric is accuracy, which indicates
the proportion of observations, whether real or fake, that were predicted correctly.

From the predictions produced by the decision tree classifier, we created a confusion
matrix. Figure 8 shows this confusion matrix, which tabulates the predictions and actual
outcomes of a classification problem and helps in assessing its results. It forms a table
of all of the classifier's predicted and actual values.
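
Accuracy, precision, recall, and F1 score all follow from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name `classification_metrics` and the counts passed to it are hypothetical and are not the values from Figure 8 or Table 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives; the counts are assumed to make every denominator non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy uses all four cells, while precision and recall each ignore the true negatives, which is why they are reported alongside accuracy rather than in place of it.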

4.4 Result Analysis and Validation

Figure 9. User Interface for prediction of fake news

Figure 9 shows the user interface of our project, which uses the best-performing model,
the Decision Tree Classifier.

Designing a decision tree classifier-based UI for fake news detection involves several
steps: the decision tree model must be developed and evaluated, an intuitive and
user-friendly interface must be designed, the model must be integrated with the UI via a
backend service, and the model must be continuously improved based on user feedback and
model performance. This approach helps ensure that the system is user-friendly and
easily comprehensible in addition to being accurate in identifying fake news.

5. CONCLUSION

Identifying false information is essential and difficult in today's digital age. The rapid
growth of social networks and online platforms has led to the widespread dissemination of
false information, which supports anti-social behaviors and has a big impact on social
digital marketing. Identifying false information remains a persistent and complicated
challenge, with no single solution capable of stopping its spread. A comprehensive
plan is required, which includes technological strategies, media literacy, and critical
thinking skills. Methods such as machine learning and natural language processing (NLP)
were developed to deal with this issue. Education and technology are essential to fight
false news, as they can help people acquire analytical thinking skills that enable them to
evaluate the reliability of information sources and their quality. Technology businesses
must collaborate with policymakers to successfully combat the spread of false
information. Through collaboration, we can improve transparency and decrease the
dissemination of false information in the digital environment. Ongoing research,
partnerships among different groups, and the creation of innovative technologies are
crucial for tackling these problems and protecting the accuracy of information in the
digital age.

6. REFERENCES

[1] Wang, Y., Qian, X. Li Y., & Zhang, H. (2018). Fake news detection on social
media: A data mining perspective using a hybrid deep learning model. ACM
Transactions on Management Information Systems (TMIS), 9(3), 1-21.

[2] Albahr, A., & Albahar, M. (2020). An empirical comparison of fake news
detection using different machine learning algorithms. International Journal of
Advanced Computer Science and Applications, 11(9).

[3] Khan, A. I., Shahzad, F., & Ali, S. (2019). Fake news detection: a deep learning
approach using CNN. IEEE Access, 7, 44112-44121.
doi: 10.1109/ACCESS.2019.2901590.

[4] Thakur, P., Shah, R. R., & Rana, N. P. (2020). A survey on automated fake news
detection: Trends and challenges. Information Processing & Management, 57(2),
102026. doi: 10.1016/j.ipm.2019.102026.

[5] Kumar, R., Singh, R. K., & Roy, P. P. (2021). Fake news detection on social media:
A review. Artificial Intelligence Review, 54(4), 2997-3030.
doi: 10.1007/s10462-020-09981-4.

[6] Allcott, H., & Gentzkow, M. (2017). Social Media and Fake News in the 2016
Election. Journal of Economic Perspectives, 31(2), 211-236.
https://doi.org/10.1257/jep.31.2.211

[7] Conroy, N. J., Rubin, V. L., & Chen, Y. (2015). Automatic deception detection:
Methods for finding fake news. Proceedings of the Association for Information
Science and Technology, 52(1), 1-4.
https://doi.org/10.1002/pra2.2015.145052010082

[8] Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on
Social Media: A Data Mining Perspective. ACM SIGKDD Explorations
Newsletter, 19(1), 22-36. https://doi.org/10.1145/3137597.3137600

[9] Rashkin, H., Choi, E., Jang, J. Y., Volkova, S., & Choi, Y. (2017). Truth of
Varying Shades: Analyzing Language in Fake News and Political Fact-Checking.
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2931-2937. https://doi.org/10.18653/v1/D17-1317

7. REFLECTION OF THE TEAM MEMBERS ON THE PROJECT

Working on the project to identify fake news using machine learning techniques like
decision trees has given our team new knowledge and experience. We were able to
achieve our objectives through efficient teamwork since every team member brought a
unique set of skills and perspectives to the table. In this reflection, each of us will offer
our unique viewpoints and project-related contributions.

During this project, I, Aditi Rath (2041018064), focused mainly on feature
engineering and data preprocessing. I collected and analyzed a large number of news items
using techniques like tokenization, stemming, and TF-IDF in order to extract useful
features. Using decision tree algorithms, I was able to identify crucial features that help
identify false news. In order to improve the decision tree model's performance, I also
performed cross-validation and made hyperparameter adjustments. In addition, I actively
participated in team meetings and discussions by bringing up ideas to improve the overall
workflow and boost team output. I also concentrated on creating and evaluating the
decision tree model, which required understanding the preprocessed data and selecting the
appropriate features for the model's training. To assess the precision, recall, and accuracy
of the model, I employed decision tree approaches and conducted a comprehensive testing
procedure. I also investigated components like the TF-IDF transformer and CountVectorizer
to see whether they could help the model perform better. Throughout the project, I
encouraged discussions with the team about potential adjustments and future directions
by sharing the data and conclusions with them. Finally, I created comprehensive
evaluation standards, including F1 score, recall, accuracy, and precision, to rank the
decision tree model's performance. I also used methods like confusion matrix analysis
to examine the strengths and weaknesses of the model, and built a user interface as well.

I, Ashutosh Kumar (2041011113), worked on the overall project overview, explaining why
it was necessary to build the Fake News Detection system while respecting the integrity
of the other working models in the existing system. I also worked on the motivation behind
this project, which was to maintain the authenticity of the information shared across the
internet. Our project is distinctive in that it compares four traditional machine learning
models with our model, comparing their accuracies to find the best one. Throughout the
project, I encouraged discussions with the team about potential adjustments and future
directions by sharing the data and conclusions with them.

For the model's training, I, Ashutosh Sarangi (2041019145), focused on obtaining and
annotating a reliable dataset. I worked very hard to locate trustworthy news sources and
ensured that the dataset was of the highest caliber. I also worked on the system
specifications to analyze what was best for our project and to ensure its smooth
application. I collaborated with my team members to finalize the parameters we used for
the successful application of our project.

I, Indrajit Das (2041004164), focused mainly on the assessment and interpretation of the
model. I analyzed the existing system by reading articles on fake news detection across
the internet and reviewed the solutions that already exist. This led me to the problem
statement for this project: comparing traditional ML algorithms to find the best one.

Overall, we were able to tackle the challenging problem of identifying fake news by
effectively cooperating and utilizing our unique individual skills.
8. SIMILARITY REPORT
