
Abstract

The ongoing spread and expansion of information technology and social media sites have made it easier for people to access different types of news – political, economic, medical, social, etc. – through these platforms. This rapid growth in news outlets and in the demand for information has blurred the line between real and fake news, leading to the dissemination of fake news, a dangerous state of affairs. The outbreak of the coronavirus pandemic and the rising awareness of the dangers it posed globally saw a parallel increase in fake news and rumors. As a result, people became unsure of what to believe and questioned the reliability of the information they encountered, creating an environment where rumors, false news, humor, and unsubstantiated claims sowed panic and propagated misleading ideas. This surge in misinformation undermined public confidence, freedom of speech, journalistic integrity, and general clarity. A number of research studies have leveraged diverse datasets and achieved high levels of accuracy in fake news detection. However, the specific issue of fake news related to the coronavirus pandemic has received less attention, and the few studies on this subject have been limited by small datasets or narrow categories. This thesis aims to address this gap by improving fake news detection through the application of machine learning models, including Support Vector Machines (SVM), Random Forest (RF), XGBoost, and Logistic Regression, using a large dataset (44,898 rows) to identify fake news.

1 Introduction
In the modern digital era, distinguishing between authentic news and misin-
formation has become increasingly challenging. The rapid rise of social media
and other online platforms has facilitated the swift spread of fake news, lead-
ing to misinformation, public alarm, and societal disruption. Fake news is not
solely driven by commercial interests aiming to increase viewer engagement
and advertising profits. It has also become a powerful tool for individuals
and groups with potentially harmful intentions. These actors use fake news to
manipulate global events, shape public opinions, and influence policies, posing
serious threats to democratic processes and social cohesion. Addressing this
complex issue, our project adopts a comprehensive approach. Using advanced
machine learning methods, we compare several widely used algorithms, including Naive Bayes, Support Vector Machines (SVM), and XGBoost. The aim is to
evaluate each algorithm’s strengths, limitations, and their applicability in fake
news detection. This detailed analysis serves as the groundwork for developing
a hybrid model that combines the strengths of each approach. By integrating
the most effective elements of these techniques, we intend to create a robust,
adaptive system capable of efficiently identifying and combating the diverse
challenges presented by fake news.

1.1 Applications
Fake news detection technology has found widespread applications across var-
ious sectors, helping to mitigate the harmful impact of misinformation on so-
ciety. In the media and journalism industries, these technologies play a crucial
role in maintaining the integrity of information. By verifying the accuracy of
stories before publication, they help prevent the spread of misleading narra-
tives. Automated fact-checking tools, powered by advanced machine learning
algorithms, assist journalists in cross-referencing information, ensuring that
only credible content is shared with the public.
On social media platforms, detection systems that swiftly identify and address misinformation not only protect individual users from false narratives but also prevent the amplification of misinformation on a global scale.
Furthermore, in global security and emergency response scenarios, fake
news detection is critical for ensuring accurate and timely information. Mis-
information during emergencies can lead to chaos, hamper rescue efforts, and
put lives at risk.

1.2 Motivation
The profound impact of misinformation on individuals, societies, and democra-
cies worldwide drives the innovation behind fake news detection technologies.
In an era saturated with digital information, the rapid spread of false news
has disrupted public discourse, eroded trust in the media, and instigated so-
cial discord. Fabricated stories and misleading narratives can fuel societal
tensions, inciting fear, prejudice, and hatred. In the political sphere, misinfor-
mation can be exploited to influence elections and policy-making, threatening
the core principles of democratic governance. Economically, the dissemination
of false information affects businesses, stock markets, and consumer behavior,
contributing to instability within the financial system.
Detecting false news is not just a technical challenge but also a moral
responsibility to preserve truth and foster public awareness. Ultimately, by
leveraging advanced algorithms and technologies, we can differentiate between
fact and fiction. This empowers individuals to make well-informed decisions,
reinforces democratic processes, rebuilds public trust in media, and equips
society with the tools to act responsibly.

1.3 Objectives
The primary objectives of this project are as follows:

• This project aims to tackle the challenges of fake news detection by utilizing a variety of machine learning models, including Support Vector Machines (SVM), Naive Bayes, Random Forest (RF), and XGBoost.

• Our primary objectives are to assess the performance of these models, compare their effectiveness, and identify the most appropriate method for accurate and efficient detection of fake news in the digital era.

• By undertaking this analysis, we hope to contribute to the ongoing efforts to strengthen information ecosystems and safeguard against the widespread threat of fake news.

• In addition to evaluating the accuracy of these models, we will also focus on computational efficiency, ensuring that the models can be scaled and deployed in real-time environments, such as social media platforms and news aggregators.

• Furthermore, we will explore the impact of various data preprocessing techniques, such as text summarization and feature selection, on model performance, aiming to enhance detection accuracy and reduce false positives.

• By incorporating these considerations, our project seeks to create a more robust and adaptable fake news detection system.

1.4 Contribution
• Our project compares the performance of various machine learning algorithms, including Support Vector Machines (SVM), Naive Bayes, Random Forest (RF), and XGBoost. By conducting a detailed evaluation of these models, we provide insights into the strengths and limitations of each algorithm in the context of fake news detection.

• A notable contribution of this work is the use of a large dataset (44,898 rows), ensuring that the findings are based on a wide variety of news articles. This enhances the reliability of the model evaluations and makes the results more generalizable to real-world scenarios.

• The system was designed to run entirely in a web browser, ensuring that users do not need additional hardware or software, making it highly accessible.

• In addition to accuracy, the project focuses on the computational efficiency of the models, ensuring that they can be scaled for real-time application on platforms such as social media and news websites. This paves the way for potential real-world deployment in environments where quick and accurate detection of fake news is crucial.

• The findings and methodology of this project serve as a foundation for future research in the field of fake news detection. By providing detailed comparisons of different machine learning techniques, the project offers a benchmark that future studies can use to develop more sophisticated and efficient models for identifying misinformation.

1.5 Organization of Project Report
1.5.1 Chapter 1: Introduction
1.5.2 Chapter 2: Literature Survey
1.5.3 Chapter 3: Proposed Algorithm
1.5.4 Chapter 4: Simulation and Results
1.5.5 Chapter 5: Conclusion and Future Work

2 Literature Survey
The rapid growth of social media and the widespread availability of digital
news sources have contributed to an unprecedented rise in the dissemination
of misinformation, often labeled as "fake news." This proliferation of false in-
formation presents significant societal challenges, including political manipu-
lation, public health misinformation, and the erosion of public trust in credible
news sources. As a response, researchers have explored a variety of machine
learning algorithms to effectively identify and mitigate the spread of fake news.
The literature on fake news detection primarily revolves around the use of tra-
ditional machine learning algorithms, which are valued for their explainability,
ease of implementation, and effective performance on moderate datasets.
Support Vector Machines (SVMs) have been widely utilized in the
detection of fake news due to their robustness in binary classification. SVM is
effective in finding a hyperplane that best separates different classes, such as
fake and real news, particularly when dealing with high-dimensional data. This
approach has consistently demonstrated strong performance in distinguishing
false from factual content, making it a popular choice among early researchers
in the field. The capacity of SVM to work with different kernel functions
allows for both linear and non-linear boundaries, making it flexible in handling
complex textual data features.
Extreme Gradient Boosting (XGBoost) has also gained attention for
its application in fake news detection. As a form of gradient boosting, XGBoost
iteratively improves model performance by minimizing classification errors of
weak learners and then combining them into a more accurate overall model.
The efficiency of XGBoost in handling large datasets and its ability to model
complex relationships within the data make it well-suited for detecting nuanced
signals indicative of fake news, such as the presence of misleading language,
sensationalist headlines, or unverified sources. XGBoost's success in fake news
detection largely stems from its boosting mechanism, which sequentially en-
hances the capabilities of weak learners, thereby producing a highly predictive
model that outperforms many traditional methods.
The application of machine learning to fake news detection also necessi-
tates careful consideration of feature engineering, where textual features are
extracted using Natural Language Processing (NLP) techniques. Common
methods such as Term Frequency-Inverse Document Frequency (TF-IDF),
Bag-of-Words (BoW), and Word2Vec are often used to convert text into nu-
merical vectors suitable for model input. These features are then employed by
machine learning models to identify patterns that may indicate falsehood, such
as specific phrases, stylistic patterns, or unusual language usage commonly as-
sociated with misinformation. Effective feature extraction and selection are
critical, as they determine the quality of input provided to the machine learn-
ing models, thus directly impacting the detection accuracy.
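
To ground this step, the short sketch below converts a toy corpus into TF-IDF vectors with scikit-learn; the example documents and the max_features cap are illustrative assumptions rather than settings drawn from any cited study.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; real studies use thousands of labeled news articles.
    docs = [
        "scientists confirm vaccine trial results in peer reviewed study",
        "shocking miracle cure the government does not want you to see",
    ]

    # TF-IDF up-weights terms that are distinctive to a document and
    # down-weights words that appear in nearly every document.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    X = vectorizer.fit_transform(docs)   # sparse matrix: (n_docs, n_features)

    print(X.shape)
    print(vectorizer.get_feature_names_out())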
In summary, the literature on fake news detection emphasizes the effec-
tive use of traditional machine learning algorithms like SVM, Random Forest,
XGBoost, and LightGBM. These methods have been integral in achieving sig-
nificant accuracy in the identification of fake news due to their strong predictive
capabilities and adaptability to large datasets with complex patterns. Each
algorithm presents unique advantages—SVM’s strength in binary classifica-
tion, Random Forest’s robustness via ensemble learning, XGBoost’s efficiency
through boosting, and LightGBM’s computational speed—all of which con-
tribute to their applicability in combating the challenge of fake news prolifer-
ation. These models, supported by well-crafted feature engineering and NLP
techniques, continue to form the foundation for effective and scalable fake news
detection systems.

3 Methodology
3.1 Data Collection
The study utilized datasets sourced from Kaggle. This dataset comprises approximately 16,600 rows derived from various online articles. The training dataset required extensive pre-processing, as demonstrated by our source code. The full training dataset includes the following attributes:

1. ID: A unique identifier for each news article.
2. Title: The headline or title of the news article.
3. Author: The author of the news article.
4. Text: The body text of the article, which may be incomplete in some instances.
5. Label: This attribute categorizes the article as potentially unreliable, with possible values being:
   - 1: Represents unreliable articles.
   - 0: Represents reliable articles.
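
For concreteness, a minimal sketch of loading and inspecting such a dataset with pandas is shown below; the file name train.csv is a hypothetical placeholder for however the Kaggle files are stored locally.

    import pandas as pd

    # Hypothetical file name for the Kaggle training split.
    df = pd.read_csv("train.csv")

    # Columns described above: id, title, author, text, label.
    print(df.shape)
    print(df["label"].value_counts())   # 1 = unreliable, 0 = reliable

    # Some articles have incomplete text or missing authors; inspect before cleaning.
    print(df.isna().sum())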

3.2 Data Processing


A comprehensive data preprocessing pipeline was established to prepare the
dataset for machine learning purposes. This pipeline included various steps
such as text normalization, managing missing values, and addressing any out-
liers present in the data. To improve the quality of the textual data and reduce
noise, several text cleaning techniques were applied. These techniques included
removing stop words and punctuation, as well as performing stemming to sim-
plify the words to their root forms.
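
A minimal sketch of such a cleaning step, using NLTK for stop-word removal and Porter stemming, might look as follows; the exact pipeline used in this project may differ in its details.

    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def clean_text(text: str) -> str:
        """Lowercase, strip punctuation, drop stop words, and stem each token."""
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        tokens = [STEMMER.stem(t) for t in text.split() if t not in STOP_WORDS]
        return " ".join(tokens)

    print(clean_text("Breaking News: Officials DENY the viral claims!"))
    # -> "break news offici deni viral claim"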

3.3 Feature Extraction and Pre-Processing


In our methodology, we begin by utilizing Optical Character Recognition
(OCR), a technology used to transform various document types—such as
scanned paper documents, PDFs, or images—into editable and searchable text
data. OCR works by analyzing the visual patterns of characters and converting
them into machine-readable text, making it easier to edit, search, and manage
information. Initially developed in the 1950s, early OCR systems had limited
capabilities, recognizing only specific fonts and languages. Today, however,
advancements in machine learning and neural networks like CNN and RNN
have vastly improved OCR’s capabilities, enabling it to accurately process
diverse fonts and languages. OCR now plays a vital role in digitizing materials,
automating data entry, and supporting text extraction for AI and machine
learning applications, making it an essential tool across multiple industries.
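
As an illustration only, a typical OCR call through the pytesseract wrapper might look like the sketch below; it assumes the Tesseract engine is installed on the system, and article_scan.png is a hypothetical scanned news clipping.

    from PIL import Image

    import pytesseract

    # Hypothetical scanned article; requires a local Tesseract installation.
    image = Image.open("article_scan.png")
    text = pytesseract.image_to_string(image)

    print(text[:500])   # the extracted text is passed downstream to Doc2Vec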
Once the OCR step is completed and we have extracted the text from
scanned images, the next key stage is employing the Doc2Vec model to gen-
erate vector embeddings. The objective of Doc2Vec is to create a numerical
representation for each document, encapsulating its key content and context.
Before the text is fed into the Doc2Vec model, it undergoes preprocessing to
improve the quality of the generated embeddings. This preprocessing includes
removing stopwords, special characters, and punctuation, as well as converting
all text to lowercase to ensure consistency. The result is a clean and standard-
ized list of words, ready for further processing. Doc2Vec, introduced in 2014,
is an extension of the Word2Vec model, which was originally designed for word-
level embeddings. While Word2Vec creates word vectors and aggregates them
to represent a document, it lacks the ability to maintain word order, which
can be critical for capturing context. Doc2Vec addresses this by adding a
”document vector” to the representation, preserving contextual relationships
within the document, which helps in capturing the nuances and sequence of
words more effectively.
This ability to retain word order is particularly advantageous for our use
case, as it enhances the model’s capability to understand subtle differences
and relationships within the text. Ultimately, this leads to a more comprehen-
sive and accurate analysis, improving the depth and insight of our subsequent
modeling process.
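
A minimal Doc2Vec sketch using gensim is shown below; the token lists and hyperparameters (vector_size, epochs) are illustrative assumptions, not the project's exact settings.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Token lists as produced by the preprocessing step; placeholder values.
    cleaned_docs = [
        ["officials", "deny", "viral", "claims"],
        ["study", "confirms", "vaccine", "trial", "results"],
    ]
    corpus = [TaggedDocument(words=doc, tags=[i])
              for i, doc in enumerate(cleaned_docs)]

    # vector_size and epochs are illustrative; tune them on a validation split.
    model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Infer a fixed-length embedding for an unseen document.
    vector = model.infer_vector(["new", "unverified", "rumor"])
    print(vector.shape)   # (100,)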

3.4 Machine Learning Models


3.4.1 Support Vector Machines (SVMs)
Support Vector Machines are a type of supervised machine learning al-
gorithm used primarily for classification tasks. Initially, SVMs were designed
to handle linear classification problems. However, real-world data is often
non-linearly separable, which limited the original version of SVM. This lim-
itation was overcome with the introduction of the kernel trick, allowing the
algorithm to perform non-linear classifications. A kernel function maps data
to a higher-dimensional space, making it easier to classify complex datasets.

One popular kernel is the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel. The RBF kernel is particularly useful because it can handle non-linear relationships between data points by mapping them into a higher-dimensional space where they become linearly separable. The RBF kernel formula is given as:

K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)    (1)

In document classification tasks, transforming text into numerical feature vectors is crucial. The Doc2Vec model, an extension of Word2Vec, enables the transformation of entire documents into fixed-length vectors. This is useful because these vectors represent the semantic similarity between documents, which is important when feeding data into machine learning models like SVM.
By using the Doc2Vec technique, we can generate meaningful feature vec-
tors for documents. Then, by applying the RBF kernel in an SVM, we can
measure the similarity between these vectors in a way that reflects their orig-
inal semantic relationships. The distance between feature vectors, computed
through the kernel, ensures that documents with similar content are correctly
classified together.
At its core, SVM operates on a fundamental principle: the creation of a maximal "street" to distinctly demarcate diverse data classes. This principle aligns with the overarching objective of maximizing the separation between data clusters, ultimately manifesting as the following optimization problem in the SVM formulation:

\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n \left( w^T \phi(x_n) + b \right) \right] \right\}    (2)

subject to

t_n \left( w^T \phi(x_n) + b \right) \geq 1, \quad n = 1, \ldots, N    (3)
By leveraging this mathematical approach, we are able to navigate the SVM
optimization landscape with greater flexibility and achieve optimal results in
classifying and separating the data clusters effectively. Introducing Lagrange multipliers a_n for the margin constraints yields the Lagrangian:

L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} a_n \left[ t_n \left( w^T \phi(x_n) + b \right) - 1 \right]    (4)

where a_n \geq 0, \; n = 1, \ldots, N.
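
To make this concrete in practice, the sketch below trains an RBF-kernel SVM with scikit-learn on placeholder vectors standing in for Doc2Vec embeddings; all data and parameter values are illustrative assumptions. Note that scikit-learn parameterizes the RBF kernel as exp(-gamma * ||x - x'||^2), so gamma plays the role of 1/(2*sigma^2) in Equation (1).

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Placeholder stand-ins for Doc2Vec embeddings and labels (1 = fake, 0 = real).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 100))
    y = rng.integers(0, 2, size=200)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # gamma="scale" sets gamma from the data; it corresponds to 1/(2*sigma^2)
    # in the Gaussian kernel of Equation (1).
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))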

3.4.2 Naïve Bayes
To establish a baseline accuracy for our dataset, we utilized a Naive Bayes clas-
sifier, specifically the Gaussian Naive Bayes version provided by scikit-learn.
Gaussian Naive Bayes is well-known for being one of the simplest classification
techniques, using a probabilistic approach based on the assumption that all
features are conditionally independent given the class label. This assumption
simplifies the computational process and is a fundamental aspect of the Naive
Bayes classifier.
We used Doc2Vec embeddings in our Naive Bayes classifier, similar to our
process with other models. Using these embeddings enables the classifier to
better understand and differentiate between different pieces of text data, en-
hancing the overall classification.
The Naive Bayes Rule, which forms the core of this classifier, relies on
Bayes’ theorem, a fundamental concept in probability theory. This rule allows
us to estimate the probability of a specific class given a set of observed fea-
tures, thus driving our classification decisions. Implementing the Naive Bayes
classifier at this stage allows us to establish a benchmark accuracy to evaluate
the performance of subsequent models and improvements.

P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}    (5)
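
A baseline of this kind takes only a few lines with scikit-learn. In the sketch below, the random vectors are placeholder stand-ins for the Doc2Vec embeddings, so the setup shown is illustrative rather than the project's exact configuration.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Placeholder embeddings standing in for Doc2Vec vectors (1 = fake, 0 = real).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))
    y = rng.integers(0, 2, size=200)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # GaussianNB models each feature as a class-conditional Gaussian and assumes
    # features are independent given the label -- the "naive" assumption.
    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print("Baseline accuracy:", nb.score(X_test, y_test))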

3.4.3 Extreme Gradient Boosting


Extreme Gradient Boosting is a prominent ensemble learning algorithm
commonly used for machine learning tasks, particularly regression and classi-
fication. It leverages the boosting technique to enhance accuracy by sequen-
tially improving weak models. XGBoost starts by building a decision tree and
then computes the residuals from its predictions. These residuals are used
as inputs for subsequent trees to correct prior errors, thereby refining the pre-
diction accuracy. The process continues until the loss function is minimized
or the specified number of trees is reached, ultimately improving the model’s
forecasting capability through gradient boosting.
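
As a hedged illustration, the sketch below uses the xgboost scikit-learn wrapper; the hyperparameter values are assumptions for demonstration, since the report does not specify them.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Placeholder features standing in for document embeddings (1 = fake, 0 = real).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))
    y = rng.integers(0, 2, size=200)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Each new tree is fit to the residual errors of the current ensemble,
    # sequentially correcting earlier mistakes (gradient boosting).
    xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6,
                        eval_metric="logloss")
    xgb.fit(X_train, y_train)
    print("Test accuracy:", xgb.score(X_test, y_test))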

4 Results and Simulation
The results and evaluation phase focused on assessing the performance of each
model implemented for fake news detection. Metrics such as accuracy, preci-
sion, recall, and F1 score were calculated for each model, offering a compre-
hensive view of their strengths and weaknesses.
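
These metrics can be computed with scikit-learn, as in the minimal sketch below; the label vectors shown are placeholder values, and in practice y_pred would come from a trained model's predict method.

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)

    # Placeholder ground-truth labels and predictions (1 = fake, 0 = real).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))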

Figure 1: Confusion matrices for SVM (a), Naive Bayes (b), and XGBoost (c).

4.1 Comparisons

Model         Accuracy   Recall   Precision   F1 score
SVM           91.28%     0.92     0.90        0.91
Naive Bayes   71.92%     0.86     0.68        0.75
XGBoost       86.45%     0.92     0.77        0.84

Table 1: Performance metrics for different models

5 Conclusion and Future Scope
Summary: This comprehensive investigation into fake news detection us-
ing various machine learning models highlights the performance of XGBoost,
SVM, and Naive Bayes. Among these models, XGBoost demonstrated re-
markable accuracy in addressing the complexities of fake news detection. The
investigation summarizes the findings, showcasing the potential of leveraging
different machine learning models, each with its strengths, to effectively tackle
the challenges posed by fake news.
Significance: The significance of this project lies in its contribution to the
field of fake news detection by comparing traditional machine learning models
and their abilities to capture the nuances of language. By emphasizing the
effectiveness of models like XGBoost and SVM, the study underscores the im-
portance of applying suitable techniques to handle intricate linguistic features.
The findings enhance our understanding of effective strategies for detecting
fake news and advocate for the integration of traditional and advanced ma-
chine learning approaches for more accurate and robust detection in real-world
applications.
Future Work: Moving forward, the conclusion serves as a foundation for
future research in fake news detection. Given the evolving nature of misinfor-
mation and the continuous development of deceptive strategies, this section
suggests multiple pathways for enhancement and exploration. These include
optimizing current models for better performance, exploring new model ar-
chitectures for deeper insights, and mitigating biases in datasets to improve
model generalizability. This perspective recognizes that effectively combating
misinformation demands continuous innovation and adaptation.
Conclusion: In conclusion, this project not only provides an understand-
ing of the current landscape of fake news detection but also lays the ground-
work for a proactive and adaptive approach to upcoming challenges. The entire
process, from defining the problem to evaluating the models, emphasizes the
necessity of employing advanced techniques to find effective and refined solu-
tions for tackling the widespread issue of fake news [1].

References
[1] Uma Sharma, Sidarth Saran, and Shankar M. Patil. Fake news detection using machine learning algorithms. International Journal of Creative Research Thoughts (IJCRT), 8(6):509–518, 2020.
