FAKE NEWS DETECTION SYSTEM
A project report submitted
for completion of the degree of
Bachelor of Technology
SUBMITTED BY
AVIK GHOSH [20/IT/029] [10300220029]
ANANT KUMAR [20/IT/015] [10300220015]
BISWAJIT DEBNATH [20/IT/035] [10300220035]
NAIRRIT MUKHOPADHYAY [20/IT/059] [10300220059]
DECLARATION
We hereby declare that this project work titled “FAKE NEWS DETECTION SYSTEM”
is our original work and that no part of it has been submitted for any other degree or
published in any other form till date.
Signature of Students
…………………………………………….
AVIK GHOSH [20/IT/029] [10300220029]
…………………………………………….
ANANT KUMAR [20/IT/015] [10300220015]
…………………………………………….
BISWAJIT DEBNATH [20/IT/035] [10300220035]
…………………………………………….
NAIRRIT MUKHOPADHYAY [20/IT/059] [10300220059]
ACKNOWLEDGEMENT
We hereby wish to express our sincere gratitude and respect to Amit Sur, Assistant
Professor, Department of Information Technology, Haldia Institute of Technology under
whom we had the proud privilege to work. His valuable guidance and encouragement have
really led us along the path to completion of this project. Any amount of thanks would not
be enough for the valuable guidance of our supervisor.
We would also like to thank all the faculty members of the Department of Information
Technology for their devoted help. We also cordially thank all the laboratory assistants for their
cooperation.
Finally, we would like to pen down our gratitude towards our family members for their
continuous support and encouragement. It would not have been possible to complete our
work without their support.
ABSTRACT
In recent times, owing to the booming development of online social networks, fake news
created for various commercial and political purposes has been appearing in large numbers
and spreading widely in the online world. Misled by deceptive words, online social network
users can be infected by this fake news easily, and it has already had a tremendous effect on
offline society. An important step in improving the trustworthiness of information in online
social networks is to identify fake news in a timely manner. This report aims at investigating
the principles, methodologies and algorithms for detecting fake news articles, creators and
subjects in online social networks, and at assessing the corresponding performance. The
accuracy of information on the Internet, especially on social media, is an increasingly
important concern, but web-scale data hampers the ability to identify, evaluate and correct
such data, the so-called "fake news," present on these platforms. The results may be improved
by applying the several techniques discussed in this report. The results obtained suggest that
fake news detection problems can be addressed with machine learning methods.
In our modern era where the internet is everywhere, everyone relies on various online
resources for news. Along with the increase in the use of social media platforms like
Facebook, Twitter, etc. news spread rapidly among millions of users within a very short
span of time. The spread of fake news has far-reaching consequences, ranging from the
creation of biased opinions to the swaying of election outcomes for the benefit of certain
candidates.
Moreover, spammers use appealing news headlines to generate revenue using
advertisements via clickbait.
Although many attempts have been made to solve the problem of fake news, any significant
success is yet to be seen. With huge amounts of data collected from social media websites
like Facebook, Twitter, etc., the best models improve every day. With the use of deep
neural networks, future work in this field seems a lot more promising. The limitations
that come packaged with this problem are that the data is erratic, which means that any type
of prediction model can have anomalies and make mistakes. For future improvements,
concepts like POS tagging, word2vec and topic modeling can be utilized. These will give
the model a lot more depth in terms of feature extraction and fine-tuned classification.
TABLE OF CONTENTS
Abstract
CHAPTER 1: INTRODUCTION
1.1 Domain Specific
1.3 Objective
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: PROCEDURE
3.2 Flowchart
3.3 Methodology
CHAPTER 4: REQUIREMENT ANALYSIS
CHAPTER 5: IMPLEMENTATION AND ANALYSIS
CHAPTER 6: RESULT AND SCOPE FOR FUTURE WORK
6.1 Observation
CHAPTER 7: CONCLUSION
CHAPTER 8: REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Domain Specific
The project aims to develop an advanced Fake News Detection system using cutting-edge
technologies such as machine learning and natural language processing. With the
widespread use of social media and digital platforms for information dissemination, the
rise of fake news has become a significant concern. Misleading information can have
serious consequences, including social polarization, misinformation, and erosion of trust
in credible sources. This project seeks to address these challenges by creating an automated
content verification system that can accurately identify and flag deceptive content.
The proposed system will leverage machine learning algorithms to analyse the linguistic
patterns, context, and semantic features of the content, allowing it to distinguish between
legitimate and deceptive information. Natural language processing techniques will be
employed to extract meaningful insights from textual data and perform sentiment analysis.
The system will also consider metadata and source credibility to enhance its detection
capabilities.
If fake news could not be detected, the world would no longer hold value in truth. Fake
news paves the way for deceiving others and promoting ideologies. The people who produce
such misinformation benefit financially from the number of interactions on their
publications. Spreading disinformation serves various intentions: to gain favour in political
elections, to promote businesses and products, or simply out of spite or revenge. Humans
can be gullible, and fake news is challenging to differentiate from normal news. Most people
are easily influenced, especially by content shared by friends and family, owing to
relationships and trust. We tend to respond emotionally to news, which makes it easy to
accept a story when it is relevant to us and aligns with our own beliefs. We therefore become
satisfied with what we want to hear and fall into these traps.
1.3 Objective
Develop an advanced Fake News Detection system: The primary objective of the
project is to design and develop a robust content verification system that can accurately
distinguish between legitimate information and deceptive content, including fake news.
The system will employ state-of-the-art machine learning and natural language
processing techniques to achieve high accuracy in identifying deceptive information.
Employ machine learning and NLP techniques for content analysis: The project aims
to explore and implement various machine learning algorithms and natural language
processing techniques to analyse textual data effectively. Feature extraction, sentiment
analysis, and semantic understanding will be used to extract meaningful insights from
the content for reliable classification.
Evaluate the system's accuracy and performance: The project will assess the
effectiveness of the Fake News Detection system through rigorous evaluation and
testing on diverse datasets. Performance metrics such as precision, recall, and F1-score
will be used to measure the system's ability to correctly identify and flag deceptive
content.
Create a user-friendly interface: The project will focus on developing an intuitive and
user-friendly interface for the detection system. The interface will allow users to verify
the authenticity of content by inputting text or URLs, and the system will provide real-time
results indicating the likelihood of the content being fake.
CHAPTER 2
LITERATURE REVIEW
There are two categories of important research in the automatic classification of real and
fake news.
In the first category, approaches work at a conceptual level, distinguishing among three
types of fake news: serious fabrications (news about wrong or unreal events or information,
such as famous rumours), hoaxes (e.g. deliberately providing wrong information) and satire
(e.g. humorous news that imitates real news but contains bizarre content).
In the second category, linguistic approaches and fact-verification techniques are used at a
practical level to compare real and fake content. Linguistic approaches try to detect text
features, such as writing style and content, that can help distinguish fake news. The main
idea behind this technique is that linguistic behaviours, such as the use of punctuation marks,
the choice of particular types of words, or the tagging of parts of speech, are largely
unintentional and therefore beyond the author's attention. An appropriate use and evaluation
of linguistic techniques can therefore yield promising results in detecting fake news.
Rubin studied the distinction between the content of real and satirical news via linguistic
features, based on a corpus of satirical news (The Onion and The Beaverton) and real news
(The Toronto Star and The New York Times) in four domains: civic, science, business and
soft news. She obtained the best fake news detection performance with a feature set
including absurdity, punctuation and grammar.
Balmas believes that the cooperation of information technology specialists in reducing fake
news is very important. In order to deal with fake news, using data mining as one of the
techniques has attracted many researchers. In data mining based approaches, data
integration is used in detecting fake news. In the current business world, data is an
ever-increasing valuable asset and it is necessary to protect sensitive information from
unauthorized people. However, the prevalence of content publishers who are willing to use
fake news leads to ignoring such endeavors. Organizations have invested a lot of resources
to find effective solutions for dealing with clickbait effects.
2. Marco L. Della Vedova et. al. first proposed a novel ML fake news detection method
which, by combining news content and social context features, outperforms existing
methods in the literature, increasing its accuracy up to 78.8%. Second, they implemented
their method within a Facebook Messenger chatbot and validated it with a real-world
application, obtaining a fake news detection accuracy of 81.7%. Their goal was to
classify a news item as reliable or fake; they first described the datasets they used for
their test, then presented the content-based approach they implemented and the method
they proposed to combine it with a social-based approach available in the literature. The
resulting dataset is composed of 15,500 posts, coming from 32 pages (14 conspiracy
pages, 18 scientific pages), with more than 2,300,000 likes by 900,000+ users. 8,923
(57.6%) posts are hoaxes and 6,577 (42.4%) are non-hoaxes.
3. Cody Buntain et. al. develops a method for automating fake news detection on Twitter
by learning to predict accuracy assessments in two credibility-focused Twitter datasets:
CREDBANK, a crowd sourced dataset of accuracy assessments for events in Twitter,
and PHEME, a dataset of potential rumors in Twitter and journalistic assessments of
their accuracies. They apply this method to Twitter content sourced from BuzzFeed's
fake news dataset. They rely on identifying highly retweeted threads of conversation and
use the features of these threads to classify stories, limiting this work's applicability to
the set of popular tweets only. Since the majority of tweets are rarely retweeted, this
method is therefore usable only on a minority of Twitter conversation threads. In their
paper, Shivam B. Parikh et al. aim to present an insight into the characterization of news
stories in the modern diaspora, combined with the differential content types of news
stories and their impact on readers. Subsequently, they dive into existing fake news
detection approaches that are heavily based on text-based analysis, and also describe
popular fake news datasets. They conclude the paper by identifying four key open
research challenges that can guide future research. It is a theoretical approach which
gives illustrations of fake news detection by analyzing psychological factors.
4. Paper Name: - Fake News Detection on Social Media: A Data Mining Perspective.
Author: - Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang and Huan Liu. In this paper
to detect fake news on social media, a data mining perspective is presented that includes
the characterization of fake news in psychology and social theories. This article looks at
two main factors responsible for the widespread acceptance of fake news by users,
which are naive realism and confirmation bias. It proposes a general two-phase data
mining framework that includes 1) feature extraction and 2) modeling, analyzing data
sets, and confusion matrix for detecting fake news.
5. Paper Name: - Media Rich Fake News Detection: A Survey. Author: - Shivam B. Parikh
and Pradeep K. Atrey. News posts on social networking sites mainly take three forms.
Text: the (multilingual) text is analyzed with the help of computational linguistics, which
focuses semantically and systematically on the composition of the text; since most
publications are in the form of text, a lot of work has been done on analyzing them.
Multimedia: several forms of media are integrated into a single post, which can include
audio, video, images, and graphics. Hyperlinks: these allow the author of the post to refer
to various sources and thus gain the trust of viewers; in practice, references are made to
other social media websites, and screenshots are inserted.
CHAPTER 3
PROCEDURE
3.1 LOGISTIC REGRESSION
The logistic regression model utilizes the logistic function, also known as the sigmoid
function, to transform a linear combination of input features into a range between 0 and 1.
This transformed output is interpreted as the probability of the event occurring. The logistic
regression algorithm aims to find the optimal coefficients for the input features that best fit
the observed data, minimizing the difference between the predicted probabilities and the
actual outcomes.
Interpreting logistic regression involves examining the odds ratio, which quantifies the
likelihood of the event happening. A key advantage of logistic regression lies in its
simplicity, interpretability, and efficiency in handling high-dimensional data. However, it
assumes a linear relationship between the input features and the log-odds of the outcome,
making it essential to preprocess data and assess assumptions. Logistic regression plays a
pivotal role in various fields, including healthcare, finance, and marketing, where
predicting binary outcomes is of paramount importance.
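As a minimal illustration of this transformation (a sketch with made-up coefficients, not the project's code), the logistic function below maps a linear combination of feature values to a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept b0 and coefficients b1..b3 for three features
b0 = -1.5
b = np.array([0.8, -0.4, 2.1])
x = np.array([1.0, 3.0, 0.5])     # feature values for one example

z = b0 + np.dot(b, x)             # linear combination of the inputs
p = sigmoid(z)                    # interpreted as the probability of the event
print(f"log-odds = {z:.3f}, probability = {p:.3f}")
```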
3.2 FLOW CHART
4. Splitting of Training and Testing Dataset: In logistic regression, the dataset is split
into training and testing sets to evaluate the model's performance on unseen data,
ensuring its generalization ability and reliability.
5. Machine Learning Algorithm: After training and testing on the data, a machine learning
algorithm such as Logistic Regression is used to predict the value.
3.3 METHODOLOGY
1. Collection of datasets: Datasets are collected from various reputable sources, including
news websites, social media platforms, and fact-checking organizations. This ensures a
diverse and comprehensive set of examples of both fake and legitimate news.
3. Pre-processing the data: Data is pre-processed to remove noise, handle missing values,
and normalize text. This includes tokenization, stemming, lemmatization, and stop-word
removal to prepare the data for analysis.
4. Split data into training dataset and testing dataset: The dataset is divided into training
and testing sets, typically using an 80-20 split. The training set is used to develop the model,
while the testing set evaluates its performance and accuracy.
6. Analysing results: The model's performance is analyzed using metrics such as accuracy,
precision, recall, and F1-score. These metrics help in assessing the effectiveness of the
model and identifying areas for improvement. A minimal end-to-end sketch of these steps
is given below.
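Since the report's original code listings are not reproduced in this text, the following is a minimal sketch of the methodology above; the file name news.csv and the column names text and label are assumptions, not the project's actual dataset schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Collect the dataset (assumed file and column names)
df = pd.read_csv("news.csv")

# 3. Pre-process: replace missing values so vectorization does not fail
df = df.fillna("")

# 4. Split the data 80-20 into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# Turn text into numeric features and fit the classifier
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

# 6. Analyse the results with accuracy, precision, recall and F1-score
y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```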
CHAPTER 4
REQUIREMENT ANALYSIS
2. SCIKIT-LEARN
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software
machine learning library for the Python programming language. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is
designed to interoperate with the Python numerical and scientific libraries NumPy
and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project. Scikit-learn is
largely written in Python, and uses NumPy extensively for high-performance linear
algebra and array operations. Furthermore, some core algorithms are written in
Cython to improve performance. Support vector machines are implemented by a
Cython wrapper around LIBSVM; logistic regression and linear support vector
machines by a similar wrapper around LIBLINEAR. In such cases, extending these
methods with Python may not be possible. Scikit-learn integrates well with many
other Python libraries, such as Matplotlib and plotly for plotting, NumPy for array
vectorization, Pandas data frames, SciPy, and many more.
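As a small illustration of this uniform design (a sketch on a tiny synthetic dataset, not code from this report), very different estimators are trained and queried through the same fit/predict interface:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Tiny synthetic dataset: six samples, two features, binary labels
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [2, 1]])
y = np.array([0, 1, 0, 0, 1, 1])

# Each estimator, whatever its internals (LIBLINEAR, Cython, LIBSVM),
# exposes the same fit/predict convention
for model in (LogisticRegression(), RandomForestClassifier(), SVC()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[1.5, 1.5]]))
```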
or functions. Functional requirements are implemented through the system design, whereas
the system architecture is used to implement the non-functional requirements.
CHAPTER 5
IMPLEMENTATION AND ANALYSIS
5.1 Library and Model Used
1. Streamlit: It is a promising open-source Python library which enables developers to build
attractive user interfaces in no time. Streamlit is the easiest way, especially for people with
no front-end knowledge, to put their code into a web application: no front-end (HTML, JS,
CSS) experience or knowledge is required.
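A minimal sketch of such an app follows; the widget labels are illustrative and the prediction call is a placeholder, not the project's exact interface:

```python
import streamlit as st

st.title("Fake News Detection System")
news_text = st.text_area("Enter the news text to verify")

if st.button("Check"):
    # Placeholder: a trained model and vectorizer would be loaded here and
    # used as: prediction = model.predict(vectorizer.transform([news_text]))
    st.write("The prediction (real or fake) would be displayed here.")
```

Saved as, say, app.py (an assumed file name), this is served as a web page with `streamlit run app.py`, with no HTML, JS or CSS written by hand.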
2. Numpy: It is the core library for scientific computing in Python. It provides powerful
tools to deal with multi-dimensional arrays and is a general-purpose array-processing
package. NumPy's main purpose is to handle multidimensional homogeneous arrays, with
tools ranging from array creation to array manipulation. It makes it easy to create an
n-dimensional array with functions such as np.zeros(), to manipulate contents with methods
such as replace, arange, random, save and load, and to process arrays with methods like
sum, mean, std, max, min and all. Arrays created with NumPy also behave differently from
ordinary Python lists when operated on with operators such as +, -, * and /. All these
qualities make NumPy highly suitable for our purpose of handling data, since the array
operations performed while predicting outputs require such capabilities.
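A brief sketch of the NumPy facilities mentioned above (illustrative values only):

```python
import numpy as np

a = np.zeros((2, 3))             # 2x3 array of zeros
b = np.arange(6).reshape(2, 3)   # values 0..5 arranged in 2 rows, 3 columns

c = a + b                        # operators act element-wise on arrays
print(c.sum(), c.mean(), c.std(), c.max(), c.min())

np.save("data.npy", c)           # save an array to disk
d = np.load("data.npy")          # and load it back
```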
3. Pandas: It is the most popular Python library for data analysis. It provides highly
optimized performance, with back-end source code written purely in C or Python. Data in
Python can be analysed in two ways:
∙ Series
∙ Dataframes
A Series is a one-dimensional array defined in pandas that can store any data type.
Dataframes are two-dimensional data structures used to store data in rows and columns.
Pandas dataframes are used extensively in this project to handle the datasets required for
training and testing the algorithms, and they make it easier to work with attributes and
results. Several built-in functions, such as replace, were used in our project for data
manipulation and preprocessing.
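A short sketch of the two pandas structures and the replace function mentioned above (the label mapping is an assumption for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30])                        # one-dimensional Series

df = pd.DataFrame({"title": ["story A", "story B"],
                   "label": ["FAKE", "REAL"]})     # two-dimensional DataFrame

# replace() maps string labels to numeric classes for the classifier
df["label"] = df["label"].replace({"FAKE": 0, "REAL": 1})
print(df)
```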
4. re: A regular expression (or RE) specifies a set of strings that matches it; the functions in
this module let you check if a particular string matches a given regular expression (or if a
given regular expression matches a particular string, which comes down to the same thing).
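For instance (an illustrative snippet, not from the report), re.sub can strip everything except letters from a headline before further processing:

```python
import re

headline = "BREAKING!!! Scientists discover 42 new planets???"
# Replace every character that is not a letter or whitespace with a space
cleaned = re.sub(r"[^a-zA-Z\s]", " ", headline).lower()
print(cleaned)
```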
5. nltk: NLTK is widely used by researchers, developers, and data scientists worldwide to
develop NLP applications and analyze text data. One of the major advantages of using
NLTK is its extensive collection of corpora, which includes text data from various sources
such as books, news articles, and social media platforms.
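A small sketch of one such corpus in use, the English stop-word list (the download call fetches the corpus on first run):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                   # one-time download of the corpus
stop_words = set(stopwords.words("english"))

words = "this is not a real news story".split()
filtered = [w for w in words if w not in stop_words]
print(filtered)
```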
6. sklearn: Scikit-learn is an open-source Python library which implements a huge range of
machine learning, pre-processing, cross-validation and visualization algorithms. It features
simple and efficient tools for data mining and data processing, including various
classification, regression and clustering algorithms such as support vector machines,
random forest classifiers, decision trees, Gaussian naïve Bayes and KNN, to name a few. In
this project we used sklearn to take advantage of built-in classification algorithms such as
decision trees, random forest classifiers, KNN and naïve Bayes. We also used built-in
cross-validation and evaluation features such as the classification report, confusion matrix
and accuracy score.
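As an illustration of those evaluation features (made-up labels, where 0 = fake and 1 = real):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Hypothetical true and predicted labels for eight test articles
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```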
Logistic Regression:
logit(p) = ln(p / (1 − p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
Logistic regression is a statistical model that in its basic form uses a logistic function to
model a binary dependent variable, although many more complex extensions exist. In
regression analysis, logistic regression (or logit regression) estimates the parameters
of a logistic model (a form of binary regression). Mathematically, a binary logistic model
has a dependent variable with two possible values, such as pass/fail which is represented
by an indicator variable, where the two values are labeled "0" and "1". In the logistic model,
the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination
of one or more independent variables ("predictors"); the independent variables can each be
a binary variable (two classes, coded by an indicator variable) or a continuous variable (any
real value). The corresponding probability of the value labeled "1" can vary between 0
(certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function
that converts log-odds to probability is the logistic function, hence the name. The unit of
measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative
names. Analogous models with a different sigmoid function instead of the logistic
function can also be used, such as the probit model; the defining characteristic of the
logistic model is that increasing one of the independent variables multiplicatively scales
the odds of the given outcome at a constant rate, with each independent variable having its
own parameter; for a binary dependent variable this generalizes the odds ratio.
In a binary logistic regression model, the dependent variable has two levels (categorical).
Outputs with more than two values are modeled by multinomial logistic regression and, if
the multiple categories are ordered, by ordinal logistic regression (for example the
proportional odds ordinal logistic model). The logistic regression model itself simply
models probability of output in terms of input and does not perform statistical classification
(it is not a classifier), though it can be used to make a classifier, for instance by choosing a
cutoff value and classifying inputs with probability greater than the cutoff as one class
and those below the cutoff as the other; this is a common way to make a binary classifier.
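A short worked example of this log-odds-to-probability conversion and the cutoff rule (the log-odds value is made up):

```python
import math

def log_odds_to_probability(z):
    """Invert logit(p) = ln(p / (1 - p)), giving p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = 0.9                          # hypothetical log-odds produced by the model
p = log_odds_to_probability(z)   # approximately 0.711
label = 1 if p > 0.5 else 0      # a 0.5 cutoff turns the model into a classifier
print(f"p = {p:.3f} -> class {label}")
```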
3. Replacing null value with empty string:
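The code screenshot for this step is not reproduced in this text; a likely form, assuming the dataframe is named df, is:

```python
# Replace every null value with an empty string so that text columns
# can be concatenated and vectorized without errors (df is assumed)
df = df.fillna("")
```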
4. Text Pre-processing:
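The pre-processing listing is likewise not reproduced; the sketch below shows a common stemming-based routine consistent with the steps named in the methodology (regex cleanup, lowercasing, stop-word removal, stemming), assuming the NLTK stopwords corpus has been downloaded:

```python
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))   # needs nltk.download("stopwords")

def preprocess(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    words = text.lower().split()               # lowercase and tokenize
    words = [stemmer.stem(w) for w in words
             if w not in stop_words]           # drop stop words, stem the rest
    return " ".join(words)

print(preprocess("The senators were voting on the new bills."))
```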
6. Splitting of Training and Testing Dataset:
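A sketch of this step, reusing the assumed df with text and label columns; fitting the vectorizer on the training portion only keeps information from the test set out of training:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.2,            # 80-20 split, as described in the methodology
    stratify=df["label"],     # keep the fake/real ratio the same in both sets
    random_state=42)          # fixed seed for reproducibility

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_vec = vectorizer.transform(X_test)        # reuse the same vocabulary
```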
The GUI made for this project is a simple Streamlit interface consisting of labels, buttons,
text and a title.
Fig: When news is true
CHAPTER 6
RESULT AND SCOPE FOR FUTURE WORK
6.1 OBSERVATION
6.2 SCOPE FOR FUTURE WORK
1. Enhanced Data Collection
Future work should focus on expanding the dataset to include a broader range of sources
and languages. Incorporating international news and non-English content will make the
system more versatile and capable of detecting fake news across different cultural contexts.
Additionally, continually updating the dataset with recent news articles can help the system
adapt to emerging misinformation trends.
2. Advanced Feature Extraction
Improving the feature extraction process is crucial. Future research can explore the
inclusion of more sophisticated linguistic and contextual features, such as discourse
analysis and topic modeling. These advanced features can capture the nuances of fake news
more effectively, leading to better classification performance.
3. Multi-Modal Analysis
Incorporating multi-modal data, such as images and videos, alongside textual analysis, can
significantly enhance fake news detection. Future systems could leverage image
recognition and video analysis techniques to verify the authenticity of multimedia content,
providing a more comprehensive detection capability.
4. User Feedback and Interaction
Developing mechanisms for user feedback and interaction can help improve the system
over time. By allowing users to flag potentially fake news and providing explanations for
classification decisions, the system can learn from user input and become more accurate.
This interactive approach can also increase user trust and engagement.
5. Robustness Against Adversarial Attacks
Future research should focus on enhancing the system's robustness against adversarial
attacks, where malicious actors intentionally manipulate news articles to evade detection.
Developing techniques to identify and counteract these tactics is essential for maintaining
the integrity of the detection system.
6. Ethical Considerations and Bias Mitigation
Addressing ethical concerns and potential biases in the detection system is crucial. Future
work should include developing transparent algorithms and ensuring that the system does
not inadvertently censor legitimate news or reflect any biases. Implementing fairness and
accountability measures will help in building a trustworthy and ethical detection system.
7. Real-Time Detection and Scalability
Improving the system's ability to detect fake news in real time and scaling it to handle large
volumes of data is another important area for future development. Optimizing algorithms
for speed and efficiency will enable the system to be deployed in high-traffic environments,
such as social media platforms and news aggregators.
CHAPTER 7
CONCLUSION
Due to the increasing use of the internet, it is now easy to spread fake news. A huge number
of people are regularly connected to the internet and social media platforms, and there is
no restriction on posting news on these platforms. Some people therefore take advantage
of them and start spreading fake news against individuals or organizations. This can destroy
the reputation of an individual or affect a business. Through fake news, people's opinions
can also be swayed in favour of a political party. There is thus a need for a way to detect
such fake news. Machine learning classifiers are used for many different purposes, and
they can also be used for detecting fake news. The classifiers are first trained with a data
set called the training data set. After that, they can automatically detect fake news.
The data used in our work was collected from Kaggle.com and contains news articles from
various domains, so as to cover most kinds of news rather than specifically classifying
political news. The learning models were trained and parameter-tuned to obtain optimal
accuracy.
Fake news detection has many open issues that require the attention of researchers. For
instance, in order to reduce the spread of fake news, identifying the key elements involved
in the spread of news is an important step. Machine learning techniques can be employed
to identify the key sources involved in the spread of fake news.
CHAPTER 8
REFERENCES
1. M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," 2017
IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON),
Kiev, 2017, pp. 900-903.
5. Jadhav, S. S., & Thepade, S. D. (2019). Fake news identification and classification
using DSSM and improved recurrent neural network classifier. Applied Artificial
Intelligence, 33(12), 1058-1068. https://doi.org/10.1080/08839514.2019.1661579
6. Kaliyar, R. K., Goswami, A., Narang, P., & Sinha, S. (2020). FNDNet–A deep
convolutional neural network for fake news detection. Cognitive Systems Research, 61,
32-44. https://doi.org/10.1016/j.cogsys.2019.12.005
7. Kaur, S., Kumar, P. & Kumaraguru, P. (2020). Automating fake news detection system
using multi-level voting model. Soft Computing, 24(12), 9049–9069.
https://doi.org/10.1007/s00500-019-04436-y