Tracking and Tracing of Fake News Using URL Report-1

Uploaded by

Sivapriya P
TRACKING AND TRACING OF FAKE NEWS

USING URL

A PROJECT REPORT
Submitted by

DEEPTHI V (113218104024)
GEETA LAKSHMI P (113218104035)
POOJA THANUSHREE K (113218104091)

In partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE ENGINEERING

VELAMMAL ENGINEERING COLLEGE


ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2022

ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “TRACKING AND TRACING OF FAKE NEWS
USING URL” is the bonafide work of DEEPTHI V (113218104024),
GEETA LAKSHMI P (113218104035), POOJA THANUSHREE K
(113218104091) who carried out the project work under my supervision.

SIGNATURE
Dr. B. MURUGESHWARI
HEAD OF THE DEPARTMENT
Professor
Computer Science and Engineering
Velammal Engineering College,
Ambattur - Red hills Road,
Chennai - 600066.

SIGNATURE
Dr. P. PRITTO PAUL
SUPERVISOR
Associate Professor
Computer Science and Engineering
Velammal Engineering College,
Ambattur - Red hills Road,
Chennai - 600066.

ANNA UNIVERSITY: CHENNAI-600025

VIVA VOCE EXAMINATION

The Viva Voce Examination of this project work “TRACKING AND TRACING OF
FAKE NEWS USING URL” is a bonafide record of the project done
at the Department of Computer Science and Engineering, Velammal Engineering
College during the academic year 2021 - 2022 by

DEEPTHI V (113218104024)
GEETA LAKSHMI P (113218104035)
POOJA THANUSHREE K (113218104091)

of Final year, Bachelor of Engineering in Computer Science and Engineering,
submitted for the university examination on

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

I gratefully acknowledge the significant contribution of the management of our
college, our Chairman, Dr. M.V. Muthuramalingam, and our
Chief Executive Officer, Thiru. M.V.M. Velmurugan, and thank them for their extensive support.
I would like to thank Dr. S. Satish Kumar, Principal of Velammal
Engineering College, for giving me the opportunity to do this project.
I wish to express my gratitude to our Head of the Department,
Dr. B. Murugeshwari, for her moral support and for the valuable and innovative
suggestions, constructive interaction, constant encouragement and unending help
that have enabled me to complete the project.
I wish to express my humble thanks to our Project Coordinators,
Dr. S. GunaSundari, Dr. P. Pritto Paul and Dr. S. Rajalakshmi, Department of
Computer Science and Engineering, for their invaluable guidance in shaping this
project.
I wish to express my sincere gratitude to my Internal Guide, Dr. P. Pritto
Paul, Associate Professor, Department of Computer Science and Engineering, for
her guidance, without which this project would not have been possible.
I am grateful to all the staff members of the Department of Computer
Science and Engineering for providing the facilities necessary to carry out the
project. I would especially like to thank my parents for providing me with the
unique opportunity to work and for their encouragement and support at all levels.
Finally, my heartfelt thanks to The Almighty for guiding me throughout my life.

ABSTRACT

With the increasing popularity of social media, people have changed the way they
access news. The rapid increase in fake news driven by the extensive use of social
media is a major problem in today's world. Fake news can relate to any topic, including
policy, crime, health, or pandemics such as Covid-19. This kind of false
information pollutes the Internet, leads to social disruption, and causes
users to lose their trust in legitimate online news outlets and sources. To limit its
spread at the individual level, social media communities actively work to address
the danger posed by online misinformation. Some fake news is so
similar to real news that it is difficult for humans to identify it. Therefore,
automated fake news detection tools have become a crucial requirement. In this system
we evaluate the fakeness and realness of news using five machine learning models and
two deep learning models, combined through a novel stacking model. The ensemble of
machine learning algorithms (SVM, DT, RF, NB, LR) helps to extract different features and
increase the efficiency of the system. A hybrid neural network architecture that combines
the capabilities of CNN and Bidirectional LSTM with RNN is used with two different
dimensionality reduction approaches; it can detect complex patterns in textual data.
LSTM is a gated recurrent neural network used to analyze variable-length
sequential data. The system works for a wide variety of real-time links, ranging from
online social media such as Facebook, Twitter, Instagram, and Google Sites to fake
blogs and fake websites that deceive users in one way or another. The system also
dynamically collects datasets from users and allows links that produce fake news
to be reported to the cybercrime authorities. Thus, the novel stacking approach
improves the performance of the system.

TABLE OF CONTENTS

CHAPTER NO. TITLE

ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 PURPOSE OF THE PROJECT
1.2 SCOPE OF THE PROJECT
1.3 DOMAIN INTRODUCTION
1.3.1 MACHINE LEARNING
1.3.2 DEEP LEARNING

2 LITERATURE SURVEY
2.1 INTRODUCTION
2.2 A NOVEL STACKING APPROACH FOR ACCURATE DETECTION OF FAKE NEWS
2.3 A SMART SYSTEM FOR FAKE NEWS DETECTION USING MACHINE LEARNING
2.4 DETECTING FAKE NEWS WITH CAPSULE NEURAL NETWORKS
2.5 FAKE NEWS STANCE DETECTION USING DEEP LEARNING ARCHITECTURE (CNN-LSTM)
2.6 FAKE NEWS DETECTION USING DEEP LEARNING MODELS: A NOVEL APPROACH
2.7 FNDNet - A DEEP CONVOLUTIONAL NEURAL NETWORK FOR FAKE NEWS DETECTION

3 SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.1.1 DRAWBACKS
3.2 PROPOSED SYSTEM
3.2.1 ADVANTAGES
3.3 ALGORITHMS
3.3.1 RANDOM FOREST
3.3.2 DECISION TREE
3.3.3 NAIVE BAYES
3.3.4 K NEAREST NEIGHBOR
3.3.5 SUPPORT VECTOR MACHINE
3.3.6 CONVOLUTION NEURAL NETWORK
3.3.7 BIDIRECTIONAL LONG SHORT TERM MEMORY
3.3.8 BIDIRECTIONAL RECURRENT NEURAL NETWORK

4 SYSTEM SPECIFICATION
4.1 SOFTWARE SPECIFICATION
4.2 HARDWARE SPECIFICATION

5 SYSTEM DESIGN
5.1 ARCHITECTURE DIAGRAM
5.2 UML DIAGRAMS
5.2.1 USE CASE DIAGRAM
5.2.2 CLASS DIAGRAM
5.2.3 SEQUENCE DIAGRAM
5.2.4 COLLABORATION DIAGRAM
5.2.5 ACTIVITY DIAGRAM

6 SYSTEM IMPLEMENTATION
6.1 MODULES
6.2 MODULE DESCRIPTION
6.2.1 DATA PREPROCESSING
6.2.2 TRAIN BCNN
6.2.3 TRAIN ARIMA
6.2.4 ENSEMBLE ALGORITHM
6.2.5 REPORTING

7 TESTING
7.1 INTRODUCTION
7.2 TESTING OBJECTIVES
7.3 TYPES OF TESTING
7.3.1 UNIT TESTING
7.3.2 INTEGRATION TESTING

8 CONCLUSION AND FUTURE WORK
8.1 CONCLUSION
8.2 FUTURE ENHANCEMENT

APPENDIX I - SOURCE CODE
APPENDIX II - SNAPSHOTS
REFERENCES
TECHNICAL PROJECT OUTCOMES

LIST OF FIGURES

F.NO NAME OF THE FIGURE

1.1 Machine Learning
1.2 Deep Learning
3.1 Convolution Neural Network
3.2 Bi-LSTM
3.3 Bi-RNN
5.1 Architecture Diagram
5.2.1 Use Case Diagram
5.2.2 Class Diagram
5.2.3 Sequence Diagram
5.2.4 Collaboration Diagram
5.2.5 Activity Diagram
6.1 Data Preprocessing
9.1 Fake News Link
9.2 Realness & Fakeness of Link
9.3 Adding to Dataset
9.4 Real News Link
9.5 Realness & Fakeness of Link
9.6 Adding to Dataset
9.7 Mail Sent Authentication
9.8 Mail
LIST OF ABBREVIATIONS

ABBREVIATION EXPANSION

RF Random Forest
DT Decision Tree
NB Naïve Bayes
KNN K Nearest Neighbor
SVM Support Vector Machine
CNN Convolution Neural Network
LSTM Long Short Term Memory
Bi-LSTM Bidirectional Long Short Term Memory
RNN Recurrent Neural Network
TF-IDF Term Frequency Inverse Document Frequency
URL Uniform Resource Locator
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
PCA Principal Component Analysis
SVD Singular Value Decomposition
DNN Deep Neural Network
GUI Graphical User Interface
NLP Natural Language Processing
IDE Integrated Development Environment
API Application Programming Interface
UML Unified Modeling Language
CHAPTER 1

INTRODUCTION

1.1 PURPOSE OF THE PROJECT

With the advancement of technology, information is freely accessible to everyone, but
the credibility of that information depends on many factors. An enormous amount of
information is published online every day, and it is not easy to tell whether a given
piece of information is true or false. Doing so requires careful study and analysis of the
story, which includes checking the facts by assessing the supporting sources, finding the
original source of the information, or checking the credibility of the authors. Fabricated
information is a deliberate attempt to damage or favor the reputation of an
organization, entity, or individual, or it may simply be motivated by
financial or political gain.

There are several approaches to handling the problem of misinformation on social media.
Existing systems do not work on real-time data, and they lack credibility because
the efficiency of their classifiers is limited to the datasets they were trained on.
Statistical techniques are used to identify correlations between various features of the
information, to analyze the originator of the information, and to analyze patterns of
dissemination. Machine learning algorithms are used to classify unreliable content and to
analyze the accounts that share such content.

The goal of this project is to build a model that can report the realness and fakeness of
any given URL. The model is built using a novel stacking approach with machine learning
and deep learning algorithms to increase performance. The system also provides a way to
report the URLs of sites or blogs that provide wrong information, thus helping to
eradicate those sites and giving people access to the right
information.

1.2 SCOPE OF THE PROJECT:

Recent advances in technology have helped to develop software that distinguishes fake
from real information. The scope of this project is diverse:
• It covers links from online social media such as Facebook, Twitter, and Instagram, as
well as fake blogs and fake websites that deceive users in one way or another.
• As a web platform, the system is available 24/7, except during
temporary server issues, which are expected to be minimal.
• It is easily accessible to users, as it needs only a URL to perform a check.
• The admin is allowed to report fake URLs to the cybercrime authorities.

1.3 DOMAIN INTRODUCTION:

1.3.1 MACHINE LEARNING:

Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn,
gradually improving its accuracy. Machine learning is an important component of the
growing field of data science. Through the use of statistical methods, algorithms are
trained to make classifications or predictions, uncovering key insights within data
mining projects.
The learning system of a machine learning algorithm can be divided into three main parts:
1. A Decision Process: In general, machine learning algorithms are used to make a
prediction or classification. Based on some input data, which can be labeled or
unlabeled, the algorithm produces an estimate about a pattern in the data.
2. An Error Function: An error function serves to evaluate the prediction of the
model. If there are known examples, an error function can make a comparison to
assess the accuracy of the model.
3. A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm repeats this evaluate-and-optimize
process, updating weights autonomously until a threshold of accuracy has
been met.
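
The three parts above can be illustrated with a minimal sketch. It assumes a one-feature linear model trained by gradient descent on a mean squared error; the data, learning rate, and stopping threshold are illustrative assumptions and are not taken from the project code.

```python
# Sketch of the decision / error / optimization loop for a tiny linear model.
# All names and hyperparameters here are illustrative assumptions.

def predict(w, b, x):
    """Decision process: estimate y from the input x."""
    return w * x + b

def mean_squared_error(w, b, xs, ys):
    """Error function: compare predictions with the known examples."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, learning_rate=0.05, threshold=1e-6, max_steps=10_000):
    """Optimization process: adjust the weights until the error is small enough."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        if mean_squared_error(w, b, xs, ys) < threshold:
            break
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit(xs, ys)
```

On this toy data the loop stops once the error falls below the threshold, with w close to 2 and b close to 1.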
SUPERVISED MACHINE LEARNING:

Supervised learning, also known as supervised machine learning, is defined by its use of
labeled datasets to train algorithms to classify data or predict outcomes accurately.
As input data is fed into the model, it adjusts its weights until the model has been fitted
appropriately. This occurs as part of the cross-validation process to ensure that the
model avoids overfitting or underfitting. Supervised learning helps organizations solve
for a variety of real-world problems at scale, such as classifying spam in a separate
folder from your inbox. Some methods used in supervised learning include neural
networks, naïve bayes, linear regression, logistic regression, random forest, support
vector machine (SVM), and more.

Fig 1.1 Machine Learning


UNSUPERVISED MACHINE LEARNING:

Unsupervised learning, also known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover
hidden patterns or data groupings without the need for human intervention. Its ability to
discover similarities and differences in information makes it well suited to
exploratory data analysis, cross-selling strategies, customer segmentation, image and
pattern recognition. It’s also used to reduce the number of features in a model through
the process of dimensionality reduction; principal component analysis (PCA) and
singular value decomposition (SVD) are two common approaches for this. Other
algorithms used in unsupervised learning include neural networks, k-means clustering,
probabilistic clustering methods, and more.

1.3.2 DEEP LEARNING:

Deep learning can be considered a subset of machine learning. It is a field
based on learning and improving on its own by examining computer algorithms. While
machine learning uses simpler concepts, deep learning works with artificial neural
networks, which are designed to imitate how humans think and learn. Deep learning has
aided image classification, language translation, and speech recognition. It can be
applied to a wide range of pattern recognition problems without human intervention.
Artificial neural networks, comprising many layers, drive deep learning. Deep Neural
Networks (DNNs) are such types of networks where each layer can perform complex
operations such as representation and abstraction that make sense of images, sound, and
text. Considered the fastest-growing field in machine learning, deep learning represents
a truly disruptive digital technology, and it is being used by increasingly more
companies to create new business models.

Fig 1.2 Deep Learning


Deep learning drives many artificial intelligence (AI) applications and services that
improve automation, performing analytical and physical tasks without human
intervention. Deep learning technology lies behind everyday products and services (such
as digital assistants, voice-enabled TV remotes, and credit card fraud detection) as well
as emerging technologies (such as self-driving cars).
Deep learning neural networks, or artificial neural networks, attempt to mimic the
human brain through a combination of data inputs, weights, and biases. These elements
work together to accurately recognize, classify, and describe objects within the data.
Passing input data forward through the layers to produce a prediction is called forward
propagation. Another process called backpropagation uses algorithms, like gradient
descent, to calculate errors in predictions and then adjusts the weights and biases of the
function by moving backwards through the layers in an effort to train the model. Together,
forward propagation and backpropagation allow a neural network to make predictions and
correct for any errors accordingly. Different types of neural networks address different
problems. For example:
• Convolutional neural networks (CNNs), used primarily in computer vision and
image classification applications, can detect features and patterns within an image,
enabling tasks like object detection or recognition. In 2015, a CNN bested a human
in an object recognition challenge for the first time.
• Recurrent neural networks (RNNs) are typically used in natural language and
speech recognition applications, as they leverage sequential or time series data.

CHAPTER 2

LITERATURE SURVEY

2.1 INTRODUCTION:

Fake news detection has been actively researched in recent years. This chapter provides
an up-to-date review of fake news detection research. We first present an overview of
fake news detection and its applications in tracking and tracing. Then, a literature review
of the most recent techniques is presented, and the limitations of tracking and
tracing of fake news are explained.

2.2 A Novel Stacking Approach for Accurate Detection of Fake News:

Author: Tao Jiang, Jian Ping Li, Amin Ul Haq, Abdus Saboor, and Amjad Ali
Abstract:
With the increasing popularity of social media, people have changed the way they
access news. News online has become the major source of information for people.
However, much information appearing on the Internet is dubious and even intended to
mislead. Some fake news is so similar to the real ones that it is difficult for human to
identify them. Therefore, automated fake news detection tools like machine learning and
deep learning models have become an essential requirement. In this paper, we evaluated
the performance of five machine learning models and three deep learning models on two
fake and real news datasets of different size with hold out cross validation. We also used
term frequency, term frequency-inverse document frequency and embedding techniques
to obtain text representation for machine learning and deep learning models
respectively. To evaluate models’ performance, we used accuracy, precision, recall and
F1-score as the evaluation metrics and a corrected version of McNemar’s test to
determine if models’ performance is significantly different. Then, we proposed our
novel stacking model which achieved testing accuracy of 99.94% and 96.05%
respectively on the ISOT dataset and
KDnugget dataset. Furthermore, the performance of our proposed method is high as
compared to baseline methods. Thus, we highly recommend it for fake news detection.

2.3 A smart system for fake news detection using machine learning

Author: A. Jain, A. Shakya, H. Khatter, and A. K. Gupta
Abstract:
Most of the smart phone users prefer to read the news via social media over internet.
The news websites are publishing the news and provide the source of authentication.
The question is how to authenticate the news and articles which are circulated among
social media like WhatsApp groups, Facebook Pages, Twitter and other micro blogs &
social networking sites. It is harmful for the society to believe on the rumors and pretend
to be a news. The need of an hour is to stop the rumors especially in the developing
countries like India, and focus on the correct, authenticated news articles. This paper
demonstrates a model and the methodology for fake news detection. With the help of
Machine learning and natural language processing, it is tried to aggregate the news and
later determine whether the news is real or fake using Support Vector Machine. The
results of the proposed model are compared with existing models. The proposed model
is working well and defining the correctness of results up to 93.6% of accuracy.

2.4 Detecting Fake News with Capsule Neural Networks


Author: Mohammad Hadi Goldani, Saeedeh Momtazi , Reza Safabakhsh.
Abstract:
Fake news is dramatically increased in social media in recent years. This has prompted
the need for effective fake news detection algorithms. Capsule neural networks have
been successful in computer vision and are receiving attention for use in Natural
Language Processing (NLP). This paper aims to use capsule neural networks in the fake
news detection task. We use different embedding models for news items of different
lengths. Static word embedding is used for short news items, whereas non-static word
embeddings that allow incremental up-training and updating in the training phase are
used for medium length or large news statements. Moreover, we apply different levels
of n-grams for feature extraction. Our proposed architectures are evaluated on two
recent well-known datasets in the field, namely ISOT and LIAR. The results show
encouraging performance, outperforming the state-of-the-art methods by 7.8% on ISOT
and 3.1% on the validation set, and 1% on the test set of the LIAR dataset.

2.5 Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)

Author: Muhammad Umer, Saleem Ullah, Arif Mehmood, Gyu Sang Choi.
Abstract:
Society and individuals are negatively influenced both politically and socially by the
widespread increase of fake news either way generated by humans or machines. In the
era of social networks, the quick rotation of news makes it challenging to evaluate its
reliability promptly. Therefore, automated fake news detection tools have become a
crucial requirement. To address the aforementioned issue, a hybrid Neural Network
architecture, that combines the capabilities of CNN and LSTM, is used with two
different dimensionality reduction approaches, Principle Component Analysis (PCA)
and Chi- Square. This work proposed to employ the dimensionality reduction techniques
to reduce the dimensionality of the feature vectors before passing them to the classifier.
To develop the reasoning, this work acquired a dataset from the Fake News Challenges
website which has four types of stances: agree, disagree, discuss, and unrelated. The
nonlinear features are fed to PCA and chi-square which provides more contextual
features for fake news detection. The motivation of this research is to determine the
relative stance of a news article towards its headline. The proposed model improves
results by ∼ 4% and ∼ 20% in terms of Accuracy and F1 − score. The experimental
results show that PCA outperforms than Chi-square and state-of-the-art methods with
97.8% accuracy.

2.6 Fake news detection using deep learning models: A novel approach

Author: S. Kumar, R. Asthana, S. Upadhyay, N. Upreti, and M. Akbar
Abstract:
With the ever increase in social media usage, it has become necessary to combat the
spread of false information and decrease the reliance of information retrieval from such
sources. Social platforms are under constant pressure to come up with efficient methods
to solve this problem because users' interaction with fake and unreliable news leads to
its spread at an individual level. This spreading of misinformation adversely affects the
perception about an important activity, and as such, it needs to be dealt with using a
modern approach. In this paper, we collect 1356 news instances from various users via
Twitter and media sources such as PolitiFact and create several datasets for the real and
the fake news stories. Our study compares multiple state-of-the-art approaches such as
convolutional neural networks (CNNs), long short-term memories (LSTMs), ensemble
methods, and attention mechanisms. We conclude that CNN + bidirectional LSTM
ensembled network with attention mechanism achieved the highest accuracy of 88.78%,
whereas Ko et al tackled the fake news identification problem and achieved a detection
rate of 85%.

2.7 FNDNet-A deep convolutional neural network for fake news detection

Author: R. K. Kaliyar, A. Goswami, P. Narang, and S. Sinha
Abstract:
With the increasing popularity of social media and web-based forums, the distribution of
fake news has become a major threat to various sectors and agencies. This has abated
trust in the media, leaving readers in a state of perplexity. There exists an enormous
assemblage of research on the theme of Artificial Intelligence (AI) strategies for fake
news detection. In the past, much of the focus has been given on classifying online
reviews and freely accessible online social networking-based posts. In this work, we
propose a deep convolutional neural network (FNDNet) for fake news detection.
Instead of relying on hand-crafted features, our model (FNDNet) is designed to
automatically learn the
discriminatory features for fake news classification through multiple hidden layers built
in the deep neural network. We create a deep Convolutional Neural Network (CNN) to
extract several features at each layer. We compare the performance of the proposed
approach with several baseline models. Benchmarked datasets were used to train and
test the model, and the proposed model achieved state-of-the-art results with an
accuracy of 98.36% on the test data. Various performance evaluation parameters such as
Wilcoxon, false positive, true negative, precision, recall, F1, and accuracy, etc. were
used to validate the results. These results demonstrate significant improvements in the
area of fake news detection as compared to existing state-of-the-art results and affirm
the potential of our approach for classifying fake news on social media. This research
will assist researchers in broadening the understanding of the applicability of CNN-
based deep models for fake news detection.

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM:

A few datasets and tools for fake news classification are publicly available on the
Internet, such as BuzzFeed News, LIAR, and B.S. Detector. B.S. Detector is a browser
extension that warns users about unreliable news sources. It searches all links on a given
webpage for references to unreliable sources, checking them against a manually compiled
list of domains. It then provides visual warnings about the presence of questionable links
or the browsing of questionable websites.
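
The domain check described above can be sketched as follows. This is an illustration of the idea only and is not B.S. Detector's actual code; the blocklist entries and the names `LinkCollector`, `domain_of`, and `flag_unreliable_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical, manually compiled list of unreliable domains.
UNRELIABLE_DOMAINS = {"fake-news.example", "hoax.example"}

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def domain_of(url):
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def flag_unreliable_links(html, blocklist=UNRELIABLE_DOMAINS):
    """Return the links on the page whose domain is on the blocklist."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if domain_of(link) in blocklist]

page = (
    '<p><a href="https://www.hoax.example/story">story</a> '
    '<a href="https://news.example/report">report</a> '
    '<a href="/about">about</a></p>'
)
flagged = flag_unreliable_links(page)
```

A lookup of this kind only recognizes domains that are already on the list, which is one reason the proposed system classifies the content behind a URL instead.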
There are systems that use one or a few machine learning or deep learning
classifier algorithms to distinguish fake news from real news.
These systems take the news text as input to perform the classification.

3.1.1 DRAWBACKS:

• Existing systems lack accuracy because they use either a machine learning or a deep
learning technique alone.
• For example, gradient boosting provided state-of-the-art results and achieved an
accuracy of 86% on the Fake News Challenge dataset.
• In deep learning, a CNN-based deep neural network called FNDNet achieved
state-of-the-art results with an accuracy of 98.36% on the Kaggle fake news dataset.
• They use news headlines as input rather than a URL.
• There is no platform for reporting fake news.

3.2 PROPOSED SYSTEM:

In this system we use machine learning and deep learning algorithms to train and test
the models. Deep learning techniques like Bidirectional CNN and LSTM are used to
train the models, and ensemble machine learning algorithms are used to improve
classification accuracy. We use a novel stacking method to improve on the
performance of the individual models.
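
A minimal sketch of the stacking idea follows: base models are trained first, and a meta-model is then trained on their predictions. The base learners (one-feature threshold rules), the logistic-regression meta-learner, and the toy data are assumptions chosen to keep the example small; they are not the models or data used in this project.

```python
import math

def train_stump(rows, labels, feature):
    """Base learner: best single threshold on one feature (illustrative only)."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        preds = [1 if row[feature] >= threshold else 0 for row in rows]
        accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if best is None or accuracy > best[0]:
            best = (accuracy, threshold)
    threshold = best[1]
    return lambda row: 1 if row[feature] >= threshold else 0

def train_logistic(rows, labels, learning_rate=0.5, epochs=500):
    """Meta-learner: logistic regression fitted by stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - learning_rate * error * v for w, v in zip(weights, row)]
            bias -= learning_rate * error

    def predict(row):
        z = bias + sum(w * v for w, v in zip(weights, row))
        return 1 if z >= 0 else 0

    return predict

def train_stacking(base_rows, base_labels, meta_rows, meta_labels):
    """Train base models on one split and the meta-model on a held-out split."""
    n_features = len(base_rows[0])
    base_models = [train_stump(base_rows, base_labels, f) for f in range(n_features)]
    meta_inputs = [[model(row) for model in base_models] for row in meta_rows]
    meta_model = train_logistic(meta_inputs, meta_labels)
    return lambda row: meta_model([model(row) for model in base_models])

# Toy data: label 1 ("fake") only when both feature values are high.
base_rows = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.8, 0.9], [0.7, 0.7], [0.3, 0.3]]
base_labels = [0, 0, 0, 1, 1, 0]
meta_rows = [[0.9, 0.2], [0.1, 0.9], [0.9, 0.8], [0.2, 0.1], [0.75, 0.95]]
meta_labels = [0, 0, 1, 0, 1]
classify = train_stacking(base_rows, base_labels, meta_rows, meta_labels)
```

Each base rule alone misclassifies some samples, while the meta-model learns to predict "fake" only when both rules agree. Training the meta-model on a held-out split keeps it from simply memorizing the base models' training-set behavior.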
3.2.1 ADVANTAGES:

• URLs from social media such as Facebook and Twitter, and any real-time links opened in
Chrome, can be used as input; the actual content read from the URL is used to
train and test the model.
• The system classifies fake and real news in real time.
• The same URLs can be used to report the fake news.
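
Reading the content behind a URL, as in the first point above, might look like the sketch below. It uses only the Python standard library; the names `VisibleTextExtractor`, `html_to_text`, and `fetch_article_text` are hypothetical, and the project's own extraction code may differ.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class VisibleTextExtractor(HTMLParser):
    """Collects page text while skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def html_to_text(html):
    """Convert an HTML document into plain text for the classifier."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()

def fetch_article_text(url, timeout=10):
    """Download a page and return its visible text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)

sample = (
    "<html><head><style>p{color:red}</style><script>var a = 1;</script></head>"
    "<body><h1>Title</h1><p>Some news.</p></body></html>"
)
plain = html_to_text(sample)
```

The extracted text can then be passed through the same preprocessing as the training data before classification.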

3.3 ALGORITHMS:

3.3.1 Random Forest:

Random Forest is an ensemble built by bagging unpruned decision trees, with a
randomized selection of features at each split. Each individual tree in the random forest
produces a prediction, and the prediction with the most votes is the final prediction.
According to the No Free Lunch theorem, no algorithm is always the most
accurate; by combining many trees, RF tends to be more accurate and robust than the
individual classifiers. The random forest model F(x) outputs the target category j that
receives the most votes, where the votes are counted with a characteristic (indicator)
function. To ensure the diversity of the decision trees, both the sample selection
of the random forest and the candidate attributes for node splitting are randomized.
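
The majority vote described above can be written in its standard form as follows. The symbols T (the number of trees), h_t (the t-th tree), and I (the indicator function) are assumptions introduced here for illustration and are not taken from the report.

```latex
F(x) = \arg\max_{j} \sum_{t=1}^{T} I\bigl(h_t(x) = j\bigr)
```

Here I(·) equals 1 when tree h_t predicts category j for input x and 0 otherwise, so the sum counts the votes for j.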
3.3.2 Decision Tree:

DT is an important supervised learning algorithm. Researchers tend to use tree-based
ensemble models like Random Forest or Gradient Boosting on all kinds of tasks. The
basic idea of DT is to build a model that predicts the value of a dependent variable by
learning decision rules inferred from the training data. A decision tree has a top-down,
tree-like structure in which each node is either a leaf node, which is
bound to a class label, or a decision node, which is responsible for making
decisions.
A decision tree makes its decisions and predictions in a way that is easy to understand.
However, it is a weak learner, which means it may perform poorly on
small datasets. The key step in learning a DT is selecting the best attribute to split on.
To solve this problem, different trees use different metrics, such as information gain in
the ID3 algorithm and gain ratio in the C4.5 algorithm. Suppose a discrete attribute A has
n different values and D_i is the set of all samples in the training dataset D that take
the i-th value of A. The information gain and gain ratio for attribute A are then computed
from the entropy of D and of the subsets D_i.
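
In their standard textbook form, these quantities are given below. The notation Ent(D) for the entropy of D, p_c for the proportion of samples in D that belong to class c, and IV(A) for the intrinsic value of A are assumptions introduced here and do not appear in the text above.

```latex
\mathrm{Ent}(D) = -\sum_{c} p_c \log_2 p_c

\mathrm{Gain}(D, A) = \mathrm{Ent}(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Ent}(D_i)

\mathrm{IV}(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}

\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{IV}(A)}
```

Information gain favors attributes with many distinct values; dividing by IV(A) in the gain ratio compensates for that bias.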

3.3.3 Naïve Bayes:

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers"
based on applying Bayes' theorem with strong (naive) independence assumptions
between the features. They are among the simplest Bayesian network models,
but coupled with kernel density estimation, they can achieve high accuracy levels.

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in
the number of variables (features/predictors) in a learning problem. Maximum-
likelihood training can be done by evaluating a closed-form expression, which takes
linear time, rather than by expensive iterative approximation as used for many other
types of classifiers.
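
The closed-form training mentioned above amounts to counting. The sketch below assumes a multinomial naive Bayes text classifier with add-one (Laplace) smoothing and whitespace tokenization; the function names and the toy documents are illustrative and are not taken from the project.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Training is a single counting pass over the documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

documents = [
    "shocking miracle cure doctors hate",
    "celebrity secret shocking truth",
    "government publishes annual budget report",
    "city council approves budget",
]
labels = ["fake", "fake", "real", "real"]
model = train_naive_bayes(documents, labels)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.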
3.3.4 K Nearest Neighbor:

K-NN is a well-known algorithm in machine learning, and its procedure is very
simple. Given a test sample, it first finds the k nearest neighbors of this sample based on
a distance measure. It then predicts the class label of the test instance by majority
vote. The classification performance of K-NN is sometimes poor, mostly because of the
curse of dimensionality. K-NN is also a lazy learning algorithm, so it can spend a lot of
time on classification. The main steps of the K-NN algorithm are given below.
Algorithm:
1: for all unlabeled data u do
2:   for all labeled data v do
3:     compute the distance between u and v
4:   end for
5:   find the k smallest distances and locate the corresponding labeled instances v1, ..., vk
6:   assign u the label appearing most frequently among v1, ..., vk
7: end for
3.3.5 Support Vector Machine:

SVM is one of the most popular models for binary and multi-class classification
problems. It is a supervised machine learning classifier that many researchers have
adopted for such problems. In a binary classification problem, the instances are
separated by a hyperplane w^T x + b = 0, where w is the coefficient (weight) vector,
which is normal to the hyperplane, b is the bias term, which is the offset from the
origin, and x represents the data points.
Determining the values of w and b is the main task in SVM. In the linear case, w can be
solved for using a Lagrangian function. The data points that lie on the maximum-margin
boundary are called support vectors. As a result, w can be expressed as a weighted sum
of the support vectors.
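
In the standard linear formulation, the solution for w takes the form below. The symbols m (the number of training points), α_i (the Lagrange multipliers), and y_i ∈ {-1, +1} (the class labels) are assumptions introduced here and are not defined in the text above.

```latex
w = \sum_{i=1}^{m} \alpha_i \, y_i \, x_i
```

Only the support vectors have α_i > 0, so the remaining training points do not contribute to w.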

3.3.6 Convolution Neural Network:

CNN is a class of deep learning models, mainly used for analyzing visual
imagery. It is a regularized version of the multilayer perceptron that includes
convolution and pooling layers. CNN has also been widely used in
Natural Language Processing and has proven successful in text
classification, semantic analysis, machine translation, and other traditional NLP
tasks. Many experiments have shown that CNN models can achieve higher
precision than other traditional models.

Fig 3.1 Convolutional Neural Network

Word embedding models such as word2vec, GloVe, or FastText may be used to convert
sentences into sentence matrices. Convolutional filters of different window sizes are
applied to this input embedding layer to generate a new feature representation. A
pooling method is applied to the new features, and the pooled features from the
different filters are concatenated to form the hidden representation. This
representation is then followed by one or more fully connected layers that make the
final prediction.
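
The filter-then-pool idea can be sketched without any deep learning library. The example below is a simplified illustration: the two-dimensional "embeddings", the filter weights, and the function names are assumptions, and a trained model would learn the filter weights instead of fixing them by hand.

```python
def conv1d(sentence, kernel):
    """Slide a window of len(kernel) words over the sentence matrix.

    Each output value is the sum of element-wise products between the
    window and the kernel, passed through a ReLU.
    """
    window = len(kernel)
    features = []
    for start in range(len(sentence) - window + 1):
        total = 0.0
        for offset in range(window):
            word_vec = sentence[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        features.append(max(0.0, total))
    return features


def text_cnn_features(sentence, kernels):
    """Max-pool each filter's output over time and concatenate the results."""
    return [max(conv1d(sentence, kernel)) for kernel in kernels]


# Sentence matrix: 4 words, each a 2-dimensional embedding (toy values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel_a = [[1.0, 0.0], [0.0, 1.0]]              # window size 2
kernel_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # window size 3

print(conv1d(sentence, kernel_a))
print(text_cnn_features(sentence, [kernel_a, kernel_b]))
```

The concatenated vector has one entry per filter regardless of sentence length, which is what lets the fully connected layers accept documents of different sizes.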
3.3.7 Bidirectional Long Short-Term Memory:

Bi-LSTM networks are useful when we need to classify, process, or make predictions
based on time series data. The LSTM is designed to learn long-term dependencies. Like
all recurrent networks, it has the form of a chain of repeating modules. In a standard
RNN, the repeating module has a simple structure, such as a single tanh layer. The LSTM
cell state is structured like a conveyor belt that runs straight through the chain with
only minor interactions, so information can pass along it unchanged. Gates regulate the
information that is added to or removed from the cell state; they are denoted by sigma
in the given diagram. The Bi-LSTM has three gates to control the behavior of the cell
state.
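
A single LSTM step with its three gates can be written out for scalar inputs. This is a sketch of the standard LSTM update under simplifying assumptions (one-dimensional state, hand-set parameters in a dictionary named p); it is not the Keras layer used in the appendix.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for scalar input, hidden state and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # the cell state carries information forward
    h = o * math.tanh(c)     # the hidden state is a gated view of the cell state
    return h, c


# With all parameters at zero, every gate outputs 0.5 and the candidate is 0.
zero_params = {key: 0.0 for key in
               ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h, c)
```

Each gate is a sigmoid, so its output lies between 0 and 1 and acts as a soft switch on the value it multiplies.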

Fig 3.2 BI-LSTM
3.3.8 Bidirectional Recurrent Neural Network:

RNN is an algorithm designed for sequential data and is one of the most powerful and
robust neural network models. RNN models are quite popular and successful in NLP,
especially the Bi-LSTM, which mitigates the vanishing gradient problem so that it can
capture long-term dependencies. An RNN retains information about its previous inputs
through an internal memory, which makes it well suited to machine learning problems
that involve sequences. RNNs can encode sequential data and are generally suitable for
modeling short text semantics. The three connection weight matrices WIH, WHH and WOH
represent the weights corresponding to the input, hidden and output vectors
respectively.

Fig 3.3 BI-RNN

RNN maintains state information across time steps, which allows it to process
variable-length inputs and outputs. This matters for credibility analysis because news
articles vary in length. To assess whether a news article is real or not, each word is
treated as a token, and the resulting score of the previous state is fed as input to
the current state.
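
The way state is carried from one token to the next, in both directions, can be sketched with scalar weights. The names w_ih and w_hh mirror WIH and WHH above; the numeric values, the scalar state, and the function names are assumptions for illustration only.

```python
import math


def rnn_pass(inputs, w_ih, w_hh):
    """Run a simple RNN over the inputs and return the hidden state at each step."""
    h = 0.0
    states = []
    for x in inputs:
        # The previous state h feeds into the current state.
        h = math.tanh(w_ih * x + w_hh * h)
        states.append(h)
    return states


def bidirectional_pass(inputs, fwd, bwd):
    """Pair the left-to-right state with the right-to-left state at each position."""
    forward = rnn_pass(inputs, *fwd)
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


# A sequence of three token scores (toy values); any length works.
states = bidirectional_pass([1.0, 0.0, 0.0], fwd=(0.5, 0.8), bwd=(0.5, 0.8))
for position, (left_to_right, right_to_left) in enumerate(states):
    print(position, left_to_right, right_to_left)
```

At the last position the forward state still reflects the first token, while the backward state has seen nothing informative yet; combining the two gives every position context from both sides.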

CHAPTER 4

SYSTEM SPECIFICATIONS

4.1 SOFTWARE SPECIFICATION:

• Anaconda
• Python 3.8
• Spyder
• Streamlit

4.1.1 ANACONDA:
Anaconda is a free and open-source distribution of the Python and R programming
languages for data science and machine learning applications (large-scale data
processing, predictive analytics, scientific computing) that aims to simplify package
management and deployment. Package versions are managed by the package management
system conda.
Anaconda is a scientific Python distribution and has no IDE of its own. It bundles many
Python packages that are commonly used for scientific computing and data science, and
it provides a single download and installer that installs all of them in one go.
The alternative is to install Python and then install each required package
individually using pip. Anaconda additionally provides its own package manager (conda)
and package repository, while still allowing packages to be installed from PyPI using
pip when they are not available in the Anaconda repositories.
It is especially convenient on Microsoft Windows, where it can install packages that
would otherwise require C/C++ compilers and libraries when installed with pip. A
further advantage is that conda, in addition to being a package manager, is also a
virtual environment manager, allowing you to create independent development
environments and switch from one to the other (similar to virtualenv).

4.1.2 PYTHON 3.8:


Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built-in data structures, combined with dynamic typing and
binding, make it very attractive for Rapid Application Development, as well as for use
as a scripting or glue language to connect existing components together.
Python’s simple, easy to learn syntax emphasizes readability and therefore reduces the
cost of program maintenance. It supports modules and packages, which encourages
program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and
can be freely distributed.
Debugging Python programs is easy: a bug or bad input will never cause a segmentation
fault. Instead, when the interpreter discovers an error, it raises an exception. When
the program doesn’t catch the exception, the interpreter prints a stack trace. A source level
debugger allows inspection of local and global variables, evaluation of arbitrary
expressions, setting breakpoints, stepping through the code a line at a time, and so on.
Python 3.8 contains many small improvements over earlier versions, along with a few
larger changes. Some of the features are listed as follows:
• Assignment expressions: the biggest change in Python 3.8. They are written using a
new notation (:=). This operator is often called the walrus operator as it resembles
the eyes and tusks of a walrus on its side.
• Positional-only arguments: a function definition can mark parameters as
positional-only by placing a / after them. The built-in function float(), which
converts text strings and numbers to float objects, is an example of a function whose
argument can only be passed by position.
• More precise typing: Python’s typing system is quite mature at this point. However,
in Python 3.8, some new features have been added to typing to allow more precise
typing.
• Simpler debugging with f-strings: f-strings were introduced in Python 3.6 and have
become very popular. They might be the most common reason for Python libraries only
being supported on version 3.6 and later. Python 3.8 adds the = specifier, which prints
both an expression and its value.
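
Three of these features can be seen together in a short example. The function normalize and the sample URLs are made up for illustration; the syntax shown requires Python 3.8 or later.

```python
def normalize(text, /, *, lowercase=True):
    """`text` is positional-only (it precedes `/`); `lowercase` is keyword-only."""
    cleaned = text.strip()
    return cleaned.lower() if lowercase else cleaned


urls = ["  Example.com/News ", "", "site.org/story"]
kept = []
for raw in urls:
    # Assignment expression: bind the cleaned value and test it in one step.
    if (url := normalize(raw)):
        kept.append(url)

count = len(kept)
# The `=` specifier prints the expression text together with its value.
print(f"{count=}")
```

Calling normalize(text="x") raises a TypeError, because text cannot be passed by keyword.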

4.1.3 SPYDER:
Spyder is an open-source, cross-platform integrated development environment (IDE) for
scientific programming in the Python language. Spyder integrates with a number of
prominent packages in the scientific Python stack, including NumPy, SciPy, Matplotlib,
pandas, IPython, SymPy and Cython, as well as other open-source software.
Initially created and developed by Pierre Raybaut in 2009, since 2012 Spyder has been
maintained and continuously improved by a team of scientific Python developers and
the community.
Spyder is extensible with first-party and third-party plugins, includes support for
interactive tools for data inspection and embeds Python-specific code quality assurance
and introspection instruments, such as Pyflakes, Pylint and Rope. It is available cross-
platform through Anaconda, on Windows, on macOS and on major Linux distributions.
Spyder uses Qt for its GUI and is designed to use either of the PyQt or PySide Python
bindings. QtPy, a thin abstraction layer developed by the Spyder project and later
adopted by multiple other packages, provides the flexibility to use either backend.
Some of the features are:
• An editor with syntax highlighting, introspection, and code completion.
• Support for multiple IPython consoles.
• The ability to explore and edit variables from a GUI.
• A Help pane able to retrieve and render rich text documentation on functions, classes
and methods automatically or on-demand.
• A debugger linked for step-by-step execution.

4.1.4 STREAMLIT:
Streamlit is a free, open-source, all-Python framework that enables data scientists to
quickly build interactive dashboards and machine learning web apps with no front-end
web development experience required.
The platform uses Python scripting, APIs, widgets, instant deployment, team
collaboration tools, and application management solutions to help data scientists and
machine learning engineers create Python-based applications.
Applications created using Streamlit range from applications capable of real time object
detection, geographic data browsers, deep dream network debuggers, to face-GAN
explorers. Frameworks compatible with Streamlit include: Scikit Learn, Keras, Plotly,
PyTorch, NumPy, Seaborn, TensorFlow, Python, Matplotlib, and Pandas. The goal is to
shift the web app-building philosophy from starting with a layout and developing an
event model, to a Python script-esque top to bottom execution, data flow transformation
style that data scientists should be used to.
Thus, highly trained machine learning engineers that have a unique set of skills actually
end up spending an inordinate amount of their time building tools to understand the vast
amounts of data they have. Streamlit is trying to help them build these tools faster using
the kind of programming tools with which they are used to working.

4.2 HARDWARE SPECIFICATION:

• System : Pentium IV 2.4 GHz
• Hard Disk : 40 GB
• Monitor : 15 inch VGA Color
• Mouse : Logitech Mouse
• RAM : 512 MB

CHAPTER 5

SYSTEM DESIGN

5.1 ARCHITECTURE DIAGRAM

A system architecture diagram shows the relationship between different components. It
is the conceptual model that defines the structure, behavior, and other views of a
system. An architecture description is a formal description and representation of a
system, organized in a way that supports reasoning about the structures and behaviors
of the system. A system architecture can consist of system components and sub-systems
that work together to implement the overall system. After going through the above
process, we have successfully enabled the model to understand the features.

Fig 5.1 Architecture Diagram

5.2 UML DIAGRAM

UML, which stands for Unified Modeling Language, is a way to visually represent the
architecture, design, and implementation of complex software systems. An application
can contain thousands of lines of code, and it is difficult to keep track of the
relationships and hierarchies within a software system. UML diagrams divide that
software system into components and subcomponents. UML is simply another graphical
representation of a common semantic model. It provides a comprehensive notation for the
full lifecycle of object-oriented development. UML diagrams are categorized into
structural diagrams, behavioral diagrams, and interaction overview diagrams.

ADVANTAGES

1. To represent complete systems (instead of only the software portion) using
object-oriented concepts.
2. To establish an explicit coupling between concepts and executable code.
3. To take into account the scaling factors that are inherent to complex and critical
systems.
4. To create a modelling language usable by both humans and machines.

UML defines several models for representing systems:
• The class model captures the static structure.
• The state model expresses the dynamic behavior of objects.
• The use case model describes the requirements of the user.
• The interaction model represents the scenarios and message flows.
• The implementation model shows the work units.

5.2.1 USE CASE DIAGRAM:

A use case diagram at its simplest is a representation of a user's interaction with the
system, depicting the specifications of a use case. A use case diagram can portray the
different types of users of a system and the various ways that they interact with the
system.

Fig 5.2.1 Use Case Diagram

5.2.2 CLASS DIAGRAM:

A class diagram in the Unified Modeling Language (UML) is a type of static structure
diagram that describes the structure of a system by showing the system's classes,
their attributes, operations (or methods), and the relationships among objects.

The class diagram is the main building block of object-oriented modeling. It is used
for general conceptual modeling of the structure of the application, and for detailed
modeling that translates the models into programming code. Class diagrams can also be
used for data modeling. The classes in a class diagram represent both the main elements
and interactions in the application and the classes to be programmed.

Fig 5.2.2 Class Diagram

5.2.3 SEQUENCE DIAGRAM:

A sequence diagram depicts the interactions between objects in the order in which those
interactions take place. Sequence diagrams describe how, and in what order, the objects
in a system function. These diagrams are widely used by business professionals and
software developers to document and understand requirements for new and existing
systems.

Fig 5.2.3 Sequence Diagram

5.2.4 COLLABORATION DIAGRAM:

Collaboration diagrams (known as Communication Diagrams in UML 2.x) are used to show
how objects interact to perform the behavior of a particular use case, or a part of a
use case. Along with sequence diagrams, collaboration diagrams are used by designers to
define and clarify the roles of the objects that perform a particular flow of events of
a use case. They are the primary source of information used to determine class
responsibilities and interfaces.
Communication diagrams offer benefits similar to sequence diagrams, but they offer a
better understanding of how components communicate and interact with each other rather
than solely emphasizing the sequence of events. They can be a useful reference for
businesses, organizations, and engineers who need to visualize and understand the
physical communications within a program.

Fig 5.2.4 Collaboration Diagram

5.2.5 ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. Activity diagrams can be
used to describe the business and operational step-by-step workflows of components in a
system. An activity diagram consists of an initial node, an activity final node, and
the activities in between.

Fig 5.2.5 Activity Diagram

The activity diagram is another important diagram in UML for describing the dynamic
aspects of the system. An activity diagram is basically a flowchart that represents the
flow from one activity to another. An activity can be described as an operation of the
system, and the control flow is drawn from one operation to another. This flow can be
sequential, branched, or concurrent. Activity diagrams handle all types of flow control
by using different elements such as fork and join. The purpose is to capture the
dynamic behavior of the system. The other four diagrams show the message flow from one
object to another, whereas the activity diagram shows the flow from one activity to
another.

CHAPTER 6

SYSTEM IMPLEMENTATION

6.1 MODULES:

• Data Preprocessing
• Train BCNN
• Train Arima
• Ensemble Algorithms
• Reporting

6.2 MODULES DESCRIPTION:


A module is a separate unit of software or hardware. Characteristics of modular
components include portability and interoperability, which allow them to function in
another system alongside the components of other systems.

6.2.1 DATA PREPROCESSING:


Real-world news articles are noisy, and many of the collected URLs carry no useful
information. Before the text data are fed into the machine learning and deep learning
models, they are preprocessed using methods such as stop word removal, tokenization,
sentence segmentation, and punctuation removal. The processed data are then sent to the
train module.
These operations help select the most relevant terms and increase model performance. In
this system we import the necessary packages and read the data, then perform
tokenization. A cleaning and stemming step follows, in which the text is converted to
lower case and punctuation, special characters, extra whitespace, and English stop
words are removed. Count Vectorizer then extracts the frequency of words in the text
along with other features. Finally, a TF-IDF transformer is applied and the data are
fed into the classifiers for training.
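
The cleaning and weighting steps can be sketched using only the standard library. This is a simplified illustration, not the project's pipeline: the stop word list is a tiny hand-written stand-in, stemming is omitted, and the TF-IDF formula is the plain textbook form rather than the smoothed variant that scikit-learn applies.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop word list (assumption; the project uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}


def preprocess(text):
    """Lower-case, drop digits and punctuation, split, and remove stop words."""
    text = text.lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document (term frequency * log idf)."""
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["Aliens landed in Paris", "Paris is sunny"]
print(preprocess("The 2 Aliens landed in Paris!"))
print(tf_idf(docs))
```

A term that appears in every document, such as "paris" here, receives a weight of zero, which is how TF-IDF plays down words that do not help tell documents apart.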

Fig 6.1 Data Preprocessing

6.2.2 TRAIN BCNN:

After data preprocessing, the extracted features are fed into the train module. Two
training models are used, train_BCNN and train_Arima, both based on deep learning
techniques. In train_BCNN, once data cleaning has been applied to the data frame, we
obtain the total number of words and the number of unique words.

The clean words are then joined into a string, and the result is plotted as a word
cloud for both fake and real news. The data are split into train and test sets,
followed by tokenization and padding to a fixed document length. An Embedding layer is
added to create the word embeddings, followed by filters (bidirectional RNN and CNN).
Dense layers are then added, and the model is compiled with an optimizer, fitted on the
training data, and used to predict.
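
The tokenization and padding step can be illustrated with a small standard-library sketch. The helper names (fit_vocabulary, texts_to_sequences, pad_post) are hypothetical and only mimic the idea behind the Keras utilities used in the appendix; they are not a reimplementation of them.

```python
from collections import Counter


def fit_vocabulary(texts, num_words=None):
    """Map each word to an integer index, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: index for index, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, vocab):
    """Replace every known word with its index and skip unknown words."""
    return [[vocab[word] for word in text.split() if word in vocab] for text in texts]


def pad_post(sequences, maxlen):
    """Truncate long sequences and right-pad short ones with zeros."""
    padded = []
    for seq in sequences:
        clipped = seq[:maxlen]
        padded.append(clipped + [0] * (maxlen - len(clipped)))
    return padded


texts = ["fake news spreads fast", "news travels"]
vocab = fit_vocabulary(texts)
sequences = texts_to_sequences(texts, vocab)
print(pad_post(sequences, maxlen=3))
```

Every row of the padded output has the same length, which is what the Embedding layer expects as input.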

6.2.3 TRAIN ARIMA:

Deep recurrent networks suffer from the vanishing gradient problem: as the number of
layers increases, the neural network becomes increasingly hard to train. To deal with
this, Bidirectional LSTM and RNN algorithms are used to train the model. The LSTM gates
use a logistic activation function, so their outputs range from 0 to 1. LSTM is a class
of RNN widely used in the field of machine learning, and it uses a feedback connection
mechanism. The Bi-LSTM network combines the results of two different RNN layers.

One layer processes the sequence from left to right, while the other layer processes
the sequence from right to left. Here, after applying padding, a sequential model is
created. Word embedding, dropout, and dense layers are added, and the model is then
trained.

6.2.4 ENSEMBLE ALGORITHMS:

To improve the efficiency further, we use the K-Nearest Neighbor, Random Forest,
Decision Tree, Naive Bayes, and Support Vector Machine algorithms to analyze additional
features. Each of the five machine learning algorithms tends to be more accurate on a
specific set of extracted features, so combining them makes the system more reliable
and allows it to work on as many features as possible.

These machine learning algorithms complement the trained model. If the predicted value
is greater than 0.5, the news is considered real; otherwise it is considered fake. We
compute the accuracy from the prediction process, build the confusion matrix, and then
categorize the data according to the amount of fakeness or realness.
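
The thresholding and evaluation steps can be sketched as follows. The report does not state how the individual classifiers are combined, so the majority vote used here is an assumption, as are the function names and the probability values.

```python
def to_label(probability, threshold=0.5):
    """1 = real, 0 = fake, following the rule described above."""
    return 1 if probability > threshold else 0


def majority_vote(predictions_per_model):
    """Combine per-model label lists by simple majority (assumed combination rule)."""
    n_models = len(predictions_per_model)
    return [1 if sum(votes) * 2 > n_models else 0
            for votes in zip(*predictions_per_model)]


def confusion_matrix(y_true, y_pred):
    """Rows are true labels, columns are predicted labels (0 = fake, 1 = real)."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(y_true, y_pred):
        matrix[truth][predicted] += 1
    return matrix


# Predicted probabilities from three hypothetical classifiers for four URLs.
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.8, 0.4, 0.3, 0.7]
model_c = [0.7, 0.1, 0.55, 0.2]
labels = [[to_label(p) for p in model] for model in (model_a, model_b, model_c)]

y_pred = majority_vote(labels)
y_true = [1, 0, 0, 0]
cm = confusion_matrix(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(y_pred, cm, accuracy)
```

The off-diagonal entries of the matrix show which kind of mistake the system makes: here one fake item is labeled real and no real item is labeled fake.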

6.2.5 REPORTING:

Once prediction is done, any URL found to be fake is passed to the reporting module.
Here, the admin gathers all the fake URLs identified from the dataset and mails the
report to the cybercrime authorities.
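
Building the report message can be sketched with the standard email package. The function name build_report, the subject line, and the addresses are placeholders chosen for this example; sending the message over SMTP is left out so that the sketch runs without a mail server.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report(sender, receiver, fake_urls):
    """Assemble a plain-text email listing the URLs flagged as fake."""
    message = MIMEMultipart()
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = f"Fake news report: {len(fake_urls)} URL(s)"
    body = "\n".join(fake_urls)
    message.attach(MIMEText(body, "plain"))
    return message


# Placeholder addresses and URL, invented for illustration.
report = build_report(
    "admin@example.com",
    "cybercrime@example.com",
    ["http://fake-site.example/story-1"],
)
print(report["Subject"])
```

The returned object can be serialized with as_string() and handed to an smtplib session, which is the pattern the reporting listing in the appendix follows.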

CHAPTER 7

TESTING

7.1 INTRODUCTION

Testing is a process used to help identify the correctness, completeness and quality of
developed computer software. With that in mind, testing can never completely establish
the correctness of computer software.
There are many approaches to testing, but effective testing of complex products is
essentially a process of investigation, not merely a matter of creating and following
rote procedure. One definition of testing is "the process of questioning a product in
order to evaluate it", where the "questions" are things the tester tries to do with the
product, and the product answers with its behavior in reaction to the probing of the
tester. Although most of the intellectual processes of testing are nearly identical to
those of review or inspection, the word testing is used to mean the dynamic analysis of
the product, that is, putting the product through its paces.

7.2 TESTING OBJECTIVES


Testing objectives include:

• Testing is a process of executing a program with the intent of finding an error.
• A good test case is one that has a high probability of finding an as yet undiscovered
error.
• A successful test is one that uncovers an as yet undiscovered error.
• To reduce the level of risk of insufficient software quality.
• To provide sufficient information to stakeholders to allow them to make informed
decisions, especially regarding the level of quality of the test object.
• To verify the fulfillment of all specified requirements.

7.3 TYPES OF TESTING:

7.3.1 UNIT TESTING:

Unit testing is a level of software testing where individual units/components of a
software are tested. The purpose is to validate that each unit of the software performs
as designed. A unit is the smallest testable part of any software. It usually has one
or a few inputs and usually a single output.
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be
conducted as two distinct phases.
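
A unit in this sense can be as small as one function that validates an entry. The sketch below shows what such tests might look like; normalize_url and add_entry are hypothetical helpers written for this example and are not the project's code.

```python
def normalize_url(url):
    """Return the URL with a scheme, or raise ValueError if it is malformed."""
    url = url.strip()
    if not url or " " in url:
        raise ValueError(f"not a valid URL: {url!r}")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url


def add_entry(dataset, url, label):
    """Append a row unless the URL is already stored. Return True if it was added."""
    url = normalize_url(url)
    if any(row["url"] == url for row in dataset):
        return False
    dataset.append({"url": url, "label": label})
    return True


def test_entry_format():
    assert normalize_url(" example.com/story ") == "http://example.com/story"
    assert normalize_url("https://example.com") == "https://example.com"


def test_rejects_malformed_entry():
    try:
        normalize_url("not a url")
    except ValueError:
        return
    raise AssertionError("malformed entry was accepted")


def test_no_duplicates():
    dataset = []
    assert add_entry(dataset, "example.com/story", 1) is True
    assert add_entry(dataset, "http://example.com/story", 1) is False
    assert len(dataset) == 1


for test in (test_entry_format, test_rejects_malformed_entry, test_no_duplicates):
    test()
print("all unit tests passed")
```

Each test exercises one behavior with one or a few inputs and a single expected outcome, which keeps a failure easy to trace back to the unit that caused it.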
Test strategy and approach:
Field testing will be performed manually and functional tests will be written in detail.
Test objectives:
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
Features to be tested:
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.

7.3.2 INTEGRATION TESTING

Integration testing (sometimes called integration and testing) is the phase in software
testing in which individual software modules are combined and tested as a group.
Integration testing is conducted to evaluate the compliance of a system or component
with specified functional requirements. It occurs after unit testing and before validation
testing.

4
Integration testing takes as its input modules that have been unit tested, groups them in

4
larger aggregates, applies tests defined in anintegration test plan to those aggregates,
and delivers as its output the integrated system ready for system testing.
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by
interface defects.
The task of the integration test is to check that components or software applications,
e.g., components in a software system or – one step up – software applications at the
company level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.

CHAPTER 8

CONCLUSION AND FUTURE WORK

8.1 CONCLUSION

The proposed model estimates the realness and fakeness of a news item more accurately
than currently available systems, because it overcomes the limitations of existing
systems by using many classifiers to extract different features. The system is easy to
use: the user simply pastes the suspicious URL, and reporting is straightforward. It
works on real-time data, whereas other systems work only on the datasets they were
trained on. The software is trained dynamically each time the user provides a URL to
check. The software was implemented and its output studied. The achieved results
demonstrate the suitability of the proposed system for classifying news as real or
fake.

8.2 FUTURE ENHANCEMENT

In future this system can be expanded by:

• Designing it to work for different languages. The current system works only for
English, but it can also be trained on other languages, making it multilingual.
• Adding a feature to track the IP address of the person posting fake blogs or posts on
the internet, which would help trace the origin of the fake news.
• Trying additional machine learning and deep learning models for fake news link
detection.

APPENDIX -1

SOURCE CODE

DATA PREPROCESSING:

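import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer, word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Load the real and fake news datasets and stack them into one frame.
df_real = pd.read_csv('real_news_content_URL.csv')
df_real.shape
df_fake = pd.read_csv('fake_news_content_URL.csv')
df_fake.shape
df = pd.concat([df_real, df_fake], axis=0)
df.shape
# The label is the prefix of the id, before the first underscore.
df['news_type'] = df['id'].apply(lambda x: x.split('_')[0])
print(df.head(2))
df.shape
df.info()
df.describe()
# Drop the columns that are not used as features.
df.drop(['id', 'url', 'authors', 'publish_date', 'canonical_link', 'meta_data'],
        axis=1, inplace=True)
# Count missing values, then express them as a percentage of the rows.
df.isnull().sum()
(df.isnull().sum()) / (df.shape[0]) * 100
# Replace the media columns with 0/1 flags.
df['contain_movies'] = df['movies'].apply(lambda x: 0 if str(x) == 'nan' else 1)
df['contain_images'] = df['images'].apply(lambda x: 0 if str(x) == 'nan' else 1)
df.drop(['movies', 'images'], axis=1, inplace=True)
print(df.head(2))
# Collect the sources that appear under both the Fake and the Real label.
new = []
for x in df[df['news_type'] == 'Fake']['source'].unique():
    if x in df[df['news_type'] == 'Real']['source'].unique():
        new.append(x)
print(new)
df['common'] = df['source'].apply(lambda x: x if x in new else 0)
df1 = df[df['common'] != 0]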
ps = PorterStemmer()
wst = WhitespaceTokenizer()

def lower_func(x):
    return x.lower()

def remove_number_func(x):
    # Drop every digit character.
    return ''.join(a for a in x if not a.isdigit())

def remove_punc_func(x):
    return ''.join(a for a in x if a not in string.punctuation)

def remove_spec_char_func(x):
    # Keep only letters, digits and spaces.
    return ''.join(a for a in x if a.isalnum() or a == ' ')

def remove_stopwords(x):
    # Build the stop word set once per call instead of once per word.
    stop = set(stopwords.words('english'))
    return ' '.join(a for a in x.split() if a not in stop)

def stem_func(x):
    wordlist = word_tokenize(x)
    psstem = [ps.stem(a) for a in wordlist]
    return ' '.join(psstem)

def remove_whitespace_func(x):
    # Split on whitespace; the token list is what the vectorizer consumes.
    return wst.tokenize(x)

def compose(f, g):
    return lambda x: f(g(x))

# Applied right to left: lower-case, strip digits, punctuation and special
# characters, remove stop words, stem, then tokenize.
final = compose(compose(compose(compose(compose(compose(
    remove_whitespace_func, stem_func), remove_stopwords),
    remove_spec_char_func), remove_punc_func), remove_number_func), lower_func)
# Bag-of-words matrix for the fake news titles.
df_fake = df[df['news_type'] == 'Fake']
cv1 = CountVectorizer(analyzer=final)
cv1.fit(df_fake['title'])
bow1 = cv1.transform(df_fake['title'])
pd.DataFrame(bow1.todense()).shape
# Train and test split on the article text.
X1 = df['text']
y1 = df['news_type']
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y1, test_size=0.3, random_state=42)
# Word counts, then TF-IDF weights, then a random forest classifier.
pp = Pipeline([
    ('bow', CountVectorizer(analyzer=final)),
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier()),
])
pp.fit(X1_train, y1_train)

TRAIN BCNN:

import numpy as np
from keras import regularizers
from keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

# df, total_words and maxlen come from the data cleaning step, which is not
# part of this listing.
x_train, x_test, y_train, y_test = train_test_split(df.clean_joined, df.isfake, test_size=0.2)
# Map each word to an integer index, using the training split only.
tokenizer = Tokenizer(num_words=total_words)
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)
len(train_sequences)
len(test_sequences)
# Show the first training document next to its encoding.
print("The encoding for document\n", x_train.iloc[0], "\n is : ", train_sequences[0])
# Pad or truncate every sequence to maxlen tokens, the same way for both splits.
padded_train = pad_sequences(train_sequences, maxlen=maxlen, padding='post', truncating='post')
padded_test = pad_sequences(test_sequences, maxlen=maxlen, padding='post', truncating='post')
for i, doc in enumerate(padded_train[:2]):
    print("The padded encoding for document", i + 1, " is : ", doc)
filters = 32
kernel_size = 5
model = Sequential()
model.add(Embedding(total_words, output_dim=128))
model.add(Conv1D(filters, kernel_size, padding='valid', activation='elu',
                 activity_regularizer=regularizers.l2(0.01)))
# Pool over the sequence so the dense layers receive one vector per document.
# This layer does not appear in the extracted listing and is assumed here.
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
y_train = np.asarray(y_train)
model.fit(padded_train, y_train, batch_size=64, validation_split=0.1, epochs=2)
pred = model.predict(padded_test)

TRAIN ARIMA:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from keras.layers import LSTM, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# df, total_words and maxlen come from the data cleaning step, which is not
# part of this listing.
x_train, x_test, y_train, y_test = train_test_split(df.clean_joined, df.isfake, test_size=0.2)
tokenizer = Tokenizer(num_words=total_words)
tokenizer.fit_on_texts(x_train)
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)
len(train_sequences)
len(test_sequences)
# Show the first training document next to its encoding.
print("The encoding for document\n", x_train.iloc[0], "\n is : ", train_sequences[0])
# Pad or truncate every sequence to maxlen tokens, the same way for both splits.
padded_train = pad_sequences(train_sequences, maxlen=maxlen, padding='post', truncating='post')
padded_test = pad_sequences(test_sequences, maxlen=maxlen, padding='post', truncating='post')
for i, doc in enumerate(padded_train[:2]):
    print("The padded encoding for document", i + 1, " is : ", doc)
# Embedding, one bidirectional LSTM layer, then two dense layers.
model = Sequential()
model.add(Embedding(total_words, output_dim=128))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
y_train = np.asarray(y_train)
model.fit(padded_train, y_train, batch_size=64, validation_split=0.1, epochs=2)
pred = model.predict(padded_test)
# Threshold the sigmoid output at 0.5 to obtain a class label.
prediction = [1 if p.item() > 0.5 else 0 for p in pred]
accuracy = accuracy_score(list(y_test), prediction)
print("Model Accuracy : ", accuracy)
cm = confusion_matrix(list(y_test), prediction)
plt.figure(figsize=(25, 25))
sns.heatmap(cm, annot=True)
category = {0: 'Fake News', 1: 'Real News'}

PREDICTION:

def main():
    # url, html, news_url, radio_label, dataset_button and get_data are defined
    # elsewhere in the Streamlit app and are not part of this listing.
    vectorizer_lr, vectorizer_nn, loaded_model_lr, loaded_model_nn = download_model()
    glove = download_files(glove_vect_size=300)
    input_X, input_X_nn = featurize_data_pair(
        url, html, vectorizer_lr, vectorizer_nn, glove, glove_vect_size=300)
    # Prediction and class probabilities from the LR model.
    y_output = loaded_model_lr.predict(input_X)[0]
    probs = loaded_model_lr.predict_proba(input_X)
    # Prediction from the neural network model.
    output_nn = loaded_model_nn.predict(np.array(input_X_nn))
    prediction = loaded_model_nn.predict_classes(np.array(input_X_nn))
    label = 0
    if dataset_button:
        if not news_url.isspace():
            # Store the URL together with the label chosen by the user.
            if radio_label == 'yes':
                label = 1
                get_data().append({'url': news_url, 'label': label})
            elif radio_label == 'no':
                label = 0
                get_data().append({'url': news_url, 'label': label})
        else:
            st.header('Please enter a URL.')
        dataframe = pd.DataFrame(get_data())
        with st.container():
            st.write(dataframe)
def get_data_pair(url):
    # Make sure the URL carries a scheme.
    if not url.startswith('http'):
        url = 'http://' + url
    # Scheme-free form of the URL. No return value appears in the listing.
    url_pretty = url
    if url_pretty.startswith('http://'):
        url_pretty = url_pretty[7:]
    if url_pretty.startswith('https://'):
        url_pretty = url_pretty[8:]

def get_description_from_html(html):
    # bs is assumed to be the usual alias for bs4.BeautifulSoup.
    soup = bs(html)
    description_tag = (soup.find('meta', attrs={'name': 'og:description'})
                       or soup.find('meta', attrs={'property': 'description'})
                       or soup.find('meta', attrs={'name': 'description'}))
    if description_tag:
        description = description_tag.get('content') or ''
    else:  # If there is no description, return empty string.
        description = ''
    return description
def get_descriptions_from_data(data):
    # Collect the description of every website in data, in order.
    descriptions = []
    for site in tqdm(data):
        url, html, label = site
        descriptions.append(get_description_from_html(html))
    return descriptions

def combine_features(X_list):
    # Concatenate the feature blocks column-wise.
    return np.concatenate(X_list, axis=1)

def dict_to_features(features_dict):
    # Turn a dict of feature values into a single-row float matrix.
    X = np.array(list(features_dict.values())).astype('float')
    X = X[np.newaxis, :]
    return X
def featurize_data_pair(url, html,vectorizer, vectorizer_nn, glove, glove_vect_size):
# domain check and keywords count features
keyword_X = dict_to_features(keyword_featurizer(url, html))
# bag of words
description = get_description_from_html(html)
bow_X = vectorize_data_descriptions([description],vectorizer)
bow_X_nn = vectorize_data_descriptions([description],vectorizer_nn)
# glove
glove_X = glove_transform_data_descriptions([description], glove, glove_vect_size)
X = combine_features([keyword_X, bow_X, glove_X])
X_nn = combine_features([keyword_X, bow_X_nn, glove_X])
return X, X_nn
def download_model():
with open("description.pkl", 'rb') as file:
ref_desc_pickle = pickle.load(file)
with open("LR_model.pkl", 'rb') as file:
loaded_model = pickle.load(file)
def download_files(glove_vect_size):
    VEC_SIZE = glove_vect_size
    glove = GloVe(name='6B', dim=VEC_SIZE)
    return glove
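
The description lookup in get_description_from_html can also be tried without BeautifulSoup. The sketch below is not the project's code: it swaps in the standard-library html.parser module, and the names MetaDescriptionParser and extract_description are hypothetical. It assumes the same priority as the listing above, that is, the Open Graph description first and the plain description tag second.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of description-style <meta> tags (hypothetical helper)."""

    # Keys checked in priority order: Open Graph first, then the plain tag.
    WANTED = ("og:description", "description")

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Open Graph tags normally use 'property'; the plain tag uses 'name'.
        key = attr_map.get("property") or attr_map.get("name")
        if key in self.WANTED and key not in self.found:
            self.found[key] = attr_map.get("content") or ""


def extract_description(html):
    """Return the best available description, or '' if the page has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    for key in MetaDescriptionParser.WANTED:
        if parser.found.get(key):
            return parser.found[key]
    return ""


page = ('<html><head>'
        '<meta name="description" content="Plain summary">'
        '<meta property="og:description" content="Open Graph summary">'
        '</head></html>')
print(extract_description(page))
```

A page without any description tag yields an empty string, which matches the fallback behaviour of the BeautifulSoup version.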

REPORTING:

import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders

# The mail addresses and password
sender_address = '[email protected]'
sender_pass = 'RPondy###0000'
receiver_address = '[email protected]'
# Set up the MIME message (the subject and body are not shown in this listing)
message = MIMEMultipart()
message['From'] = sender_address
message['To'] = receiver_address
# Create SMTP session for sending the mail
session = smtplib.SMTP('smtp.gmail.com', 587)  # use Gmail with port 587
session.starttls()  # enable security
session.login(sender_address, sender_pass)  # log in with mail ID and password
text = message.as_string()
session.sendmail(sender_address, receiver_address, text)
session.quit()
print('Mail Sent')
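
The reporting listing calls message.as_string() but does not show how the message is built. The following is a minimal sketch of one way to build it with the standard email.mime classes, without sending anything. The function name build_report_message, the subject line, the body layout and the example.com addresses are all assumptions made for illustration and do not come from the project.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report_message(sender, receiver, news_url, label):
    """Build a plain-text report mail for one classified URL (hypothetical helper)."""
    message = MIMEMultipart()
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = "Fake news report"  # assumed subject, not from the project
    # Assumed body layout: the reported URL and the predicted label.
    body = f"URL: {news_url}\nPredicted label: {label}\n"
    message.attach(MIMEText(body, "plain"))
    return message


# Placeholder addresses and URL, used only to show the call.
msg = build_report_message(
    "sender@example.com",
    "receiver@example.com",
    "http://example.com/story",
    "fake",
)
text = msg.as_string()  # the string that would be passed to session.sendmail(...)
print(text)
```

Keeping message construction separate from the SMTP session makes it possible to inspect the mail text before any network call is made.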

APPENDIX -

SNAPSHOTS

Fig 9.1: Fake news link

Fig 9.2: Realness and Fakeness of link

Fig 9.3: Adding to dataset

Fig 9.4: Real news link

Fig 9.5: Realness and Fakeness of link

Fig 9.6: Adding to dataset

Fig 9.7: Mail sent authentication

Fig 9.8: Mail

REFERENCES

[1] T Jiang, J Li, A Haq, A Saboor, A Ali, “A Novel Stacking Approach for Accurate
Detection of Fake News”, IEEE 2021.
[2] F. T. Asr and M. Taboada, “Misinfotext: a collection of news articles, with false and
true labels,” 2019.
[3] H. Jwa, D. Oh, K. Park, J. M. Kang, and H. Lim, “exBAKE: automatic fake news
detection model based on bidirectional encoder representations from transformers
(BERT),” Applied Sciences, vol. 9, no. 19, 2019.
[4] Cardoso Durier da Silva, F., Vieira, R., & Garcia, A. C., “Can machines learn to
detect fake news? A survey focused on social media”, in Proceedings of the 52nd
Hawaii International Conference on System Sciences, January 2019.
[5] A. Jain, A. Shakya, H. Khatter, and A. K. Gupta, “A smart system for fake news
detection using machine learning,” in Proc. Int. Conf. Issues Challenges Intell.
Comput. Techn. (ICICT), vol. 1, Sep. 2019.
[6] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapè, “MIMETIC: Mobile encrypted
traffic classification using multimodal deep learning,” Comput. Netw., vol. 165, Dec.
2019.
[7] R. K. Kaliyar, A. Goswami, and P. Narang, ‘‘Multiclass fake news detection using
ensemble machine learning,’’ in Proc. IEEE 9th Int. Conf. Adv. Comput. (IACC),
Dec. 2019.
[8] L. Borges, B. Martins, and P. Calado, ‘‘Combining similarity features and deep
representation learning for stance detection in the context of checking fake news,’’ J.
Data Inf. Qual., vol. 11, no. 3, pp. 1–26, Jul. 2019.
[9] M. H. Goldani, S. Momtazi, and R. Safabakhsh, ‘‘Detecting fake news with capsule
neural networks,’’ 2020.
[10] M. Umer, Z. Imtiaz, S. Ullah, A. Mehmood, G. S. Choi, and B.-W. On, “Fake
news stance detection using deep learning architecture (CNN & LSTM),” IEEE
Access, vol. 8, pp. 156695–156706, 2020.
[11] S. Kumar, R. Asthana, S. Upadhyay, N. Upreti, and M. Akbar, “Fake news
detection using deep learning models: A novel approach,” Trans. Emerg.
Telecommun. Technol., vol. 31, no. 2, p. e3767, Feb. 2020.
[12] X. Zhang and A. A. Ghorbani, ‘‘An overview of online fake news:
Characterization, detection, and discussion,’’ Inf. Process. Manage., vol. 57, no. 2,
Mar. 2020.
[13] J. Zhang, B. Dong, and P. S. Yu, “FakeDetector: Effective fake news
detection with deep diffusive neural network,” in Proc. IEEE 36th International
Conference on Data Engineering (ICDE), 2020.
[14] K. Ludwig and M. Creation, “Dissemination and uptake of fake-quotes in lay
political discourse on Facebook and Twitter,” J. Pragmat., vol. 157, pp. 101–118, 2020.
[15] R. K. Kaliyar, A. Goswami, P. Narang, and S. Sinha, ‘‘FNDNet—A deep
convolutional neural network for fake news detection,’’ Cognit. Syst. Res., vol. 61,
pp. 32–44, Jun. 2020.
[16] A. U. Haq, J. P. Li, M. H. Memon, J. Khan, A. Malik, T. Ahmad, A. Ali, S. Nazir, I.
Ahad, and M. Shahid, ‘‘Feature selection based on L1-norm support vector machine
and effective recognition system for Parkinson’s disease using voice recordings,’’
IEEE Access, vol. 7, pp. 37718–37734, 2019.
[17] J. C. S. Reis, A. Correia, F. Murai, A. Veloso, F. Benevenuto, and E. Cambria,
‘‘Supervised learning for fake news detection,’’ IEEE Intell. Syst., vol. 34, no. 2, pp.
76–81, Mar. 2019.
[18] Xinyi Zhou and Reza Zafarani, “A Survey of Fake News: Fundamental Theories,
Detection Methods, and Opportunities”, ACM Comput. Surv. 1, 1, Article 1, (2020).

TECHNICAL PROJECT OUTCOMES

After successful completion of the project work, the student will be able to:

TP01: Analyze, design and implement projects with a comprehensive, systematic and
ethical approach.
TP02: Apply modern tools to execute and integrate modules in the project.
TP03: Apply techniques for societal, health care, and real-time sustainable research
projects.
TP04: Develop communication skills through technical presentation activities.
TP05: Contribute as a team member and lead the team in managing technical
projects.

Mapping of Technical Project Outcomes with the Project titled
“TRACKING AND TRACING OF FAKE NEWS USING URL”.

Project title                                  TP01  TP02  TP03  TP04  TP05
TRACKING AND TRACING OF FAKE NEWS USING URL    3     3     3     3     3

(Indicate as 1 - Less than 30%; 2 - 30.1 - 60%; 3 - Above 60.1%)
