Veeresh Internship Report
AN INTERNSHIP REPORT ON
“FAKE NEWS PREDICTION USING
MACHINE LEARNING”
CERTIFICATE
Signature of Guide
Mrs. Anitha J
Associate Professor,
Department of Master of Computer Application,
Dr. AMBEDKAR INSTITUTE OF TECHNOLOGY
Signature of Student
VEERESH A C
(1DA22MC051)
Fake News Detector 5
ACKNOWLEDGEMENT
I also thank all the faculty members and my friends for their help and
encouragement. I would like to thank my parents and family members, who
supported and helped me in completing the internship.
Place: VEERESH A
Date: (1DA22MC051)
Abstract
With the recent social media boom, the spread of fake news has become a serious
concern for everyone. It has been used to manipulate public opinion, influence
elections (most notably the US Presidential Election of 2016) and incite hatred and
violence, such as the genocide of the Rohingya population. A 2018 MIT study found that
fake news spreads six times faster on Twitter than real news. Credibility of and
trust in the news media are at an all-time low. It is becoming increasingly difficult
to determine which news is real and which is fake. Various machine learning
methods have been used to separate real news from fake news. In this study, we
tried to accomplish that using a Passive Aggressive Classifier, LSTM and natural
language processing. There are lots of machine learning models but these two
There is still some uncertainty about how correct the predictions are, but the work
definitely opens a window for further research. It has to be kept in mind that
fake news detection is not only a simple web interface but also a complex task
that involves a lot of backend work.
Table of Contents
Declaration of Authorship
Acknowledgement
Abstract
Table of Contents
1. Introduction
2. Problem Statement
3. Motivation
4. Background Study
5. Feasibility Study
6. Methodology
6.1 The Dataset
6.2 The Machine Learning Model
6.3 The Web Interface
6.4 Common Platform: Flask
7. Implementation
7.1 The Interface
7.2 The ML Model
7.3 Flask Code
7.4 Web Interface
8. Key Insights
9. Conclusion
10. Future Work
11. References
1. Introduction
Fake news is untrue information presented as news. It often has the aim of
generating revenue. Once common in print, fake news has become more prevalent with
the rise of social media, especially the Facebook News Feed. During the 2016 US
presidential election, various kinds of fake news about the candidates spread widely
on online social networks, which may have had a significant effect on the
election. Online social networks accounted for more than 41.8% of the fake news data
traffic in the election, which is much greater than the data traffic shares of
traditional TV/radio/print media and online search engines. Fake news detection is
becoming increasingly difficult because people with ill intentions write fake
pieces so convincingly that they are difficult to separate from real news. What we have
done is a simplistic approach that looks at the news headlines and tries to predict
whether they are fake.
Fake news can be intimidating because it attracts a larger audience than normal news.
People use it because it can be a very good marketing strategy. But the money
earned does not justify the harm it can do to people.
2. Problem Statement
In this day and age, it is extremely difficult to decide whether the news we come
across is real or not. There are very few options for checking authenticity, and all of
them are sophisticated and not accessible to the average person. There is an acute
need for a web-based fact-checking platform that harnesses the power of Machine
Learning.
3. Motivation
Social media facilitates the creation and sharing of information through computer-
mediated technologies. It has changed the way groups of people interact and
receive information. Today, the majority of people search for and consume news on
social media rather than from traditional news organizations. While social media
has become a powerful source of information and a way of bringing people
together, it has also had a negative impact on society. Look at some
examples: de-contextualized videos and audio jokes were used for campaigning. This kind
of content went viral on digital platforms without any monitoring of its origin or reach.
A nationwide block on major social media and messaging sites, including Facebook
and Instagram, was imposed in Sri Lanka after multiple terrorist attacks in
2019. The government claimed that “false news reports” were circulating online.
This is evident in the challenges the world's most powerful tech companies face in
reducing the spread of misinformation. Such examples show that Social Media
Millions of news articles are circulated every day on the Internet. How can one
tell which is real and which is fake? Unreliable or fake news is thus one of
the biggest challenges in our digitally connected world. Fake news detection on
social media has recently become an emerging research domain. The domain
focuses on the sensitive issue of preventing the spread of fake news
on social media. Fake news identification on social media faces several challenges.
fake news manually. Since they are intentionally written to mislead readers, it is
emerging and time-bound news as they are not sufficient to train the application
features and develop authentic information dissemination systems are some useful
domains of research and need further investigation. If we cannot control the spread
of fake news, trust in the system will collapse. There will be widespread
distrust among people, and nothing left that can be relied on objectively. That
means the destruction of political and social cohesion. We wanted to build a
web-based system that can fight this nightmare scenario. And we made
4. Background Study
tweet in different situations. [2] used LSTM in a similar problem of early rumour
detection. In another work, [3] aimed at detecting the stance of tweets and
determining the veracity of the given rumour with convolutional neural networks. A
submission [4] to the SemEval 2016 Twitter Stance Detection task focuses on
boosted decision tree. Though this work seems to be similar to our work, the
attempt, a team [6] concatenated various feature vectors and passed them through
method and robust to noise. It can be used in fake news detection [16] Term
numerical statistic that shows how important a word is to news in a news dataset.
The importance of a word is proportional to the number of times the word appears
in the news (fake and real) but inversely proportional to the number of times the
5. Feasibility Study
detection. Bi-directional LSTM was used in [7] to detect fake news. It had
reasonably good accuracy but if the news was a bit more sophisticated, it would be
sensational/clickbaity words as part of fake news. For example, if a news title says,
‘Donald Trump is the greatest president ever’, the model will pick it up as fake
news with reasonable accuracy. If the title is more nuanced and written in a
sophisticated way, it would be difficult to do so. We believe that our LSTM model is not
enough by itself to detect fake news. That is why we included a Passive Aggressive
Classifier with it and when we compared passive
news with reputable news sources, but the scope of the work is so vast that we could
not do it with the resources available to us. Our model can act as a first step in
detecting fake news.
6. Methodology
Figure 1 : Dataset
The dataset is simple. It contains the title of each news item, the body text and a
label field, which shows REAL if the news is authentic and FAKE if it is not.
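As a minimal sketch of that layout, the snippet below parses a tiny in-memory CSV with the same three fields and counts the labels. The two sample rows and the column name `title` are illustrative assumptions, not records or names taken from the actual dataset; `text` and `label` follow the column names used in the code later in this report.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample; the real dataset is loaded from a CSV file.
SAMPLE_CSV = """title,text,label
"Council approves new budget","The city council voted on Tuesday to approve the budget.",REAL
"Aliens endorse candidate","Sources say visitors from space have picked a side.",FAKE
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def label_counts(rows):
    """Count how many rows carry each label (REAL or FAKE)."""
    return Counter(row["label"] for row in rows)

rows = load_rows(SAMPLE_CSV)
counts = label_counts(rows)
print(counts["REAL"], counts["FAKE"])
```

Checking the label balance like this before training is a cheap way to notice a skewed dataset.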
◦ The common platform that brings the model and the interface together.
There are two parts to building the ML model. Machine learning can help us make
predictions, and we use two types of model in this case. For
the first part, we used a passive-aggressive classifier. The steps include:
1. Data Loading: We load a CSV file for the data sorting and the training and
testing of the model. The CSV file is turned into an array to make it easier
to work with.
used often.
change in the model. But if there is any kind of change in the model, that is
if the prediction is not correct then the aggressive part is called, which
changes the model accordingly. The aggressive part of the model changes
4. Model Building: The model is built through the train and test of the dataset,
by ensuring that the training is done for 80% of the dataset and testing is
1. Loading the data: For this step, it is the same as the passive-aggressive
one.
2. Scanning and parsing. Data is loaded from a CSV file. This consists of
the body of selected news articles. It then contains a label field that indicates
whether the news is real or fake. In this code block, we scan the CSV and
out infrequent words. This allows us to generate sequences for our training
is used to extract the semantic information from the words in each title.
5. Model Building: Building the model and finding out the accuracy via
Dropout, and Dense layers. We are going to run the data on 20 epochs.
1. HTML for building the basic skeleton: HTML makes the structure of the
web application and also there are some of the functions that can be
2. CSS for design: The CSS part is for designing only. Because it will give a
This acts as a common platform and takes the input with the pickle module and
passes it to the machine learning model afterwards the prediction is shown on the
3. Using the Pickle module for serializing and de-serializing the dataset.
4. Providing output.
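The serializing and de-serializing step can be sketched with the standard pickle module as follows. The dictionary of word weights and the `predict` helper are hypothetical stand-ins for a trained model, used only so that the example runs without any external library.

```python
import pickle

# Hypothetical "model": per-word weights standing in for a trained classifier.
model = {"weights": {"shocking": 1.5, "aliens": 2.0, "council": -1.0}, "bias": -0.5}

def predict(model, text):
    """Sum the weights of the words in the text and threshold at zero."""
    score = model["bias"] + sum(
        model["weights"].get(word, 0.0) for word in text.lower().split()
    )
    return "FAKE" if score > 0 else "REAL"

blob = pickle.dumps(model)      # serialize to bytes (pickle.dump writes to a file instead)
restored = pickle.loads(blob)   # de-serialize back into an equivalent object

print(predict(restored, "Shocking aliens land"))
print(predict(restored, "Council approves budget"))
```

Saving the model once and loading it in the web application avoids retraining on every start. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.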
7. Implementation
This is what you see when you go to the web interface. You are supposed to copy
When you paste the news into the input box and click ‘Predict’, the model will give
you the result. If the news seems authentic, the output will be ‘Looking Real
News’. Otherwise, it will show ‘Looking Fake News’. That’s how you can detect
Term Frequency is the ratio of the number of times a particular word appears to
the total number of words. Inverse Document Frequency is the weight
of a rare word: the fewer documents a word appears in, the higher its weight.
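To make these two quantities concrete, here is a small worked sketch using the plain textbook definitions, TF = count / total words and IDF = log(N / document frequency). The three sample sentences are invented for illustration. scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from the ones computed here.

```python
import math

def term_frequency(word, document):
    """Share of the document's tokens that are `word`."""
    tokens = document.lower().split()
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, documents):
    """log(N / df): words found in fewer documents get a larger weight."""
    df = sum(1 for doc in documents if word in doc.lower().split())
    return math.log(len(documents) / df) if df else 0.0

def tf_idf(word, document, documents):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

# Hypothetical mini-corpus of three headlines.
docs = [
    "the president signs the bill",
    "the senate debates the bill",
    "the aliens sign the treaty",
]

print(tf_idf("the", docs[0], docs))     # word present in every document
print(tf_idf("aliens", docs[2], docs))  # word present in a single document
```

A word such as "the" that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing, while a word that occurs in only one document keeps a positive weight.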
from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
vectorization.fit(text)
print(vectorization.idf_)
print(vectorization.vocabulary_)
Words that are present in every data will have very low IDF value and using that
example = text[0]
example
example = vectorization.transform([example])
print(example.toarray())
Passive is used when the prediction is correct and there is no change in the model.
But if the prediction is not correct, the aggressive part is called, which changes
the model accordingly.
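The passive and aggressive behaviour can be illustrated with the basic update rule of the passive-aggressive family, written here in plain Python. This is a sketch under stated assumptions, not scikit-learn's implementation: labels are encoded as -1 and +1 (for example FAKE and REAL), the feature vector is assumed to be non-zero, and the function name `pa_update` is ours. scikit-learn's classifier additionally exposes a regularization parameter C that limits the step size.

```python
def dot(a, b):
    """Dot product of two equally long lists of numbers."""
    return sum(ai * bi for ai, bi in zip(a, b))

def pa_update(w, x, y):
    """One passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                            # passive: margin is fine, keep the weights
    tau = loss / dot(x, x)                  # aggressive: smallest step that fixes the margin
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)        # margin violated, so the weights move
w_again = pa_update(w, [1.0, 0.0], +1)  # same example again: now nothing changes
print(w, w_again)
```

The second call leaves the weights untouched because the first update already moved them just far enough to classify that example with the required margin.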
import os
os.chdir(r"D:\Books\Fake_News_Detection-master")  # raw string keeps the backslashes literal
The os module lets the Python program interact with the operating system.
import pandas as pd
dataset = pd.read_csv('news.csv')
dataset.head()
x = dataset['text']
y = dataset['label']
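from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix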
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
y_train
vectorization = TfidfVectorizer(stop_words='english', max_df=0.7)
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
max_df sets an upper limit on document frequency: with max_df=0.7, words that appear in more than 70% of the documents are ignored.
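As a rough illustration of the max_df cutoff, the sketch below computes the document frequency of each word by hand and drops every word found in more than 70% of the documents. The helper name `filter_by_max_df` and the sample sentences are assumptions made for this example; the vectorizer performs this filtering internally.

```python
def filter_by_max_df(documents, max_df=0.7):
    """Return the vocabulary after dropping words found in more than max_df of the documents."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    vocabulary = set().union(*tokenized)
    n_docs = len(tokenized)
    kept = set()
    for word in vocabulary:
        doc_freq = sum(1 for tokens in tokenized if word in tokens)
        if doc_freq / n_docs <= max_df:
            kept.add(word)
    return kept

# Hypothetical mini-corpus of three headlines.
docs = [
    "the president signs the bill",
    "the senate debates the bill",
    "the aliens sign the treaty",
]

vocab = filter_by_max_df(docs)
print(sorted(vocab))
```

Here "the" appears in all three documents (100%) and is removed, while "bill" appears in two of three (about 67%) and is kept.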
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(xv_train,y_train)
y_pred = classifier.predict(xv_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
cf = confusion_matrix(y_test,y_pred,
labels=['FAKE','REAL'])
print(cf)
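For readers who want to see what the accuracy score and the confusion matrix measure, the following plain-Python sketch computes both by hand. The five true and predicted labels are made up for illustration, and the layout (rows are true labels, columns are predicted labels) follows the convention scikit-learn uses.

```python
def accuracy(y_true, y_hat):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_hat) if t == p)
    return correct / len(y_true)

def confusion(y_true, y_hat, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_hat):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels, not results from the report.
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_hat = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

print(accuracy(y_true, y_hat))
print(confusion(y_true, y_hat, ["FAKE", "REAL"]))
```

The diagonal of the matrix holds the correct predictions, so the accuracy is the sum of the diagonal divided by the number of samples.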
def fake_news_det(news):
    # Vectorize a single news text and print the predicted label
    input_data = [news]
    vectorized_input_data = vectorization.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)
fake_news_det("""Go to Article
import pickle
pickle.dump(classifier,open('model.pkl', 'wb'))
def fake_news_det1(news):
    # Vectorize a single news text and print the predicted label
    input_data = [news]
    vectorized_input_data = vectorization.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)
terrorism.""")
In this project, titles of news articles found on the internet are used to determine
whether a news item is fake or real. We are using LSTM to help classify them into either
import numpy as np
import pandas as pd
import json as j
import urllib
import gzip
import nltk
nltk.download('stopwords')
from keras.callbacks import ModelCheckpoint
from keras.layers import BatchNormalization
Data scanning and parsing: Data is loaded from a CSV file, fake_or_real_news.csv.
This consists of the title and text of a select group of news articles, plus
a label field which indicates whether the news is real or fake. In this code block,
we scan the CSV and clean the titles to filter out stop words and punctuation.
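The cleaning idea can be shown in a few lines of standard-library Python. This is a simplified sketch: the stop word set below is a small hypothetical list rather than NLTK's English list, the function name `clean_title` is ours, and stemming is left out.

```python
import string

# Small hypothetical stop word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "in"}

def clean_title(title):
    """Lowercase the title, strip punctuation and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = title.lower().translate(table).split()
    return " ".join(word for word in words if word not in STOP_WORDS)

print(clean_title("BREAKING: The Senate is in chaos!"))
```

Removing punctuation before the stop word check matters, since otherwise a token such as "the," would not match the entry "the".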
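import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def clean_text(text):
    # Split into words, strip punctuation, drop stop words and stem what is left
    text = str(text)
    text = text.split()
    words = []
    for word in text:
        exclude = set(string.punctuation)
        word = ''.join(ch for ch in word if ch not in exclude)
        if word in stops:
            continue
        try:
            words.append(ps.stem(word))
        except UnicodeDecodeError:
            words.append(word)
    text = " ".join(words)
    return text.lower()

stops = set(stopwords.words("english"))
ps = PorterStemmer()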
f = pd.read_csv('news.csv')
We take the news titles and divide the train and test set. We also clean the text.
f = f[1:100]
X_cleaned_train[0]
Tokenizer : Tokenizer is used to assign indices to words, and filter out infrequent
words. This allows us to generate sequences for our training and testing data.
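What the tokenizer does can be approximated in plain Python: count the words, give the most frequent word index 1, and drop every word whose index is not below the word limit. The function names and the sample titles are assumptions for this sketch, and the way ties between equally frequent words are ordered may differ from the Keras Tokenizer.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their indices, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Hypothetical cleaned titles.
titles = ["trump wins election", "election results fake", "fake election claims"]

word_index = fit_word_index(titles)
sequences = texts_to_sequences(titles, word_index, num_words=3)
print(word_index)
print(sequences)
```

With a limit of 3, only the two most frequent words ("election" and "fake") survive, which is how infrequent words are filtered out of the sequences.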
from keras.preprocessing.text import Tokenizer
MAX_NB_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_cleaned_train +
X_cleaned_test)
train_sequence = tokenizer.texts_to_sequences(X_cleaned_train)
test_sequence = tokenizer.texts_to_sequences(X_cleaned_test)
from gensim.models import KeyedVectors
EMBEDDING_FILE = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
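word_index = tokenizer.word_index
# Assumed reconstruction: one 300-dimensional row per word, left as zeros when the word has no vector
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    try:
        embedding_vector = word2vec[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        continue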
Building the Model: The model is created using an Embedding layer, LSTM,
Dropout, and Dense layers. We train it for 20 epochs.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, MaxPooling1D
kVECTORLEN = 50
model = Sequential()
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))  # sigmoid keeps the output in (0, 1), as binary_crossentropy expects
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
print(model.summary())
from keras.preprocessing import sequence
train_sequence = sequence.pad_sequences(train_sequence, maxlen=50)
test_sequence = sequence.pad_sequences(test_sequence, maxlen=50)
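Padding brings every title to the same length so that the sequences can be fed to the network as one array. The sketch below mimics the default behaviour of Keras pad_sequences in plain Python: zeros are added at the front of short sequences, and long sequences keep only their last items. The helper name `pad_sequence` and the sample sequences are ours.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_sequence([4, 7], 5)
long_ = pad_sequence([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long_)
```

Index 0 is reserved for padding, which is why the word indices produced by the tokenizer start at 1.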
batch_size=64)
verbose=0)
accuracy = (scores[1]*100)
print("Accuracy: {:.2f}%".format(scores[1]*100))
Analyzing the Data: The graphs below show the change in accuracy and loss over the training epochs.
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
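from flask import Flask, render_template, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
import pickle
import pandas as pd
app = Flask(__name__)
vectorization = TfidfVectorizer(stop_words='english', max_df=0.7)
# Load the classifier that was saved earlier with pickle.dump
loaded_model = pickle.load(open('model.pkl', 'rb'))
dataset = pd.read_csv('news.csv')
x = dataset['text']
y = dataset['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)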
def fake_news_det(news):
    # Fit the vectorizer on the training split so the input uses the same vocabulary
    xv_train = vectorization.fit_transform(x_train)
    xv_test = vectorization.transform(x_test)
    input_data = [news]
    vectorized_input_data = vectorization.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    return prediction
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        # Read the pasted news text from the form and classify it
        message = request.form['message']
        pred = fake_news_det(message)
        print(pred)
        return render_template('index.html', prediction=pred)
    else:
        return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
<html>
<head>
<meta charset="UTF-8">
<link
href='https://fonts.googleapis.com/css?family=Pacifico
<link
href='https://fonts.googleapis.com/css?family=Arimo'
rel='stylesheet' type='text/css'>
<link
href='https://fonts.googleapis.com/css?family=Hind:300
<link
href='https://fonts.googleapis.com/css?family=Open+San
initial-scale=1">
<style>
width: 50%;
padding: 10px;
border-radius: 1px;
box-sizing: border-box;
margin-top: 6px;
margin-bottom: 16px;
resize: horizontal;
button {
  background-color: #4CAF50;
  color: white;
  margin: 8px 0;
  border: none;
  cursor: pointer;
  width: 50%;
}
button:hover {
  opacity: 0.8;
}
h1 {
  text-align: center;
}
p {
  text-align: center;
}
div {
  text-align: center;
}
body {
</style>
</head>
<body>
<div class="login">
Nahreen(160041028) </h1>
method="POST">
required="required" style="font-size:
18pt"></textarea>
<br> </br>
btn-block btn-large">Predict</button>
<div class="results">
{% if prediction == ['FAKE']%}
<h2 style="color:red;">Looking
Spam⚠News📰
</h2>
News📰</b></h2>
{% endif %}
</div>
</form>
</div>
</p>
</body>
</html>
8. Key Insights
The passive aggressive model achieves 93% accuracy. When we input the news
text on the interface, it correctly identifies the news most of the time. We tested this
by using news from The Onion. The Onion is a satire ‘news’ portal that posts fake
funny news. When we pasted some of the news from the site into our web interface,
it was correctly identified as fake. When we tested news from the
BBC or The New York Times, it was correctly identified as real. The accuracy
of the LSTM model was much lower, so we went with the Passive Aggressive
Classifier.
9. Conclusion
Our project can ring the initial alert for fake news. The model produces worse
the interface provides an easier way for the average person to check the
authenticity of a news item. Projects like this one with more advanced features should
10. Future Work
There are many ways in which this project could be improved. Introducing a
cross-checking feature on the machine learning model, so that it compares the news inputs
with reputable news sources, is one way to go. It has to be online and done in
real time, which will be very challenging. Improving the model accuracy using
bigger and better datasets, integrating different machine learning algorithms is also
11. References
[2] T. Chen, L. Wu, X. Li, J. Zhang, H. Yin, and Y. Wang. Call attention to
[3] Y.-C. Chen, Z.-Y. Liu, and H.-Y. Kao. IKM at SemEval-2017 Task 8:
2017.
[7] Bahad, P., Saxena, P. and Kamal, R., 2019. Fake News Detection using Bi-
pp.74-82.
[9] Fake News Detection on Social Media: A Data Mining Perspective. Kai Shu,
[10] CSI: A Hybrid Deep Model for Fake News Detection.
Identifying the signs of fraudulent accounts using data mining techniques. Shing-Han Li, David C. Yen
[11] Automatic Deception Detection: Methods for Finding Fake News. Niall J.
Available: https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428.
10 06 2017.
[Online]. Available: https://www.bonaccorso.eu/2017/10/06/ml-algorithms-addendum-passive-aggressive-algorithms/