Fake News Detection Natural Language Processing
Chrysovalantis Christodoulou
University of Cyprus
April 2019
Acknowledgment
Moreover, I would like to thank the Ph.D. candidate Demetris Paschalidis for his excellent guidance and for the opportunity he gave me to learn an incredible number of technologies.
Last but not least, I would like to acknowledge and thank my family for their valu-
able help during my studies. Specifically, I would like to thank my parents Stelios
and Maria and my sweetheart Elena for all the psychological and physical aid they
graciously provided to me.
Abstract
The recently increased focus on misinformation has spurred research in fact checking, the task of assessing the truthfulness of a claim. People fall victim to fake news in their daily lives and contribute to its further spread, intentionally or recklessly. The colossal propagation of information makes an autonomous fake news detection system more necessary than ever before. Despite the undeniable need, the problem remains poorly covered because of its complexity. In this thesis, we focus on fake news detection using Natural Language Processing characteristics, and we examine the impact of a variety of different features. To do so, we gather an enormous dataset and extract an extensive set of features. We divide those features into three main domains and six subdomains. We then perform a data visualization step to gain a further understanding of our features, and apply five feature selection techniques to them. We end up with the twenty most prominent features and train a deep learning model with them. Moreover, we train the model with different collections of features and compare those results against the top twenty features. Using the top twenty features, we achieve a state-of-the-art outcome with an F1-score of over 93%.
Contents

1 Introduction
    1.1 Motivation
    1.2 Challenges
    1.3 Contributions
    1.4 Outline
2 Related work
    2.1 Fake News Detection
    2.2 Natural Language Processing
    2.3 Feature Extraction
    2.4 Feature Selection
3 Methodology
    3.1 Methodology Overview
    3.2 Data collection
    3.3 Features Extraction
        3.3.1 Dictionary Features
        3.3.2 Complexity Features
        3.3.3 Stylistic Features
    3.4 Data Visualization
    3.5 Feature Selection
    3.6 Deep Neural Network
4 Experiment
    4.1 Classifier Performance
    4.2 Comparisons
        4.2.1 Experiment Setup
        4.2.2 Results
    4.3 TOP 20 Features Optimization
5 Conclusion
    5.1 Conclusion
    5.2 Future Work
Chapter 1
Introduction
1.1 Motivation
In recent years, fake news has attracted growing interest from the general public and researchers as the distribution of misinformation online advances, particularly in media outlets such as social media feeds, news blogs, and online newspapers. Journalists have dealt with the spread of misinformation since the previous century, and for a long time they did not face tremendous obstacles. The evolution of the Internet has reinvented not only journalists' work but also the way people inform themselves. Nowadays, we have moved away from the daily newspapers and have 24/7 instant information from a multitude of different sources, many of them unsigned. The colossal propagation of information has made the need for an autonomous fake news detection system more imperative than ever before. However, the spark that lit up the increasing interest in the spread of misinformation was the 2016 U.S. elections.
Governments and the general public woke up to the fact that the election of the president of one of the most powerful countries in the world was affected by fake news dissemination. Figure 1.1 displays the increase in people searching for fake news on Google, and the contrast between the periods before and after 2016 is crystal clear. The peak on 14/01/2018 is due to tweets by President Donald Trump, who attacked the Wall Street Journal as 'fake news' over his North Korea comments.
Despite the undeniable need for an autonomous fake news classification tool, there is very little work covering the demand, and the reason for that is the complexity of the problem. It is tough even for humans to detect fake news. It can be claimed that the only way for a person to manually identify fake news is to have a vast knowledge of the covered topic. Furthermore, misinformation identification concerns many different disciplines, which may use inconsistent terminology and may not even know of each other. Vlachos and Thorne [1] identified this issue and composed an article to bridge the gap between those disciplines and gather the various approaches. In addition, they state the difficulties of each approach and propose future NLP research on automated fact-checking. In this research, we focus on how Natural Language Processing can help us classify fake news articles.
1.2 Challenges
By definition, fake news detection is a challenging issue from every perspective. Even for a journalist, confirming the veracity of a news article requires thorough research on the topic. Moreover, a news article might not be entirely false, or may not be completely true, which leads to a dilemma about its classification. Another challenge is that fraudulent statements are often written by expert journalists chosen to bias public opinion in the service of political interests, as happened in the 2016 U.S. elections. The major issue is that those journalists are experts at avoiding common mistakes, such as grammatical or syntactic errors, and at hiding their purpose behind strong and powerful words. Thus, our natural language analysis becomes harder and harder, and we have to take those parameters into account.
The collection of data is another challenging part of the research, because there are not enough novel datasets to train a deep learning model, and a lot of "digging" is needed to find something relevant. Then you have to be aware that in a novel dataset some of the data might need preparation or deletion because it might contain irrelevant information. For example, our dataset includes some Russian articles, which do not concern our work because we are working on English natural language processing.
Along with the previous challenges, we had to deal with some technical restrictions. As we will mention later, our research contributes to Check It, which is a browser plugin. Check It's restriction is that every calculation must be performed in the client's browser, because the project wants to avoid any GDPR issues and assure users that nothing from their navigation is sent to Check It for further processing. That leads to a serious processing limitation and forces us to implement a lightweight, fast, and accurate method. Thus, we had to turn down features with high processing cost, such as n-grams and TF, and ensure the fastest possible implementation of every feature we used. Another challenge we had to address because of that demand is the extraction of every feature in JavaScript, which is not established for natural language processing and contains a limited number of libraries on the topic. Thus, we extracted every feature in Python, which is a more suitable programming language for that purpose, and then transformed the code into JavaScript. We then compared the Python and JavaScript features to ensure the validity of our results. Moreover, there was a need to train the model using features extracted in JavaScript, and thus we implemented a NodeJS project to address it.
1.3 Contributions
The ultimate goal of our research is to examine how natural language processing
characteristics will help in this new age problem called Fake News automated de-
tection. Therefore, we analyze tones of news articles and we extract an enormous
amount of NLP features. We divide those features into three major classes: Dictio-
naries, Complexity, and Stylistic features and then we redivide those categories into
six subcategories in order to examine separately and combined their contribution.
The distribution of our features into different classes aims to organize natural lan-
guage processing features and help the researches decide a suitable category in their
case. Furthermore, we implement some very effective feature selection techniques to
analyze our feature as a union and discover the combination of them will produce
the best outcome. We showed that feature selection techniques help us gain the
best features of our dataset and deliver the best result. In addition, we examine
the impact of each subcategory separately and combined by training our model and
Moreover, we have the honor of contributing to a large-scale project funded by Google called Check It, a browser extension that aims to inform users about fake news articles. Check It is a system that intelligently combines a variety of signals, such as domain flag-lists, online social networks, and natural language processing characteristics, into a pipeline for fake news identification. To sum up, our contributions are as follows:
• Examine the effect of different natural language processing features in fake news classification.
1.4 Outline

Chapter 1: Introduction
The introduction presents the motivation of our research and makes clear how significant addressing the fake news problem is. Moreover, it explains the contribution of our study and the challenges we faced during our research.
Chapter 3: Methodology
The methodology chapter defines our methodology and explains each step we made very precisely. We describe how we found our datasets and their role in our study. Moreover, we analyze the feature extraction process and how we divide our features into three main categories: Dictionary, Stylistic, and Complexity features. Then we present how we visualized our features to gain a further understanding of them, and we describe the five feature selection methods we implemented to analyze the importance of our features. Finally, we describe, from an abstract point of view, the structure of our deep neural network.
Chapter 4: Experiments
Chapter 4 focuses on presenting the results of our experiments and describes precisely the metrics we used to measure them. We trained our model with several combinations of our categories and with the top 20 features from the feature selection process, and then we compare the outcomes. We present a visual comparison of our results and provide a table with specific values for each experiment. Moreover, we optimize our top-20-features result to cover the needs of Check It.
Chapter 5: Conclusion
Chapter 5 presents our concluding thoughts and summarizes them to provide a brief outcome. Finally, we propose two significant directions for future work that will extend our study and help research on fake news detection.
Chapter 2
Related work
Regarding the difficulty of declaring an article fake or real, some other studies have proposed a different approach. For example, the article Five Shades of Untruth: Finer-Grained Classification of Fake News [4] divides the content into five categories: Factual (true or mostly true), Incomplete Propaganda, Manipulative Propaganda, Hoax, and Irony. This division helped the researchers understand the differences between fake news subcategories and identify the variations in the corresponding features. A similar approach is applied by William Yang Wang, who divides fake news into six fine-grained labels for truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true. These six subcategories were arranged with the help of PolitiFact users and the evaluation of the site's editors.
Attempts to define fake news do not stop at the previous approaches. Some very effective approaches analyze network characteristics and the spread of fake news through social media to extract features that help declare the truthfulness of an article. A serious study on the spread of true and false news online was published in Science, and its results gave us a further understanding of the phenomenon called fake news. According to this research, which is based on Twitter, fake news tweets diffuse significantly farther, faster, deeper, and more broadly than the truth in all categories of information. Moreover, most fake tweets come from unverified users with young accounts and fewer followers, and the topic of the tweet usually concerns politics. However, although fake tweets diffuse faster, their life span is shorter due to the eventual revelation of the truth.
All of the above approaches have a significant impact on the declaration of fake news articles; however, in this thesis we focus on natural language processing characteristics.
Despite the fact that convolutional neural networks are most commonly applied to analyzing visual imagery, they show promising results in natural language processing too. In particular, due to their ability to recognize patterns independently of their position, they can easily determine specific sentences or phrases that are great indicators of the topic of an article [10]. In addition, they show promising results in sentiment classification [11], short-text categorization [12], and modeling the relation between character sequences and part-of-speech tags [13].
The advantage they have in recognizing sentences and phrases regardless of their position becomes a disadvantage when we want to keep the structure of the text unaffected. This gap can be covered by Recursive Neural Networks [14] and Recurrent Neural Networks [15], which allow working with structured input and achieve great results in the natural language processing area.
Recurrent models have been shown to produce very strong results for language modeling, as well as for machine translation, dialog state tracking, dependency parsing, response generation, sentiment analysis, sequence tagging, noisy text normalization, and modeling the relation between character sequences and part-of-speech tags.
Most of the time, when we have to extract textual features, we first have to apply some data preparation methods. According to Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques [17], the text needs to be subjected to certain refinements such as stemming, stop-word removal, and tokenization. Tokenization can be word, sentence, or even paragraph tokenization and helps us split the text into tokens of individual words, sentences, or paragraphs. Stop-word removal is an action applied after tokenization, as the article says, and the goal of that step is to remove insignificant words of a language like "a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, these, this, too, was, what, when, where, will", etc. Those words may create noise in text classification because they are commonly used and usually do not offer anything important to the classification process. Stemming is the process of transforming the tokens into a standard form, which means changing the word
from its current form into its original form. The outcome of that process is a reduction of word types or classes in the data. For example, the words "Running", "Ran", and "Runner" will be reduced to the word "run". The authors use the Porter stemmer, which is the most commonly used and trusted stemming algorithm.
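The preparation steps above can be sketched in a few lines of Python. This is a simplified sketch: the stop-word list is just the sample quoted above, and the `stem` function is a deliberately crude suffix-stripper standing in for the Porter stemmer, which a real pipeline would use (e.g. via NLTK).

```python
import re

# Small stop-word list taken from the sample quoted above; real pipelines
# use a fuller list (e.g. NLTK's English stop words).
STOP_WORDS = {"a", "about", "an", "are", "as", "at", "be", "by", "for",
              "from", "how", "in", "is", "of", "on", "or", "that", "these",
              "this", "too", "was", "what", "when", "where", "will"}

def tokenize(text):
    """Word tokenization: split the text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop insignificant words that add noise to classification."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping; a real system would use the Porter stemmer."""
    for suffix in ("ning", "ing", "ner", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = remove_stop_words(tokenize("The runner was running as he ran"))
print([stem(t) for t in tokens])
```

Note how both "runner" and "running" collapse onto the same reduced form, which is exactly the reduction of word types the text describes.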
However, linguistic features do not stop at n-grams. Another popular feature, according to [20], is the analysis of punctuation in a text. In the journal article Automatic Detection of Fake News, the authors calculate the number of periods, commas, dashes, question marks, and exclamation marks in the text, and the results confirm Rubin. In the same article, they also extract Psycholinguistic, Readability, and Syntax features. Regarding psycholinguistic features, the authors used the famous LIWC (Linguistic Inquiry and Word Count) software, a large lexicon that helps us understand the sentiment of a text. Readability features indicate text understandability using the number of complex words, the number of syllables, and some widely used readability index metrics such as Flesch-Kincaid, Flesch Reading Ease, Gunning Fog, and the Automated Readability Index (ARI). Moreover, they also produce CFG (Context-Free Grammar) derived features, which are a set of recursive rewriting rules used to generate patterns of strings.
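The punctuation counts described above reduce to simple string counting; a minimal sketch follows, where the feature names are our own, not the article's.

```python
# Punctuation-based stylistic features in the spirit of the counts above:
# periods, commas, dashes, question marks, and exclamation marks.
def punctuation_features(text):
    return {
        "periods": text.count("."),
        "commas": text.count(","),
        "dashes": text.count("-"),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
    }

print(punctuation_features("Shocking!!! You won't believe this, will you?"))
```

A clickbait-flavored sentence like the one above scores high on exclamation marks, which is the kind of signal these features are meant to capture.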
The authors of the article trained a model first using each feature category individually and then using all the features combined. They then compared the results of the model and reached some conclusions. The combination of all features gave them the second-best accuracy, 74%, but first place was held by the readability index features with 78%. The worst accuracy belonged to the psychological features, with 56%.
There are several feature selection techniques, such as filter methods, which provide a complete ordering of the features using a relevance index. Some classical test statistics, such as the T-test, F-test, and Chi-squared test, are considered filter methods. Those methods, according to the book, select features without optimizing the performance of a predictor. On the other hand, wrapper and embedded methods involve the predictor as part of the selection process. Wrappers divide the features into subsets and use the predictor to calculate the predictive accuracy of each subset. Embedded methods perform feature selection in the process of training and are usually specific to given learning machines.
A famous method for feature selection is individual relevance ranking, which tests how much an individual feature affects the model's prediction. However, this method carries a serious disadvantage: features that are not individually relevant may become relevant in the context of others, and features that are individually relevant may not all be useful because of possible redundancies. This justifies the use of multivariate methods, which make use of the predictive power of features considered jointly rather than independently. The Relief method is considered one of the most commonly used methods for calculating the impact of subsets on a model. Moreover, it has the advantage of taking feature redundancy into account.
Two well-known greedy methods are the Forward and Backward selection algorithms. Both are greedy algorithms and consume a lot of time and resources to deliver an outcome. Although both methods are time- and resource-consuming, they provide very good results, because they examine many combinations of features and return the one that produces the best result. The only difference between the two methods is the way they build the subsets. The forward algorithm starts from an empty model and iteratively adds the predictor that most improves it, whereas the backward algorithm begins with the full least squares model containing all predictors and then iteratively removes the least useful predictor, one at a time.
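A greedy forward pass of the kind described above can be sketched as follows. The `score` callable stands in for training and evaluating the predictor on a candidate subset, and the feature names in the toy example are hypothetical, not the thesis's actual features.

```python
# Greedy forward selection: at each step, add the feature that most improves
# the score; stop when no remaining feature helps.
def forward_selection(features, score):
    selected, best = [], score([])
    while True:
        candidates = [f for f in features if f not in selected]
        gains = {f: score(selected + [f]) for f in candidates}
        if not gains:
            break
        top = max(gains, key=gains.get)
        if gains[top] <= best:
            break  # no remaining feature improves the score
        selected.append(top)
        best = gains[top]
    return selected

# Toy score: rewards the subset {"n_lines", "afinn"} (hypothetical names),
# with a small penalty per feature to mimic overfitting on large subsets.
useful = {"n_lines", "afinn"}
toy_score = lambda subset: len(useful & set(subset)) - 0.1 * len(subset)
print(forward_selection(["n_lines", "ttr", "afinn", "fog"], toy_score))
```

Backward selection is the mirror image: start from the full feature list and repeatedly remove the feature whose removal hurts the score least.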
Chapter 3
Methodology
not focusing on the construction and training of the deep neural network, and we use it as a black box. However, our job is to ensure the validity of the extracted features and, by examining their importance, to feed the model with the most significant features. Achieving these steps, we will increase the accuracy of the model and reduce the possible noise from irrelevant features. In conclusion, we want to guarantee the best possible training for the deep neural network model.
The PolitiFact and BuzzFeed datasets were used only for evaluating the correctness of our feature extraction process and our deep learning model. Both datasets have an inadequate number of records for training a machine learning model. However, they are very useful for testing our model, because they are tagged very precisely by journalists and we are utterly positive about the veracity of each article.
Kaggle is a platform for predictive modeling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. The dataset we found on Kaggle contains 20.8k balanced fake and reliable news articles, labeled using the B.S. Detector plugin. B.S. Detector searches all links on a given webpage for references to unreliable sources, checking against a manually compiled list of domains. It then provides visual warnings about the presence of questionable links or the browsing of questionable websites. The Kaggle dataset contains a quite large number of reliable and unreliable news articles. An advantage of this dataset is that it includes both the title and the content of each article, giving us the opportunity to investigate not only the content of an article but also the combination of title and content. We used this dataset for training and testing the model.
Example records from the Kaggle dataset:

ID: 18
Title: FBI Closes In On Hillary!
Author: The Doc
Text: We now have 5 separate FBI cases probing the Hillary-Bill Clinton inner circle...
Label: 1 (unreliable)

ID: 1
Title: FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart
Author: Daniel J. Flynn
Text: Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination...
Label: 0 (reliable)
The Fake News Corpus dataset is the largest dataset we found and contains more than 9 million articles. However, these articles originate from a curated list of 1001 domains collected from opensources.co. The entries are divided into 12 groups: fake news, satire, extreme bias, conspiracy theory, rumor mill, state news, junk science, hate news, clickbait, political, and credible. In this research we collect only the fake articles, which number 928,083, and the credible articles, which number 1,920,139.
Dictionary features are based on well-studied word counts that are associated with various psychological processes and basic sentiment analysis. We use several dictionaries, such as Laver & Garry, Loughran-McDonald, Martindale's Regressive Imagery Dictionary (RID), and the AFINN dictionary, to analyze the tone, the sentiment, and even the personal concerns of the writer. The Laver & Garry dictionary was developed to estimate the policy positions of political actors in the United Kingdom by comparing their speeches and written documents to keywords found in the British Conservative and Labour manifestos of 1992. Words were chosen semantically, based on how they relate to specific content categories, as well as empirically, based on how specific they are to a particular political party. The dictionary holds 415 words and word patterns stored in 19 categories. The Loughran-McDonald dictionary classifies words into 7 sentiment categories (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining). The English Regressive Imagery Dictionary (RID) is composed of about 3200 words and roots attached to 29 divisions of primary process cognition, 7 categories of secondary process cognition, and 7 categories of emotions. Moreover, it is very important to mention the AFINN dictionary, a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive) by Finn Arup Nielsen. The AFINN lexicon appears to play a significant role in fake news classification.
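Scoring a text with AFINN reduces to a lookup over the lexicon and an aggregate such as the mean valence of matched terms. The mini-lexicon below contains only a handful of illustrative entries, not the full AFINN word list.

```python
# AFINN-style scoring sketch: each term carries an integer valence in
# [-5, +5]; the text's score is the mean valence over matched terms.
AFINN_SAMPLE = {"good": 3, "great": 3, "bad": -3, "fraud": -4, "win": 4}

def afinn_score(text):
    words = text.lower().split()
    hits = [AFINN_SAMPLE[w] for w in words if w in AFINN_SAMPLE]
    return sum(hits) / len(hits) if hits else 0.0

print(afinn_score("great win for a bad cause"))  # (3 + 4 - 3) / 3
```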
Complexity features supply us with some truly valuable signals, using techniques that capture the readability index and the vocabulary richness of the text. The readability index measures the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content: it reflects the words we choose and how we put them into sentences and paragraphs for readers to comprehend. The more complex the vocabulary and syntax, the lower the readability score, because the text becomes understandable to fewer people. Some well-known methods to calculate a readability index are the following: Flesch, Flesch-Kincaid, McLaughlin's SMOG formula, Coleman-Liau, Automated Readability Index, Dale-Chall formula, Gunning fog formula, and the Linsear formula. Vocabulary richness is highly correlated with word frequency in a text. There are several techniques to measure vocabulary richness, such as TTR, Brunet's W, Honore's R, Sichel's S, and Yule's K. Most of these techniques take into account the number of unique words in a text.
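Two of these complexity features can be sketched directly: the type-token ratio (TTR, unique words over total words) and the Flesch Reading Ease formula, 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words). The syllable counter below is a crude vowel-group heuristic rather than a dictionary-based one, so the Flesch scores are approximate.

```python
import re

def words(text):
    return re.findall(r"[A-Za-z']+", text.lower())

def type_token_ratio(text):
    """TTR: unique word types divided by total word tokens."""
    w = words(text)
    return len(set(w)) / len(w) if w else 0.0

def syllables(word):
    # Approximate syllables as runs of vowels; at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_reading_ease(text):
    w = words(text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syl = sum(syllables(x) for x in w)
    return 206.835 - 1.015 * (len(w) / sentences) - 84.6 * (syl / len(w))

print(type_token_ratio("the cat saw the cat"))  # 3 unique / 5 total = 0.6
```

Higher Flesch scores mean easier text; longer sentences and more syllables per word push the score down, matching the intuition above.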
Stylistic features reflect the style of the writer and help us understand the syntax of the text. In this research, we extract more than 90 stylistic features and discover that some of them play a significant role in fake news classification. Moreover, only stylistic features provide us the ability to study the title's impact on the veracity of an article. Dictionary and complexity features rely on extensive texts to provide accurate results and are not suitable for a title's content. Some examples of stylistic features are the number of lines, the number of words, the average number of words beginning with a capital letter, the ratio of digits, and the number of stopwords in a text.
and offer him/her a better awareness of the features. We used the R programming language, with the help of the ggplot library, to produce our data visualizations. We produced two types of graphs: bar mean plots and box plots. A bar mean plot displays the central value of a discrete set of numbers. Box plots can inform you about your outliers and what their values are. They can also tell you whether your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed. Figure 3.3 shows exactly the form of a box plot.
A feature that stands out is the average number of all-capital words in a sentence, suggesting that fake news articles may use more capitalized words in order to attract the user's attention. Figure 3.4 shows the average number of all-capital words in a sentence in both the PolitiFact dataset (3.4a) and the Kaggle dataset (3.4b). Both datasets agree that fake articles include a larger average number of uppercase words per sentence, which can be explained by the fact that fake articles may not have enough arguments to persuade a user and instead try to impress him/her using capital words. Capital words are connected with shouting and strong modality. Furthermore, capital words are usually used in titles to attract users to click an article and steal their attention from their regular activity.
Figure 3.5 sketches the box plot of the AFINN word score for the PolitiFact dataset.
As the graph displays, the fake news interquartile range is greater than the reliable one and lies near zero on the x-axis, which means that most of the fraudulent articles possess an AFINN word score near zero. In contrast, reliable news favors the positive side of the x-axis, showing that most real articles have a positive AFINN word score. Moreover, fake news appears to have more outliers, and its distribution of values is sparser than that of real news, which may mean that real news is more homogeneous than fake news. Those conclusions can easily be extracted from data visualization; although the feature selection process also helps you understand your data, data visualization is
4. Features with low importance: The same feature importances used in the above method are also utilized here. Features with the lowest importance do not contribute to a specified total importance. This follows the same idea as Principal Component Analysis (PCA), where it is common to keep only the principal components needed to retain a certain percentage of the variance, such as 95%; the percentage of total importance accounted for is based on the same idea.
During the feature selection process for fake news, the aforementioned methods were applied to the dataset, individually for the articles' titles and content. This separation was performed because titles and content are different in nature, and combining them may affect the performance of the selection. The first method did not produce results for either content or title, meaning that the analyzed dataset does not contain any missing values. The unique-values method detected 97 features to be removed from the content features and 104 features to be removed from the title features (Figure 3.6).
The collinearity method detected 13 content features and 4 title features that are highly associated, with a correlation magnitude greater than 0.975 (Figure 3.8). Highly correlated content features include the total number of characters, the total number of words, and the total number of words beginning with a lowercase letter.
Figure 3.8: Highly correlated content features above the threshold defined as 0.975
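The collinearity filter can be sketched as follows: compute pairwise Pearson correlations and drop one feature from every pair whose absolute correlation exceeds the 0.975 threshold. The column names and values below are illustrative, not taken from our dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_collinear(features, threshold=0.975):
    """Keep the first feature of each highly correlated pair, drop the rest."""
    names = list(features)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in dropped and b not in dropped:
                if abs(pearson(features[a], features[b])) > threshold:
                    dropped.add(b)
    return [n for n in names if n not in dropped]

# Illustrative columns: n_chars is almost an exact multiple of n_words,
# mirroring the character/word-count collinearity observed above.
data = {
    "n_words": [100, 200, 300, 400],
    "n_chars": [510, 1010, 1490, 2020],
    "afinn": [0.2, -1.0, 0.4, 0.1],
}
print(drop_collinear(data))  # n_chars dropped as collinear with n_words
```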
From the initial 311 features, 149 content features and 44 title features were observed to have zero importance using the tree-based model. After the detection of the zero-importance features, the method for the detection of the low-importance features was applied. The method resulted in 182 content features and 193 title features that do not contribute to a cumulative importance of 0.99. The procedure of filtering out the features that do not contribute to the learning of the model and the successful classification of fake and real news resulted in the top 20 most important features (Figure 3.9). Moreover, Table 3.2 shows the importance score, after the feature selection process, of each feature, from the most important feature, which is the total number of lines, to the twentieth most important feature, which is the average number of stopwords per sentence in a title.
3.6. DEEP NEURAL NETWORK CHAPTER 3. METHODOLOGY
Table 3.2: Table with the 20 most important features as resulted from the feature
selection process.
[Figure: architecture of the deep neural network — an input layer, five dense layers of 512, 256, 128, 64, and 32 neurons, and a classification layer outputting Preal and Pfake.]
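The depicted architecture — five dense layers of 512, 256, 128, 64, and 32 neurons feeding a two-way classification layer — can be sketched as a forward pass. This is purely illustrative: the weights are random and untrained, and the helper names are ours, not the thesis's implementation.

```python
import math
import random

random.seed(0)

def dense(inputs, n_out, activation):
    """One fully connected layer with small random weights (untrained sketch)."""
    n_in = len(inputs)
    outputs = []
    for _ in range(n_out):
        weights = [random.uniform(-0.1, 0.1) for _ in range(n_in)]
        z = sum(w * x for w, x in zip(weights, inputs))
        outputs.append(activation(z))
    return outputs

def relu(z):
    return max(0.0, z)

def softmax(zs):
    exps = [math.exp(z - max(zs)) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features):
    h = features                       # input layer: the 20 selected features
    for width in (512, 256, 128, 64, 32):
        h = dense(h, width, relu)      # five hidden dense layers
    logits = dense(h, 2, lambda z: z)  # classification layer
    p_real, p_fake = softmax(logits)
    return p_real, p_fake

p_real, p_fake = forward([0.5] * 20)
```

The softmax output gives the two class probabilities, which always sum to one.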
Chapter 4
Experiment
Contents
4.2 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2. False Positives (FP): The total number of inaccurate predictions that were
“positive.” In our example, this is the total number of articles wrongly predicted
as fake.
4.1. CLASSIFIER PERFORMANCE CHAPTER 4. EXPERIMENT
3. True Negatives (TN): The total number of accurate predictions that were
“negative.” In our example, this is the total number of articles correctly predicted
as non-fake.
4. False Negatives (FN): The total number of inaccurate predictions that were
“negative.” In our example, this is the total number of articles incorrectly predicted
as non-fake.
Explaining those metrics leads to the Accuracy score, which is calculated using
equation 4.1. Accuracy is only informative when the two possible outcomes (an
article being fake or not) are balanced. For example, if we have a dataset where
5% of the articles are fake, then a far less sophisticated model could obtain a
better accuracy score: predicting every article as non-fake achieves 95% accuracy.
Such class imbalance makes accuracy an unreliable performance metric; this
phenomenon is referred to as the “Accuracy Paradox.”
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (4.1)
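The accuracy paradox is easy to verify with a few lines, using the 95/5 split from the example above:

```python
# 95 reliable articles, 5 fake: a "classifier" that always answers reliable
labels = ["reliable"] * 95 + ["fake"] * 5
predictions = ["reliable"] * 100

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)  # 0.95 despite detecting no fake news at all
```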
Now we return to the calculation of the F1-score, which requires the precision and
recall scores. Precision evaluates the model on its positive predictions; in our case,
a positive prediction is classifying a news article as fake. The precision score is
computed using equation 4.2.
Precision = TP / (TP + FP)    (4.2)
Recall evaluates the model against the ground-truth positive outcomes: it measures
how many of the actually positive articles are predicted as positive. The recall
score is computed using equation 4.3.
Recall = TP / (TP + FN)    (4.3)
Finally, the F1-score is the harmonic mean of Precision and Recall, and therefore
takes into account both false negatives and false positives. Thus, the F1-score is
considered a better metric than accuracy. It is computed using equation 4.4.
F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (4.4)
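Equations 4.1–4.4 translate directly into code. The counts below are taken from the confusion matrix discussed in Section 4.3 (4650 true positives, 367 false positives, 4632 true negatives, 349 false negatives, with “fake” as the positive class); the function name is ours.

```python
def scores(tp, fp, tn, fn):
    """Compute the four evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)            # equation 4.1
    precision = tp / (tp + fp)                            # equation 4.2
    recall = tp / (tp + fn)                               # equation 4.3
    f1 = 2 * (precision * recall) / (precision + recall)  # equation 4.4
    return accuracy, precision, recall, f1

# Counts from the Section 4.3 confusion matrix, "fake" as the positive class:
acc, prec, rec, f1 = scores(tp=4650, fp=367, tn=4632, fn=349)
```

These counts reproduce the F1-score of roughly 0.93 reported for the Top 20 Features model.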
4.2. COMPARISONS CHAPTER 4. EXPERIMENT
We used those metrics to evaluate our model with different categories of features.
As mentioned in the feature extraction section, we divided our features into three
main categories and those categories into six subcategories. We trained our model
with each subcategory separately, as well as with some combinations, and calculated
those metrics for each run.
4.2 Comparisons
4.2.2 Results
This section presents the results of our study and provides a summary, Table 4.1,
which contains the accuracy, precision, recall, and F1-score for each feature
combination we tried. Furthermore, we created a visual comparison of each
combination against our Top 20 Features model to give a better understanding of
each feature set's impact on the outcome. The natural language processing feature
set that stands out is the POS tags, which, as explained in the feature extraction
section, are labels assigned to each word. POS tags reach an F1-score of 0.873, the
highest of every combination we made, but still well below our Top 20 Features
model, which achieves an F1-score of 0.93. As explained in the feature selection
section, we applied five different feature selection methods, which led to the 20
best features for our dataset.
The feature category with the lowest F1-score is sentiment, with 0.767. A possible
explanation lies in the kind of articles our dataset contains: most are news articles,
which keep a similar tone whether or not they are fake and do not convey strong
emotions. Furthermore, due to performance limitations, we could not use an expert
sentiment analysis tool, which may also affect the sentiment score. The following
graphs present the visual comparison of every combination with the top 20 features.
[Figures: visual comparisons of each feature combination with the Top 20 Features model, subfigures (a)–(g).]
4.3. TOP 20 FEATURES OPTIMIZATION CHAPTER 4. EXPERIMENT
Figure 4.4 shows that out of 5000 fake articles we correctly identify 4650 as fake,
while wrongly labeling 349 as reliable. Therefore, our false negative rate is 7% and
our true positive rate is 93%. Regarding reliable news articles, we identify 4632 as
reliable and 367 as unreliable, so our false positive rate is 7% and our true negative
rate is 93%. These outcomes were obtained with the default threshold value of 50%:
if the model outputs a probability of 0.51 for an article being reliable and 0.49 for
it being fake, the article is labeled reliable.
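The decision rule can be written as a tiny helper. This is a sketch with names of our own choosing; the model's actual output interface may differ.

```python
def label(p_fake, threshold=0.50):
    """Classify as fake only when the model's fake probability
    reaches the decision threshold (default 50%)."""
    return "fake" if p_fake >= threshold else "reliable"

# With the default threshold, 0.49 fake-probability yields "reliable";
# raising the threshold makes the fake label much harder to earn.
default_call = label(0.49)
strict_call = label(0.98, threshold=0.99)
```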
As mentioned in the Introduction, this thesis contributes to a larger-scale project
called Check It. Check It has a very significant requirement: the elimination of
false positives. That is, we do not want to classify news as fake when it is reliable.
Therefore, we evaluated our model on the same dataset, but this time we increased
the threshold gradually to examine the impact on false positives. We settled on a
threshold of 99% in order to eliminate them. Figure 4.5 presents the metrics for
both thresholds in one confusion matrix. On the one hand, as the matrix shows, we
managed to eliminate all false positives because of the level of our threshold: we
classify a news article as fake only if we are 99% sure about it. On the other
hand, increasing our threshold affects the number of false negatives. Graph 4.6
plots the number of false positives and false negatives as a function of the
threshold, from 0.50 to 0.99 with a step of 0.01. As the graph shows, the more we
increase the threshold, the more we decrease the number of false positives and the
more we increase the number of false negatives. However, the decrease in false
positives is linear while the increase in false negatives is exponential. We have to
accept this trade-off because of our requirements. To evaluate the generalization
of our model and its performance with the modified threshold, an additional
evaluation was made on several authoritative news articles from sources including
The Guardian, The New York Times, CNN, and BBC. Specifically, 1158 news
articles were used as input to the DNN model with the adjusted threshold, out of
which only a single
Figure 4.6: Number of False Positives and False Negatives as a Function of Threshold
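The threshold sweep behind such a plot can be sketched as follows. The scores and labels below are toy values standing in for the real model outputs, and `sweep` is a helper name of our own.

```python
def sweep(p_fake_scores, labels, start=0.50, stop=0.99, step=0.01):
    """Count, at each threshold, the reliable articles wrongly flagged as fake
    (false positives) and the fake articles passed as reliable (false negatives)."""
    results = []
    for i in range(int(round((stop - start) / step)) + 1):
        t = round(start + i * step, 2)
        fp = sum(1 for p, y in zip(p_fake_scores, labels)
                 if p >= t and y == "reliable")
        fn = sum(1 for p, y in zip(p_fake_scores, labels)
                 if p < t and y == "fake")
        results.append((t, fp, fn))
    return results

# Toy model outputs: raising the threshold trades false positives for false negatives
p_fake_scores = [0.95, 0.60, 0.40, 0.97, 0.55]
toy_labels = ["fake", "reliable", "reliable", "fake", "fake"]
curve = sweep(p_fake_scores, toy_labels)
```

Even on this toy data, the strictest threshold removes every false positive at the cost of letting all the fake articles through.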
Chapter 5
Conclusion
Contents
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 Conclusion
Taking everything into consideration, the goal of this study was to examine the
impact of different natural language processing features and to contribute to the
fake news classification challenge. We believe we achieved this objective and offer
valuable information that will help researchers address this very tough problem.
The variety of features we used and the size of our dataset yielded a state-of-the-art
outcome. Our experiments suggest using feature selection methods to extract the
best features of a dataset and achieve the highest possible score. Moreover, we found
that POS tags have a significant impact on our classification, and we showed that
computationally inexpensive features can also achieve state-of-the-art outcomes.
The number of features we extracted, and their categorization into three main
domains and six sub-domains, aims to organize natural language processing features
and help researchers decide on a suitable category for their case. We are convinced
that natural language processing is one of the most critical keys to overcoming the
challenging issue of fake news.
5.2. FUTURE WORK CHAPTER 5. CONCLUSION
Moreover, promising future work would be to examine how our features evolve over
the years. Defenses against fake news have improved; however, the amount of false
news does not seem to decrease, mainly because some actors need to bias public
opinion. Consequently, fake news articles have evolved to avoid the common
mistakes that previously helped automated fact-checking systems detect them.
Collecting old and recent datasets and extracting their features would help us
identify the differences. The outcome of research into the evolution of natural
language features would be extremely valuable for the community and would
further our understanding of how to combat fake news.
Bibliography
[2] H. Allcott and M. Gentzkow, “Social Media and Fake News in the 2016 Elec-
tion”, Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–236, May
2017, issn: 0895-3309. doi: 10.1257/jep.31.2.211. [Online]. Available:
http://pubs.aeaweb.org/doi/10.1257/jep.31.2.211.
[3] B. D. Horne and S. Adali, “This Just In: Fake News Packs a Lot in Title, Uses
Simpler, Repetitive Content in Text Body, More Similar to Satire than Real
News”, 2017. arXiv: 1703.09398. [Online]. Available: http://arxiv.org/
abs/1703.09398.
[5] Y. Goldberg, “A primer on neural network models for natural language pro-
cessing”, Journal of Artificial Intelligence Research, vol. 57, pp. 345–420, 2016,
issn: 10769757. arXiv: arXiv:1510.00726v1.
[6] D. Weiss, C. Alberti, M. Collins, and S. Petrov, “Structured Training for Neu-
ral Network Transition-Based Parsing”, no. 2012, pp. 323–333, 2015. arXiv:
1506.06158. [Online]. Available: http://arxiv.org/abs/1506.06158.
- WMDD ’15, New York, New York, USA: ACM Press, 2015, pp. 15–19, isbn:
9781450339872. doi: 10.1145/2823465.2823467. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=2823465.2823467.
[8] G. Durrett and D. Klein, “Neural CRF Parsing”, pp. 302–312, 2015. arXiv:
1507.03641. [Online]. Available: http://arxiv.org/abs/1507.03641.
[9] W. Pei, T. Ge, and B. Chang, “An Effective Neural Network Model for Graph-
based Dependency Parsing”, pp. 313–322, 2015. doi: 10.3115/v1/p15-1031.
[10] R. Johnson and T. Zhang, “Effective Use of Word Order for Text Categoriza-
tion with Convolutional Neural Networks”, no. 2011, 2014. arXiv: 1412.1058.
[Online]. Available: http://arxiv.org/abs/1412.1058.
[15] J. L. Elman, “Finding structure in time”, Cognitive Science, vol. 14, no. 2,
pp. 179–211, 1990, issn: 03640213. doi: 10.1016/0364-0213(90)90002-E.
[16] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng, “Parsing with Compositional
Vector Grammars”, in Proceedings of the 51st Annual Meeting of the Association
for Computational Linguistics (ACL), 2013.
[17] H. Ahmed, I. Traore, and S. Saad, “Detection of Online Fake News Using
N-Gram Analysis and Machine Learning Techniques”, in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), vol. 10618 LNCS, Springer, Cham, Oct.
2017, pp. 127–138, isbn: 9783319691541. doi: 10.1007/978-3-319-69155-
8_9. [Online]. Available: http://link.springer.com/10.1007/978-3-319-
69155-8%7B%5C_%7D9.
[20] V. Rubin, N. Conroy, Y. Chen, and S. Cornwell, “Fake News or Truth? Us-
ing Satirical Cues to Detect Potentially Misleading News”, Proceedings of
the Second Workshop on Computational Approaches to Deception Detection,
2016. [Online]. Available: http://www.academia.edu/24790089/Fake_News_or_
Truth_Using_Satirical_Cues_to_Detect_Potentially_Misleading_News.
[22] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake News Detection on
Social Media”, ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36,
Sep. 2017, issn: 19310145. doi: 10.1145/3137597.3137600. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=3137597.3137600.
Appendices
Appendix A
A-1. DICTIONARY FEATURES
A-2. COMPLEXITY FEATURES
Feature: Definition

Readability Index:
- Flesch–Kincaid
- SMOG
- Automated readability index
- Dale–Chall
- Coleman–Liau: L = Letters / Words * 100 = 639 / 119 * 100 = 537;
  S = Sentences / Words * 100 = 5 / 119 * 100 = 4.20
- Gunning fog

Vocabulary Richness:
- Yule's K: the larger the number, the less rich the vocabulary of the text
- TTR: the larger the number, the richer the vocabulary of the text
- Brunet's: the larger the number, the richer the vocabulary of the text
- Sichel's: the larger the number, the richer the vocabulary of the text
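As a check on the worked Coleman–Liau example, the index can be computed from the L and S quantities defined above. The 0.0588, 0.296, and 15.8 coefficients are the standard published ones for this index; the function name is ours.

```python
def coleman_liau(letters, words, sentences):
    """Coleman-Liau readability index from the L and S quantities
    defined in the table above."""
    L = letters / words * 100    # letters per 100 words
    S = sentences / words * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# The worked example from the table: 639 letters, 119 words, 5 sentences
grade = coleman_liau(639, 119, 5)
```

The result is an approximate U.S. grade level for the text.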
A-3. STYLISTIC FEATURES
Feature: Meaning

Part-of-Speech Tags:
- CC: coordinating conjunction
- CD: cardinal digit
- DT: determiner
- EX: existential there (as in “there is”; think of it as “there exists”)
- FW: foreign word
- IN: preposition / subordinating conjunction
- JJ: adjective, e.g. ‘big’
- JJR: adjective, comparative, e.g. ‘bigger’
- JJS: adjective, superlative, e.g. ‘biggest’
- LS: list marker, e.g. 1)
- MD: modal, e.g. could, will
- NN: noun, singular, e.g. ‘desk’
- NNS: noun, plural, e.g. ‘desks’
- NNP: proper noun, singular, e.g. ‘Harrison’
- NNPS: proper noun, plural, e.g. ‘Americans’
- PDT: predeterminer, e.g. ‘all the kids’
- POS: possessive ending, e.g. parent’s
- PRP: personal pronoun, e.g. I, he, she
- PRP$: possessive pronoun, e.g. my, his, hers
- RB: adverb, e.g. very, silently
- RBR: adverb, comparative, e.g. better
- RBS: adverb, superlative, e.g. best
- RP: particle, e.g. give up
- TO: to, e.g. go ‘to’ the store
- UH: interjection, e.g. errrrrrrrm
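Features over these tags are simple per-tag counts. The snippet below assumes the tokens are already tagged (for example by a tagger such as NLTK's `pos_tag`), so the tagger itself is not shown; the feature-name prefix is our own convention.

```python
from collections import Counter

def pos_tag_features(tagged_tokens):
    """Turn a POS-tagged token sequence into per-tag count features,
    using the tag set listed above."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return {f"pos_{tag}": n for tag, n in counts.items()}

# Tokens shown pre-tagged, as a POS tagger would label them:
tagged = [("The", "DT"), ("biggest", "JJS"), ("story", "NN"),
          ("is", "VBZ"), ("fake", "JJ"), ("news", "NN")]
features = pos_tag_features(tagged)
```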
Feature: Meaning

Structural:
- total number of sentences
- total number of words
- total number of characters
- total number of begin upper: words with the first letter capitalized
- total number of begin lower: words with the first letter lowercase
- total number of all caps: words with all capital letters
- total number of stopwords
- total number of lines
- number of I pronouns
- number of we pronouns
- number of you pronouns
- number of he/she pronouns
- number of exclamation marks
- number of quotes
- number of hapax legomena
- number of hapax dislegomena
- has quoted content
- ratio alphabetic
- ratio uppercase
- ratio digit
- avg number of characters per word
- avg number of words per sentence
- avg number of characters per sentence
- avg number of begin upper per sentence
- avg number of all caps per sentence
- avg number of begin lower per sentence
- avg number of stopwords per sentence
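Several of the structural features above reduce to simple string operations. This is a sketch using naive whitespace tokenization, which may differ from the tokenizer used in the thesis.

```python
from collections import Counter

def structural_features(text):
    """Compute a handful of the structural features listed above."""
    words = text.split()
    counts = Counter(w.lower() for w in words)
    return {
        "total_words": len(words),
        "total_characters": len(text),
        "total_begin_upper": sum(1 for w in words if w[:1].isupper()),
        "total_all_caps": sum(1 for w in words if w.isalpha() and w.isupper()),
        "num_exclamation_marks": text.count("!"),
        # hapax legomena: words occurring exactly once
        "num_hapax_legomena": sum(1 for n in counts.values() if n == 1),
    }

feats = structural_features("BREAKING news News breaking everywhere!")
```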