Fake News Classification Using Machine Learning Techniques
ISSN No:-2456-2165
Abstract:- Fake news exerts a pervasive and urgent influence, causing mental harm to readers. Differentiating between fake and genuine news is increasingly tricky, impacting countless lives. This proliferation of falsehoods spreads harm and misinformation and erodes trust in global information sources, affecting individuals, organizations, and nations. It requires immediate attention. To address this issue, we conducted a comprehensive study utilizing advanced techniques such as TF-IDF and feature engineering to detect fake news. We proposed Machine Learning Techniques (MLT), including Naïve Bayes (NB), Decision Trees (DT), Support Vector Machines (SVM), Random Forest (RF), and Logistic Regression (LR), to classify news articles. Our studies involved analyzing word patterns from diverse news sources to identify unreliable news. We calculated the likelihood of an article being fake or genuine based on the extracted features and evaluated algorithm accuracy using a carefully crafted training dataset. The analysis revealed that the decision tree algorithm exhibited the highest accuracy, detecting fake news at an impressive 99.68% rate. While the remaining algorithms performed well, none surpassed the accuracy of the decision tree. This study highlights the immense potential of machine learning techniques in combating the pervasive menace of fake news. Our research presents a reliable and efficient method to identify and classify unreliable information, safeguarding the integrity of news sources and protecting individuals and societies from the harmful effects of misinformation.

Keywords:- Machine Learning, TF-IDF, Feature Extraction, Fake News Detection, Social Media.

I. INTRODUCTION

Fake news is a term used to describe inaccurate or deceptive information that is presented as genuine news. This can encompass fabricated narratives, overstated or altered facts, and deliberately misleading content [1]. With new technologies, the internet has made it possible for people to access news from all over the world, at any time and on any device. The internet, primarily through social media and other media applications, has become the primary platform for spreading fake news. Despite the abundance of information available, the truth often remains unclear [2]. The purpose behind the spread of fake news is to manipulate the audience, whether for political or commercial gain [3]. In today's digital landscape, a vast amount of news is published across various media outlets, making it increasingly challenging to discern between accurate and false information [4]. Unreliable news is created for financial or political motives, or to gain notoriety, using ideological narratives to deceive its receivers [5],[6]. This unreliable content, news manipulation, knowledge bubbles, and a lack of security on social platforms have become a pervasive disadvantage in our society.

Not only is unreliable news prevalent in traditional media, but it has also gained prominence in social forums, allowing it to spread quickly and extensively [2]. Clickbait, often with catchy headlines, is commonly used to attract readers' attention [7]. By clicking on these enticing titles, readers are led to poorly written articles with little relevance to the news they were expecting. Clickbait aims to drive more traffic to websites that rely on advertisements for revenue. An infamous example occurred during the 2016 US presidential election, where Russian trolls used clickbait to sway public opinion. This instance illustrates the considerable influence that false information can exert on important matters. Social media platforms have evolved into environments where untrustworthy news, characterized by errors, informal language, and flawed grammar, proliferates [8]. The quest for improved credibility and accuracy has created an urgent need for techniques that help users make informed decisions [6].

Websites like Snopes and PolitiFact have emerged to fact-check news articles and uphold the truth. Research studies have also developed repositories to identify genuine and fraudulent internet sources [9]. In light of these discussions, categorizing unreliable news hinges on purpose and authenticity. Authenticity refers to false news containing inaccurate information. The second factor involves deliberately manipulating the news content to deceive the audience [10].

The main challenge lies in distinguishing between fake and real news [11]. Different social media platforms recognize false news through Feature Extraction (FE), while traditional news societies rely on various factors, such as images and text, to identify and spot fake news. In terms of textual word-based sources, there are several aspects to consider:

It is essential to determine whether the news article carries the original content or just a part of it.

The authenticity of the news source needs to be evaluated; knowing who published the news is crucial.
Figure 3 provides an overview of the distribution of articles across different subjects. It includes 1,570 articles related to government news, 778 articles about the Middle East, 9,050 general news articles, 783 articles on US news, 4,459 articles categorized as left news, 11,272 articles labeled as politics news, 10,145 articles covering world news, and 6,831 articles focusing on politics. These articles are sourced from reputable outlets such as the Washington Post, New York Times, CNN, etc. This study's findings validate the proposed model's effectiveness in identifying fake news articles by analyzing their text using machine learning algorithms. This approach dramatically streamlines the decision-making process.

Moreover, Figure 4 and Figure 5 display word clouds that have been generated based on the identified fake and real news within the system, respectively. These word clouds visually represent the presence of multiple words associated with each category, offering an insight into the most frequently occurring words found in both fake and real news articles.
Table 2: Word cloud parameters for the fake and real news datasets

Parameter        Value
Target           true
Height           500
Width            800
Max_Font_Size    110
Collocations     False
Generate         all_words
Figsize          10,7
Interpolation    Bilinear
In Table 2, the most important properties for applying common words in real and fake news articles are presented. Furthermore, Figure 6 and Figure 7 provide visual representations of the distribution of these common words.
C. Lowercase transformation and stemming:
In this step, all terms in the dataset are transformed to lowercase to accommodate variations in capitalization. Moreover, stemming is applied using NLTK's WordNet stemming implementation [8]. In addition, NLTK's Snowball stemming implementation [30] is utilized to reduce phrases to their stem forms. This rule-based approach aids in reducing the word corpus while preserving the meaningfulness of the words.
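A minimal sketch of this preprocessing step, assuming NLTK's `SnowballStemmer` is applied after lowercasing (the token list is invented for illustration):

```python
from nltk.stem.snowball import SnowballStemmer

def preprocess(tokens):
    # Lowercase each term, then reduce it to its stem form with the
    # rule-based Snowball (Porter2) stemmer for English
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(tok.lower()) for tok in tokens]

stems = preprocess(["Breaking", "Stories", "Reported", "Running"])
```

Because stemming is rule-based, distinct surface forms such as "Running" and "runs" collapse to the same stem, shrinking the vocabulary before feature extraction.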
D. Feature Extraction (FE):
The main challenge in news categorization is dealing with high-dimensional data. The presence of numerous document terms, phrases, and words can lead to increased computational limitations in learning. Additionally, redundant and irrelevant features can hinder the interpretation of classifiers. Therefore, it is crucial to perform feature reduction and transform the text into numerical features that can be further processed while preserving the dataset [29].

The CountVectorizer of words describes the occurrence of terms within news articles. It assigns a value of 1 if a term is present in the sentence and 0 if it is not. This creates a bag-of-words document matrix for each text document. N-grams are combinations of adjacent terms or phrases of length "n" that can be found in the original text [31].

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used weighting metric in dataset analysis. It is a statistical measure that evaluates the importance of a phrase to a document in a news article. The importance of a phrase increases with its number of occurrences within the document but is also influenced by its frequency in the entire corpus.

Term frequency (TF) uses the occurrence counts of terms in documents to determine the similarity between documents. Each vector is then normalized so that the sum of its elements corresponds to the total word count and represents the probability of a specific phrase existing in the documents [32], as in the following equation:

TF = (Number of occurrences of a term in a document) / (Total number of terms in the document)   (1)

The IDF (Inverse Document Frequency) is computed by taking the logarithm of the ratio between the total number of documents in the corpus and the number of documents in which the specific phrase appears [3]:

IDF = log(D / (1 + DF))   (2)

Where:
D is the total number of documents in the collection.
DF is the number of documents containing the term.

For every word present in a dataset row, the value is non-zero, and if the word is not present, the value is zero [33]. The TF-IDF of a token is then calculated as:

TF-IDF = TF * IDF   (3)
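Equations (1)–(3) can be implemented directly; the toy corpus below is invented for illustration. Note that with the 1 + DF smoothing in Equation (2), a term appearing in most documents receives a weight at or near zero:

```python
import math

def tf(term, doc):
    # Eq. (1): occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Eq. (2): log(D / (1 + DF)), with D total documents and
    # DF the number of documents containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    # Eq. (3): TF * IDF
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "matters"],
    ["fake", "stories", "mislead"],
]
rare = tf_idf("spreads", corpus[0], corpus)   # rare term -> positive weight
common = tf_idf("fake", corpus[0], corpus)    # term in 2 of 3 docs -> weight 0.0
```

Here "spreads" appears in one of three documents, so it is weighted up, while "fake" appears in two of three and is damped to zero by the smoothed IDF; this is the behavior that lets TF-IDF highlight class-discriminative vocabulary.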
E. Feature engineering for fake news detection:
Feature engineering (FET) is crucial for enhancing the performance of any machine learning algorithm, including its application to extract features from datasets. Transforming the raw dataset into feature data improves the quality of the model and enables achieving sufficient accuracy [30]. FET involves converting the original values and applying them during the feature engineering step. There are various techniques available for feature engineering, and sometimes it can be unclear which methods fall under the scope of FE and which do not [37].

F. Algorithms Used for Classification:
Machine learning (ML) applied in real time during the experimentation has a rapid impact on categorizing unreliable news. We use the following ML algorithms, namely Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), SVM, and Logistic Regression (LR), to detect anomalies and analyze the effectiveness of our progressive algorithms.

Naïve Bayes (NB):
The NB algorithm provides a probabilistic model-making technique. It computes the probability of each label variable's importance for the conveyed input variable significances. Using dependent probabilities for an unseen record, the model calculates the result of all target class weights and predicts the most likely outcome. NB is a probabilistic, supervised classification algorithm, originating in the work of Thomas Bayes. It is easy to interpret and efficient for computation.

Decision Tree (DT):
The DT algorithm partitions data into two or more subsets based on the similarity of samples. It is a recursive process that splits subsets and repeats the process until a stopping condition is satisfied. Each decision node tests the values of specific data functions, and each branch corresponds to a different test outcome. Decision trees are efficient for making classifiers and can handle both categorical and continuous variables.

Random Forest (RF):
RF is a collection of tree predictors that depend on the same distribution for all trees. It uses a random vector to sample features independently in each tree. The prediction error for random forests converges, and they have the advantage of being robust against noise.

Support Vector Machine (SVM):
SVM is a classification model that helps identify patterns in data for regression and classification. It creates learning processes from class training datasets and has a sound theoretical basis. SVM requires a relatively small number of samples compared to the dimensions of the data. It addresses the problem of discriminating between components of two classes using dimensional vectors.

Logistic Regression (LR):
LR is a classification model used for predicting the outcome of a categorical dependent variable based on predictor features. It can handle numeric or categorical predictors and a categorical label. LR estimates discrete values and predicts the probability of an event occurring, with values between 0 and 1.

Evaluation Metrics:
We use various assessment measures and evaluation metrics to analyze the efficiency of the model in detecting false news articles.

Accuracy: indicates the proportion of accurate predictions relative to the number of possible ones [8].

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)   (4)

Recall: the percentage of relevant instances retrieved from the total number of relevant instances [9].

Recall = True Positive / (True Positive + False Negative)   (5)

F-measure (F1 or F-score): the harmonic mean of recall and precision [10], given by:

F-score = (2 * Precision * Recall) / (Precision + Recall)   (6)

Precision: indicates the percentage of positive predictions that are correct, computed by dividing the number of true positives by the total number of positive predictions [11].

Precision = True Positive / (True Positive + False Positive)   (7)

This section presents the output and results of identifying fake news.
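The evaluation metrics defined above follow directly from the four confusion-matrix counts; a minimal pure-Python sketch with invented counts:

```python
def evaluation_metrics(tp, tn, fp, fn):
    # Accuracy: correct predictions over all predictions
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Recall: true positives over all actual positives
    recall = tp / (tp + fn)
    # Precision: true positives over all predicted positives
    precision = tp / (tp + fp)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented counts for illustration (not the study's results)
m = evaluation_metrics(tp=90, tn=85, fp=5, fn=10)
```

Reporting all four together matters here because accuracy alone can mask a model that trades false positives for false negatives; F1 penalizes that imbalance.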
The investigation results presented in Table 5 depict the performance of various classification models in identifying fake news using TF-IDF feature extraction and feature engineering techniques. Naïve Bayes demonstrates impressive results, boasting an accuracy of 99.55%, a precision of 99.72%, a recall of 99.42%, and an F1-score of 99.56%, underscoring its exceptional performance in fake news classification. The decision tree model stands out with the highest accuracy at 99.68% and exhibits commendable precision, recall, and F1-score. In contrast, SVM delivers satisfactory performance, achieving an accuracy of 94.86%, precision of 96.75%, recall of 93.36%, and an F1-score of 95.02%, albeit not reaching the levels attained by Naïve Bayes or the decision tree model. The logistic regression model performs reasonably well, securing an accuracy score of 98.73% and displaying good precision, recall, and F1-score (99.53%, 98.78%, and 99.15%, respectively), although there is potential for higher results. The random forest model showcases high accuracy (99.03%), precision (99.37%), recall (98.78%), and F1-score (99.07%), emphasizing its effectiveness in fake news classification. Figure 10 provides a visual representation of the comparison results, offering a comprehensive overview of each model's performance. These findings underscore the effectiveness of TF-IDF feature extraction and feature engineering techniques in enhancing classification accuracy and overall model performance. In summary, all models demonstrate strong capabilities in identifying fake news, with the decision tree model achieving the highest accuracy. Naïve Bayes, logistic regression, and random forest models also exhibit excellent performance, while SVM delivers satisfactory results. These outcomes emphasize the efficacy of the employed techniques in detecting fake news and provide valuable insights into the strengths of different classification models.
The confusion matrix of each ML algorithm is visualized in Figure 11. It depicts the number of instances for each class in the testing set. Accuracy was utilized alongside the F1-score, precision, and recall as they pertain to the classification of the classes.
Fig. 12: Confusion matrix of final ML model (a)NB (b)DT (c) SVM (d)RF (e) LR
We also constructed the ML models, as shown in Figure 11. A confusion matrix is a table that provides an overview of the performance of supervised algorithms. The entries (a) NB, (b) DT, (c) SVM, (d) RF, and (e) LR indicate the models used, and they show that the models made some incorrect classifications. Among the models, the DT model achieved the highest accuracy of 99.68%, followed by NB with 99.55% and RF with 99.03%. Additionally, LR obtained a score of 98.73%, and SVM achieved 94.86%, as shown in Figure 7. These results were part of the evaluation process, which included assessing accuracy, precision, recall, F1-score, and the confusion matrix to evaluate each model's performance. Python was chosen for implementing the ML models due to its extensive libraries and high efficiency.
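A binary confusion matrix of this kind can be tabulated directly from actual and predicted labels; a small pure-Python sketch with invented labels (1 = fake, 0 = real):

```python
def confusion_matrix_2x2(y_true, y_pred, positive=1):
    # Rows: actual class; columns: predicted class -> [[TN, FP], [FN, TP]]
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return [[tn, fp], [fn, tp]]

# Toy labels for illustration (not the study's data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix_2x2(y_true, y_pred)  # -> [[3, 1], [1, 3]]
```

The off-diagonal cells are exactly the "incorrect classifications" visible in the plotted matrices: FP counts real articles flagged as fake, FN counts fake articles that slipped through.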
Following the feature extraction process depicted in Figures 13 and 14, the task was to identify unreliable or real news articles by calculating the probability of being real or fake based on specific criteria. The models were trained using a dataset to estimate the probability values and analyze the dataset using three different methods. This allowed for determining which model was more accurate in classifying the news articles.
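This train-then-score-probabilities workflow can be sketched with scikit-learn, which a Python implementation of TF-IDF plus these classifiers plausibly relies on, though the paper does not show its exact pipeline; the toy articles and labels below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented toy corpus: 1 = fake, 0 = real
texts = [
    "shocking secret cure doctors hate",
    "government confirms budget figures",
    "celebrity miracle diet exposed",
    "central bank publishes quarterly report",
]
labels = [1, 0, 1, 0]

# TF-IDF feature extraction, then a decision tree classifier
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Probability of an unseen article being real (column 0) or fake (column 1)
proba = clf.predict_proba(vectorizer.transform(["shocking miracle cure exposed"]))
```

The two probability columns sum to one per article, which is the "probability of being real or fake" the models use to label each item.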