
Final

This study presents a machine learning-based approach to detect fake reviews using Logistic Regression and TF-IDF vectorization, addressing the challenge of authenticity in online consumer feedback. The proposed system automates the detection process, improving accuracy and scalability compared to manual methods, and is applicable across various online platforms. The results demonstrate high precision and recall, making it a viable solution for enhancing trust in digital marketplaces.

Uploaded by

Archana suresh

Monitoring Review of Product: Ensuring Authenticity in Consumer Feedback

Elakkiya U, Assistant Professor, Sri Ramakrishna Institute of Technology, Coimbatore, India, [email protected]
Swathy M, Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, India, [email protected]
Kaviyasri R, Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, India, [email protected]
Shakthi R, Information Technology, Sri Ramakrishna Institute of Technology, Coimbatore, India, [email protected]

Abstract: Online reviews play a crucial role in shaping consumer decisions and establishing brand credibility. However, the increasing presence of fake reviews, aimed at artificially inflating or deflating product ratings, threatens the authenticity of online feedback. This study proposes a machine learning-based approach for detecting fake reviews using Logistic Regression in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The methodology involves data preprocessing, feature extraction, model training, evaluation, and real-time classification, ensuring the system effectively distinguishes between genuine and deceptive reviews. The TF-IDF technique is used to transform textual data into numerical features, which are then analysed by the Logistic Regression model to identify patterns indicative of fraudulent reviews. The model is trained on a labelled dataset consisting of both verified and fake reviews, and its performance is evaluated using metrics such as accuracy, precision, recall, and F1-score. Our experimental results indicate that Logistic Regression achieves high precision and recall, making it a viable solution for detecting fake reviews on e-commerce platforms, mobile app stores, and travel websites. The proposed system not only enhances trust in digital marketplaces but also contributes to the broader field of fraud detection and NLP-based classification models.

Keywords: Computer Generated Review Detection, Machine Learning, Logistic Regression, TF-IDF, Natural Language Processing, Fraud Detection.

I. INTRODUCTION
In today’s digital age, online reviews have become a critical factor in consumer decision-making. Before purchasing a product or service, buyers often rely on user-generated reviews to assess the quality, reliability, and effectiveness of an item. Online marketplaces such as Amazon, Flipkart, Yelp, and eBay provide a platform for customers to express their opinions, which in turn influence other buyers and impact business sales. However, not all reviews are genuine. The rise of fake reviews has become a serious issue in e-commerce, as businesses, competitors, and unethical marketers often manipulate online ratings for their advantage. Fake reviews are typically posted either to boost sales by generating false positive feedback for a product or to damage competitors by leaving negative, misleading reviews. These fraudulent reviews mislead customers into purchasing low-quality products or avoiding legitimate businesses, causing financial losses and eroding trust in e-commerce platforms.

A. Historical Background
The presence of fake reviews has increasingly compromised the credibility of online marketplaces. Initially, review monitoring was conducted manually, relying on moderators and rule-based filtering techniques to identify fraudulent content. However, these methods were inefficient, prone to human error, and unable to scale with the vast number of user-generated reviews. With the rise of machine learning (ML) and natural language processing (NLP), automated review detection systems have been developed. Early techniques, such as keyword-based filtering and sentiment analysis, struggled to differentiate between genuine and deceptive reviews. More recent systems use TF-IDF vectorization to extract meaningful text features and enhance classification models. This shift towards ML-based automation has significantly improved scalability, precision, and real-time fraud detection in online platforms.

B. Problem Statement
Online reviews have become a powerful tool for consumers to evaluate products and services. Websites like Amazon, eBay, Flipkart, TripAdvisor, and Yelp rely on user-generated feedback to build credibility, enhance customer trust, and influence purchasing decisions. However, with the rise of artificial intelligence (AI) and text-generation models, the authenticity of online reviews is under serious threat. The widespread presence of fake reviews on e-commerce platforms has made it difficult for customers to make informed purchasing decisions. While some platforms manually remove fake reviews, this approach is not scalable and cannot handle the vast number of reviews posted daily. The increasing volume of online reviews makes manual identification impractical, necessitating automated fraud detection mechanisms. This study introduces an ML-based fake review detection model using Logistic Regression and TF-IDF vectorization. The proposed approach aims to enhance classification accuracy, scalability, and adaptability to combat fraudulent review practices.

C. Application
The proposed fake review detection system has several practical applications, including:
• E-commerce Platforms: Identifying fake product reviews on sites like Amazon and Flipkart.
• App Stores: Filtering misleading ratings on Google Play and the Apple App Store.
• Consumer Awareness: Helping users make informed purchasing decisions based on authentic feedback.
D. Scope of the Project
This project aims to develop an automated fake review detection system that classifies reviews as genuine or fraudulent based on linguistic patterns, metadata, and behavioural indicators. By leveraging Logistic Regression and TF-IDF vectorization, the system enhances the accuracy of fraudulent content detection in various online platforms. The model is designed to be scalable and adaptable, ensuring effective deployment across different sectors, including e-commerce, travel, and app marketplaces. The system integrates real-time data processing, enabling continuous monitoring and classification of new reviews as they are posted. This approach not only improves fraud detection but also minimizes the impact of fake reviews on consumer decision-making. Furthermore, the project incorporates data visualization tools, such as confusion matrices and bar charts, to provide clear insights into the model’s performance. Future enhancements include deep learning models (LSTM), multilingual review analysis, and user behavioural tracking, which will further improve the system’s efficiency and accuracy. By implementing this solution, businesses can maintain transparency, protect consumer trust, and ensure fair competition in digital marketplaces. Additionally, regulatory bodies can use this system to monitor and take action against deceptive marketing practices.

E. Existing System
Currently, fake review detection in online platforms primarily relies on manual moderation and rule-based filtering. While some platforms implement basic keyword detection and sentiment analysis, these methods struggle with identifying sophisticated fraudulent content. Additionally, user reports and flagging mechanisms are often employed to detect suspicious reviews, but this approach depends on subjective human intervention and can be inconsistent. Some existing machine learning-based models attempt to classify fake reviews; however, they often suffer from limited feature extraction, high false positive rates, and difficulty generalizing across different review platforms. Many models rely solely on textual analysis, ignoring metadata and behavioural patterns, which are crucial in detecting fraudulent activity. Furthermore, current systems lack real-time processing capabilities, leading to delays in identifying and removing deceptive reviews. The absence of scalable and adaptable automated solutions presents a significant challenge for e-commerce, travel, and service-based platforms, highlighting the need for a more robust and comprehensive detection approach [1].

II. METHODOLOGY
A. Data Collection and Data Description
In the digital marketplace, product reviews play a significant role in shaping consumer decisions and influencing brand reputations. However, the increasing presence of fake reviews has made it difficult for consumers to trust online feedback. To address this challenge, this project aims to detect fake reviews using Logistic Regression, leveraging a dataset of genuine and computer-generated reviews. The dataset used in this project consists of product reviews sourced from publicly available e-commerce databases, particularly from Kaggle’s Amazon Review Dataset. It includes both genuine reviews written by real users and fake reviews generated by bots or manipulated through fraudulent means. The dataset is labelled into two categories to facilitate supervised learning:
• Original Review (OR) (Label: 0) – These are verified, authentic product reviews written by real users.
• Computer-Generated Review (CG) (Label: 1) – These are artificially created or misleading reviews designed to manipulate product ratings.
For model training and evaluation, the dataset was divided into two subsets:
• Training Set (80%) – Used to train the machine learning models and learn textual patterns.
• Test Set (20%) – Evaluates the final model's performance on unseen data.
Additionally, the dataset underwent preprocessing steps such as text cleaning, stop word removal, lemmatization, and TF-IDF vectorization to enhance classification accuracy. These steps ensure the system can efficiently detect and filter deceptive reviews in real-world applications.

B. Data Preprocessing
Data preprocessing is essential to standardize textual input and ensure consistency across the dataset. In this project, review text was cleaned, tokenized, and transformed into a structured format suitable for machine learning models. The key preprocessing steps include:
• Text Cleaning – Removal of special characters, punctuation, and HTML tags to reduce noise in the dataset.
• Lowercasing – Converting all text to lowercase to maintain uniformity and avoid duplicate representations of words.
• Stop word Removal – Eliminating common words (e.g., "the," "and," "is") that do not contribute to sentiment or meaning.
• Lemmatization – Converting words to their base forms to standardize variations (e.g., "running" → "run").
• Vectorization – Converting processed text into numerical format using TF-IDF vectorization, ensuring effective representation for classification models.

TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to determine the importance of words in a document. The TF-IDF score for a word t in document d is given by:

TF-IDF(t, d) = TF(t, d) × IDF(t) (1)

where:
• Term Frequency (TF): Measures how often a word appears in a document.

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d) (2)

• Inverse Document Frequency (IDF): Measures how unique a word is across all documents.
IDF(t) = log(N / (n_t + 1)) (3)

where N is the total number of documents and n_t is the number of documents containing term t. The +1 prevents division by zero. The TF-IDF score represents the importance of a word in distinguishing real vs. Computer Generated reviews.

C. Model Development
The core of this project is the Logistic Regression model, a widely used algorithm for binary classification. Logistic Regression is chosen due to its efficiency, interpretability, and robustness in distinguishing between real and fake reviews. The model architecture consists of:
• Input Layer – Accepts pre-processed textual data represented as TF-IDF vectors.
• Feature Extractor – The TF-IDF vectorizer extracts significant textual patterns from the reviews.
• Logistic Regression Classifier – A probabilistic model that predicts the likelihood of a review being fake or genuine.
• Threshold-Based Classification – A review is classified as fake (CG) if the predicted probability is at least 0.5; otherwise, it is labelled genuine (OR).
The model was trained using Scikit-learn’s implementation of Logistic Regression, with a maximum iteration count of 500 to ensure convergence. Training was conducted on a dataset split into 80% training data and 20% testing data, enabling reliable performance evaluation.

Fig. 1. Model Architecture

Logistic Regression Formula
The probability of a review being fake (CG) is modelled as:

P(Y = 1 | X) = 1 / (1 + e^(-(β₀ + Σ βᵢxᵢ))) (4)

where:
- P(Y = 1 | X) is the probability that the review is fake.
- β₀ is the intercept.
- βᵢ are the coefficients learned during training.
- xᵢ are the TF-IDF features for each word.
- e is the base of the natural logarithm (Euler’s number, approximately 2.718).

Decision Threshold
The model applies a threshold of 0.5 for classification:
• If P(Y = 1 | X) ≥ 0.5, the review is classified as Fake.
• If P(Y = 1 | X) < 0.5, the review is classified as Original.

D. Evaluation Metrics
The model’s performance was evaluated using various metrics to assess its classification capabilities. Metrics like accuracy, precision, recall, F1-score, and the confusion matrix were employed to provide a comprehensive evaluation. Evaluation metrics include:
• Accuracy – Measures overall classification correctness.
• Precision – Evaluates the accuracy of positive predictions.
• Recall – Assesses the model’s ability to identify positive cases.
• F1-Score – Balances precision and recall for an overall effectiveness measure.
• Confusion Matrix – Visualizes the distribution of correct and incorrect classifications across classes.
These metrics offer insights into the model’s strengths and areas requiring improvement.

III. RESULT AND DISCUSSION
A. Classification Report
A classification report is a performance evaluation summary used to assess the effectiveness of a classification model. It reports key metrics such as precision, recall, F1-score, and support for each class in a multi-class or binary classification problem. This report helps in understanding how well the model performs in predicting each class, highlighting both strengths and weaknesses.

Fig. 2. Classification Report

• Accuracy – Accuracy measures the overall correctness of the model by calculating the proportion of correctly predicted instances (both positives and negatives) out of the total number of predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
• Precision – It measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It focuses on how accurate the model’s positive predictions are, making it crucial in situations where false positives are costly.

Precision = TP / (TP + FP) (6)

• Recall – It measures the proportion of actual positive instances that were correctly identified by the model. It emphasizes how well the model captures all positive cases, which is vital in cases like medical diagnosis where missing a positive case (false negative) can be dangerous.

Recall = TP / (TP + FN) (7)

• F1 Score – It is the harmonic mean of precision and recall, providing a balance between the two metrics. It is especially useful when the data is imbalanced or when both false positives and false negatives carry significant costs.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall) (8)

B. Confusion Matrix
A confusion matrix is a performance measurement tool for classification problems, providing a detailed breakdown of how well the model’s predictions align with the true class labels.

Fig. 3. Confusion Matrix

It presents the results in a tabular format. Taking the class in the first row and first column as the positive class:
• Rows represent the actual classes of the data.
• Columns represent the predicted classes by the model.
• True Positive (TP): Element of the first row and first column in the confusion matrix.
• False Positive (FP): Sum of the elements of the first column, excluding TP.
• False Negative (FN): Sum of the elements of the first row, excluding TP.
• True Negative (TN): Sum of all elements of the matrix minus TP, FP, and FN.

IV. CONCLUSION AND FUTURE SCOPE
A. Conclusion
This project successfully developed an automated fake review detection system capable of identifying computer-generated (CG) fake reviews using machine learning techniques. By leveraging Natural Language Processing (NLP) and classification models, the system effectively distinguishes AI-generated fake reviews from genuine user reviews, addressing the growing issue of automated review manipulation on e-commerce platforms. The preprocessing pipeline ensured that text data was cleaned, standardized, and transformed into numerical vectors, improving the model’s ability to recognize linguistic patterns unique to fake reviews. TF-IDF vectorization was employed to extract key distinguishing words between real and fake reviews, while Logistic Regression served as a lightweight and efficient classifier for binary classification. To validate the model’s effectiveness, rigorous evaluation was conducted using classification metrics such as accuracy, precision, recall, and F1-score. A confusion matrix analysis demonstrated that the model achieves high sensitivity and specificity, minimizing false positives, where genuine reviews are misclassified as fake, and false negatives, where fake reviews are misclassified as real. The model’s performance confirmed its ability to generalize well across unseen data, making it a reliable tool for detecting AI-generated fake reviews. The system was designed to be computationally efficient, ensuring quick response times without the need for extensive computing resources. This makes it a practical solution for businesses, e-commerce platforms, and consumers seeking to verify the authenticity of online reviews. This project demonstrates the effectiveness of machine learning in detecting computer-generated fake reviews. By automating the review authentication process, the system provides a scalable and practical solution to combat fraudulent review practices, ensuring trust and transparency in online marketplaces.

B. Future Work
Future improvements to this project can focus on enhancing detection accuracy, scalability, and adaptability to evolving AI-generated fake reviews. One key improvement is integrating Long Short-Term Memory (LSTM) networks, which are deep learning models capable of capturing long-term word dependencies and understanding contextual patterns more effectively than traditional machine learning models. This can help the system better distinguish between naturally written user reviews and AI-generated text. Additionally, incorporating multi-language support will make the system more effective for global e-commerce platforms. Deploying the model as a cloud-based API can enable real-time processing, allowing businesses to integrate the fake review detection system directly into their platforms for automated review moderation. Including behavioural and metadata analysis, such as identifying suspicious posting patterns or analysing reviewer credibility, can further enhance detection capabilities. Another important development is the creation of a mobile application or browser extension, allowing consumers to check the authenticity of reviews before making purchasing decisions. These improvements will make the system more robust, scalable, and efficient in combating AI-generated fake reviews across different online platforms.
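
As an illustration of Equations (1) to (8), the following minimal Python sketch evaluates them on toy inputs using only the standard library. It is not the authors' implementation, which uses Scikit-learn. The corpus, the weight, the intercept, and the confusion-matrix counts are hypothetical values chosen for the example, and the function names are assumptions introduced here.

```python
import math

def tf(term, doc_tokens):
    # Eq. (2): occurrences of the term divided by the document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Eq. (3): log(N / (n_t + 1)), where n_t counts documents containing the term.
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (n_t + 1))

def tf_idf(term, doc_tokens, corpus):
    # Eq. (1): product of term frequency and inverse document frequency.
    return tf(term, doc_tokens) * idf(term, corpus)

def predict_proba(features, weights, intercept):
    # Eq. (4): logistic function applied to the linear score.
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    # Decision threshold: CG (fake) if p >= 0.5, otherwise OR (original).
    return "CG" if p >= threshold else "OR"

def metrics(tp, tn, fp, fn):
    # Eqs. (5) to (8); assumes the denominators are non-zero.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical tokenized reviews (already cleaned and lowercased).
corpus = [
    ["great", "product", "great", "price"],
    ["terrible", "product"],
    ["fast", "delivery"],
    ["love", "it"],
]

# TF = 2/4 and IDF = log(4 / (1 + 1)), so the score is 0.5 * log(2).
score = tf_idf("great", corpus[0], corpus)

# Hypothetical single-feature model: one weight and an intercept.
p = predict_proba([score], weights=[2.0], intercept=-0.5)
label = classify(p)

# Hypothetical confusion-matrix counts with CG as the positive class.
accuracy, precision, recall, f1 = metrics(tp=40, tn=45, fp=5, fn=10)
```

With Equation (3) as written, a term that occurs in every document, or in all but one, receives a zero or negative IDF, so library implementations that use a different smoothing will produce different scores from this sketch.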
ACKNOWLEDGEMENT
We express our sincere gratitude to Sri Ramakrishna Institute of
Technology for providing us with the opportunity and necessary
resources to carry out this project on Monitoring Review of
Product: Ensuring Authenticity in Consumer Feedback. We
extend our heartfelt thanks to our Principal, Dr. J. David
Rathnaraj, for his constant encouragement and support. We are
deeply indebted to Dr. J. J. Adri Jovin, Head of the Department
of Information Technology, for his invaluable guidance and
motivation throughout the project. Our sincere appreciation goes
to our Project Coordinator, Dr. T. C. Ezhil Selvan, for his
continuous support and insightful suggestions. A special thanks
to our Project Supervisor, Ms. U. Elakkiya, for her expert advice,
constructive feedback, and encouragement, which helped shape
our research work efficiently. We would also like to express our
gratitude to all faculty members of the Department of
Information Technology for their technical support and
knowledge-sharing during the course of this project. Finally, we
extend our heartfelt thanks to our parents, friends, and well-
wishers, whose unwavering support and encouragement played
a crucial role in the successful completion of this project.

REFERENCES
[1] Wang, J., Tang, S., & Zhang, H. Fake Review Detection Based on
Multiple Feature Fusion and Rolling Collaborative Training. IEEE
Access, Vol. 15, Issue 8, pp. 2075–2089, 2020.
[2] Tang, S., Liu, Y., & Zhang, J. Fraud Detection in Online Product
Review Systems via Heterogeneous Graph Transformer. IEEE Access,
Vol. 32, Issue 3, pp. 1234–1245, 2021.
[3] Mohawesh, R., Mulyana, T., & Abduallah, M. A Survey of Fake
Review Detection Techniques in E-Commerce. Journal of Electronic
Commerce Research, Vol. 22, Issue 4, pp. 45–61, 2021.
[4] Tufail, H., H. Fake Reviews Detection in Online Shopping Platforms
During the COVID-19 Era. Journal of Computer Science and
Technology, Vol. 39, Issue 11, pp. 302–315, 2024.
[5] Liu, M., Zhang, L., & Wu, Q. Detecting Fake Reviews Using
Multidimensional Representations with Fine-Grained Aspects Plan.
IEEE Transactions on Neural Networks and Learning Systems, Vol. 33,
Issue 9, pp. 2765–2779, 2021.
[6] Tufail, H. The Effect of Fake Reviews on e-Commerce During and
After Covid-19 Pandemic: SKL-Based Fake Reviews Detection. IEEE
Access, Vol. 10, pp. 25555–25564, 2022.
[7] Abulqader, M., Sadeghi, M., & Abdullah, S. Unified Fake Review
Detection Using Deception Theories. Springer Journal of Artificial
Intelligence Research, Vol. 70, Issue 5, pp. 599–613, 2022.
[8] Ennaouri, M., Benazzouz, A., & Boudraa, R. A Comprehensive Review
of Sentiment Analysis Techniques in Fake Review Detection. Springer
Journal of Computational Intelligence, Vol. 21, Issue 4, pp. 1482–1495,
2023.
[9] Zhang, S., Wei, Z., & Li, K.-C. Building Fake Review Detection Model
Based on Sentiment Intensity and PU Learning. IEEE Transactions on
Neural Networks and Learning Systems, Vol. 34, Issue 10, pp. 6926–
6939, 2023.
[10] Abedin, E., & Boudraa, R. Understanding the Credibility of Online
Drug Reviews. Springer Journal of Health Informatics, Vol. 25, Issue
6, pp. 1324–1342, 2024.
