ML Project Report
A PROJECT REPORT
Submitted by
BONAFIDE CERTIFICATE
This is to certify that this is the bonafide record of work done by Aaryan Pathak [Reg
No: RA2211003030257] of the 5th semester, 3rd year B.TECH degree course in SRM
INSTITUTE OF SCIENCE AND TECHNOLOGY, NCR Campus, Department of
Computer Science & Engineering, in the field of Machine Learning, during the
academic year 2024-2025.
SIGNATURE SIGNATURE
BONAFIDE CERTIFICATE
This is to certify that this is the bonafide record of work done by Aaditya Srivastava
[Reg No: RA2211003030266] of the 5th semester, 3rd year B.TECH degree course in
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY, NCR Campus, Department of
Computer Science & Engineering, in the field of Machine Learning, during the
academic year 2024-2025.
SIGNATURE SIGNATURE
BONAFIDE CERTIFICATE
This is to certify that this is the bonafide record of work done by Aman Mathur [Reg
No: RA2211003030273] of the 5th semester, 3rd year B.TECH degree course in SRM
INSTITUTE OF SCIENCE AND TECHNOLOGY, NCR Campus, Department of
Computer Science & Engineering, in the field of Machine Learning, during the
academic year 2024-2025.
SIGNATURE SIGNATURE
1 Abstract
2 Introduction
4 Technologies used
5 Architectural Diagram
6 Methodology
7 Related Work
8 Future Work
9 Conclusion
10 References
Abstract
This sentiment analysis project applies sophisticated data processing and natural
language processing (NLP) methodologies to clean and prepare extensive text data for
accurate sentiment classification. Utilizing TF-IDF (Term Frequency-Inverse Document
Frequency) vectorization, raw text data is transformed into a structured, numerical format
that captures the importance of words across the dataset, making it suitable for input into
machine learning algorithms. At the core of the classification process is a Naive Bayes
classifier, chosen for its efficiency in text-based sentiment categorization. This algorithm
excels in distinguishing between positive and negative sentiments, leveraging
probability-based assessments to offer reliable predictions. To enhance user engagement
and accessibility, the project incorporates a user-friendly interface built with Streamlit,
which facilitates intuitive input, real-time analysis, and visualized output of sentiment
predictions. Additionally, SQLite is employed as the underlying database solution,
providing robust data storage capabilities that handle large datasets efficiently and
ensuring seamless retrieval and logging of user interactions. Upon inputting text, the
system preprocesses it by removing noise, tokenizing, and applying vectorization,
followed by sentiment prediction. This pipeline delivers a fast, visually clear sentiment
result for users, offering immediate insights into the emotional tone of the provided text.
Throughout the development process, several challenges, including the handling of
extensive datasets, optimizing processing times, and integrating machine learning with a
responsive interface, were systematically addressed. Fine-tuning model parameters and
refining data preparation methods were pivotal steps that contributed to the overall
accuracy and usability of the platform. The final product serves as a comprehensive
sentiment analysis tool, enabling users to explore sentiment in text data interactively and
efficiently.
Introduction
The rise of user-generated content, especially on social media platforms like Twitter,
Facebook, and product review websites, has increased the demand for sentiment analysis.
By automating this process, businesses can scale the monitoring and understanding of
public opinion, helping them make data-driven decisions. From customer feedback to
brand reputation, sentiment analysis transforms raw textual data into actionable insights.
In the case of sentiment analysis, classifying text can be tricky due to nuances such as
sarcasm, mixed sentiments, and context-dependence. Simpler methods such as Naive
Bayes classifiers are often fast and effective for basic sentiment classification tasks but
may not capture the full complexity of human expression compared to more sophisticated
models like BERT or LSTMs. Still, Naive Bayes provides a good balance of simplicity,
computational efficiency, and effectiveness for many practical applications, particularly
when rapid, real-time feedback is needed.
Naive Bayes is a probabilistic classifier that applies Bayes’ Theorem with the “naive”
assumption that the features (in this case, words) are conditionally independent given the
target label (positive or negative sentiment). Despite this simplifying assumption, Naive
Bayes performs remarkably well in text classification tasks, particularly because it can
handle the high dimensionality of text data efficiently.
The variant of Naive Bayes used in this project is Multinomial Naive Bayes, which is
particularly suited to text classification tasks where the features (words) represent
frequencies or counts.
In this case, the text is converted into a feature vector representing the frequency of each
word, and the Naive Bayes classifier predicts the class (positive or negative) based on the
probabilities learned during training.
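The prediction rule described above can be written compactly. The notation below is a sketch introduced here for illustration: $c$ is a sentiment class, $w_1, \dots, w_n$ are the words of the input text, $\mathrm{count}(w, c)$ is the total weight of word $w$ in the training documents of class $c$, $|V|$ is the vocabulary size, and $\alpha$ is a smoothing constant. The report does not state the smoothing value used; $\alpha = 1$ (Laplace smoothing) is a common default.

```latex
\hat{c} = \arg\max_{c \in \{\text{pos},\,\text{neg}\}}
  \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}
                   {\sum_{w' \in V} \mathrm{count}(w', c) + \alpha\,|V|}
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long texts, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.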
Text data, in its raw form, cannot be directly fed into a machine learning model.
Therefore, a numerical representation of text is required. TF-IDF (Term
Frequency-Inverse Document Frequency) is a popular text vectorization technique that
transforms textual data into numerical vectors while also capturing the importance of
each word relative to the document.
● Term Frequency (TF) measures how often a word appears in a document, i.e., the
frequency of the word divided by the total number of words in the document.
● Inverse Document Frequency (IDF) downscales words that appear frequently
across all documents, giving more weight to words that are unique to a particular
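document. In its standard form it is calculated as $\mathrm{IDF}(t) = \log(N / n_t)$,
where $N$ is the total number of documents and $n_t$ is the number of documents that
contain the term $t$.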
TF-IDF Score is the product of these two values. TF-IDF vectorization ensures that
commonly used words like "the," "is," and "and" do not dominate the sentiment analysis
model, while more meaningful words that appear infrequently but are important to
sentiment (such as "excellent" or "terrible") are weighted more heavily.
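The weighting described above can be sketched in a few lines of plain Python. This is an illustrative implementation of the textbook definition (TF as a relative frequency, IDF as the logarithm of total documents over documents containing the term). The function name `tf_idf` and the three sample sentences are invented for this example and are not taken from the project's code or dataset.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # TF (relative frequency) times IDF (log of inverse document share).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, invented for illustration.
docs = [
    "the food is excellent".split(),
    "the service is terrible".split(),
    "the food is terrible".split(),
]
scores = tf_idf(docs)
```

Here "the" and "is" occur in every document, so their IDF is log(1) = 0 and they receive no weight, while "excellent" occurs in only one document and receives the highest weight in the first sentence. Library implementations may differ in detail: scikit-learn's `TfidfVectorizer`, for example, uses a smoothed IDF by default and L2-normalizes each vector.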
The workflow for classifying text sentiment using the Naive Bayes algorithm typically
follows these steps:
1. Preprocessing the text: Input text is cleaned to remove punctuation, numbers, and
other non-informative elements. It is then tokenized (split into individual words or
tokens), and commonly used words (stopwords) are removed to focus on
meaningful content.
2. Vectorizing the text: The cleaned tokens are converted into a numerical feature
vector using TF-IDF. The logic behind TF-IDF is that words that appear frequently
in a document but rarely across the entire dataset carry more weight, as they
likely represent key topics or sentiments specific to that document. Conversely,
words that appear in most documents, such as common articles and prepositions, are
given lower weight since they contribute little to differentiating the sentiment.
3. Training the model: The Naive Bayes classifier is trained on a labeled dataset
where each text sample is already tagged as either positive or negative. During
training, the model learns the probabilities associated with each word given the
sentiment label. For example, the word “excellent” may have a high probability of
being associated with positive sentiment, whereas “terrible” may have a high
probability of indicating negative sentiment.
4. Predicting sentiment: Once trained, the model can take a new, unseen piece of
text, apply the same preprocessing and TF-IDF transformation, and compute the
likelihood of the text being positive or negative. Based on these probabilities, the
model assigns a sentiment label to the text.
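The training and prediction steps above can be sketched from scratch to show what the classifier actually computes. This is a simplified illustration, not the project's implementation: it uses raw word counts rather than TF-IDF weights to keep the arithmetic easy to follow, it assumes Laplace smoothing with `alpha=1.0`, and the names `train`, `predict`, and the four training samples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples, alpha=1.0):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n": len(samples)}

def predict(model, tokens):
    """Return the label with the highest log-posterior for the tokens."""
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        # Smoothed denominator shared by every word of this class.
        denom = sum(counts.values()) + model["alpha"] * len(model["vocab"])
        score = math.log(n_docs / model["n"])  # log prior
        for w in tokens:
            if w in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[w] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled data, invented for illustration.
training = [
    ("excellent food great service".split(), "positive"),
    ("great experience excellent staff".split(), "positive"),
    ("terrible food awful service".split(), "negative"),
    ("awful experience terrible staff".split(), "negative"),
]
model = train(training)
label = predict(model, "excellent staff".split())
```

In this toy data "staff" is equally frequent in both classes, so the decision rests on "excellent", which was seen only in positive samples.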
The Naive Bayes algorithm has several advantages in the context of this project. Its
simplicity makes it computationally efficient, even with large amounts of data.
Additionally, its performance, while based on a relatively simple statistical model, is
competitive with more complex algorithms for many text classification tasks.
However, one limitation of Naive Bayes is that it treats all features as conditionally
independent given the class, an assumption that rarely holds for words in human
language. Despite this, the classifier’s ability to generalize well from small datasets,
coupled with its speed and effectiveness, makes it an ideal choice for building a
sentiment analysis tool where real-time feedback and ease of use are critical.
In summary, the combination of the Naive Bayes classifier with TF-IDF feature
extraction offers a powerful yet simple approach to sentiment classification, making this
web application an accessible tool for users looking to analyze and understand the
emotional tone of textual data.
2.2.4 Pseudocode
- Predict positive or negative sentiment using the trained Naive Bayes classifier
2.2.5 Example
To better illustrate how the algorithm works, consider the following example:
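A minimal sketch of such an example, assuming scikit-learn's `TfidfVectorizer` and `MultinomialNB` chained with `make_pipeline`. The four training sentences and the test sentence are invented for illustration and are not the project's dataset; the pipeline arrangement is likewise an assumption about how the pieces fit together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data, invented for illustration.
texts = [
    "excellent product, works great",
    "great value and excellent support",
    "terrible product, broke quickly",
    "awful support and terrible value",
]
labels = ["positive", "positive", "negative", "negative"]

# The vectorizer lowercases, tokenizes, drops English stopwords, and applies
# TF-IDF weighting; the classifier is trained on the resulting vectors.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# New text goes through the same fitted transformation before prediction.
prediction = model.predict(["the support was excellent"])[0]
```

The word "support" appears in both a positive and a negative training sentence and so carries little evidence either way, while "excellent" appears only in positive sentences and tips the prediction toward the positive class.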
3.1 Python
● Ease of Use: Python’s simple syntax allows rapid prototyping and clear, readable
code.
● Extensive Libraries: Python offers a variety of powerful libraries for data
preprocessing, machine learning, and visualization, which are essential for the
development of machine learning projects.
● Strong Community Support: The Python ecosystem is supported by a vibrant
community, providing extensive resources, tutorials, and documentation.
3.2 Scikit-learn
Scikit-learn is one of the most widely used machine learning libraries in Python, and it
plays a central role in the implementation of the Naive Bayes classifier for this project.
Scikit-learn provides an easy-to-use API for various machine learning algorithms and
preprocessing tools. The following Scikit-learn components were utilized:
Pandas simplifies many aspects of data handling, making it a key technology for the data
preprocessing steps of this project.
Architectural Diagram
The diagram illustrates the system architecture for the sentiment analysis model,
showcasing the flow of data and interactions between different components of the
system. Each part of the system is designed to ensure a smooth user experience, efficient
processing, and accurate sentiment prediction. Below, we provide a detailed explanation
of each component and its role in the overall architecture.
● Guest/Sign Up: New users can sign up or log in as guests to access the sentiment
analysis system. This ensures that user activity can be tracked and personalized.
The system verifies the user's credentials to ensure secure access.
● User Authentication: For registered users, an authentication module checks the
validity of their credentials. If invalid credentials are provided, the system denies
access and requests the user to try again.
● Profile Management: Once a user successfully logs in, they have access to profile
management. Here, users can update their personal information, view past
sentiment analysis results, and manage other preferences. This module provides a
personalized experience to the users by maintaining their history.
● Input Handling: After the user logs in or signs up, they provide the input text that
they wish to analyze for sentiment. This module is responsible for ensuring that the
input text is valid (non-empty, correctly formatted) and passes it to the
preprocessing stage.
If invalid input is detected, the system will notify the user to revise the input before
proceeding.
● Preprocessing & Data Preparation: The raw text input is processed and cleaned.
This involves tokenization, stopword removal, lemmatization, and converting the
text into a suitable numeric format using TF-IDF (Term Frequency-Inverse
Document Frequency) vectorization.
In this phase, sentiment labels are mapped to numeric values, preparing the data for
the training or prediction phases. The cleaned and structured data is passed to the
sentiment model for analysis.
● Train Sentiment Model: This component is where the machine learning model is
trained. For this project, the Naive Bayes classifier is the core model used for
sentiment analysis. The cleaned data, with corresponding labels, is fed into the
model for training.
The system ensures that the model learns from both positive and negative
examples, allowing it to classify unseen input with a certain level of accuracy.
● Train Naive Bayes Classifier: The Naive Bayes classifier is trained on the
preprocessed data to learn patterns that distinguish between different sentiments
(positive or negative). Once trained, this classifier becomes the main tool used for
predicting the sentiment of future inputs.
The accuracy of the model is recorded, and adjustments to hyperparameters can be
made if needed to improve performance.
● Sentiment Prediction: Once the model is trained, it can predict the sentiment of
new text inputs. The user-provided text is passed through the trained classifier,
which outputs whether the sentiment is positive or negative. Additionally, the
classifier can report a confidence score (the predicted class probability) for its prediction.
The result (positive/negative sentiment) is displayed to the user, offering real-time
feedback based on the analysis.
● Store Results in SQLite: After the prediction, the result is saved in an SQLite
database. This ensures that users' past results can be stored and retrieved for future
reference. SQLite is a lightweight, serverless database that allows for efficient
storage and retrieval of the sentiment analysis history.
This component is critical for enabling future comparisons and visualizations
based on historical data.
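Three of the components described above (input handling, preprocessing, and result storage) can be sketched with the Python standard library alone. Everything named here is hypothetical and chosen for illustration: the `MAX_CHARS` limit, the short stopword set, the `results` table and its columns, and the function names. Lemmatization is omitted because it needs an external NLP library, and TF-IDF vectorization and prediction are not repeated.

```python
import re
import sqlite3

MAX_CHARS = 5000  # hypothetical limit, not taken from the report
STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "to", "of", "in", "this"}

def validate_input(text):
    """Return (cleaned_text, error); exactly one of the two is None."""
    if not isinstance(text, str):
        return None, "Input must be text."
    cleaned = text.strip()
    if not cleaned:
        return None, "Please enter some text to analyze."
    if len(cleaned) > MAX_CHARS:
        return None, f"Input is limited to {MAX_CHARS} characters."
    return cleaned, None

def preprocess(text):
    """Lowercase, tokenize on letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               username TEXT NOT NULL,
               input_text TEXT NOT NULL,
               sentiment TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def save_result(conn, username, text, sentiment):
    """Store one prediction; the with-block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO results (username, input_text, sentiment) VALUES (?, ?, ?)",
            (username, text, sentiment),
        )

def history(conn, username):
    """Return a user's past results in insertion order."""
    rows = conn.execute(
        "SELECT input_text, sentiment FROM results WHERE username = ? ORDER BY id",
        (username,),
    )
    return rows.fetchall()
```

Parameterized queries (the `?` placeholders) keep user-supplied text from being interpreted as SQL, which matters here because the stored value is free-form user input.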
4.8 Visualization and Reporting
● Visualize Previous Results: The results stored in the SQLite database can be
visualized in various formats, such as tables or bar charts. This module allows
users to view a graphical representation of their previous sentiment analysis
results, offering insights into trends and patterns over time.
Users can observe their results, compare multiple entries, and analyze the
performance of the sentiment analysis model based on the historical data.
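The aggregation behind such a chart can be sketched as follows. The row format, `(input_text, sentiment)` pairs, is an assumption about what the database query returns, and the plain-text rendering stands in for whatever charting component the application actually uses.

```python
from collections import Counter

def sentiment_counts(rows):
    """Count stored results per label; rows are (input_text, sentiment) pairs."""
    counts = Counter(sentiment for _text, sentiment in rows)
    # Always report both labels so a missing class shows up as zero.
    return {label: counts[label] for label in ("positive", "negative")}

def text_bar_chart(counts):
    """Render the counts as a plain-text bar chart, one line per label."""
    return "\n".join(f"{label:<9}{'#' * n} {n}" for label, n in counts.items())

# Sample history rows, invented for illustration.
rows = [
    ("great phone", "positive"),
    ("awful battery", "negative"),
    ("love it", "positive"),
]
counts = sentiment_counts(rows)
chart = text_bar_chart(counts)
```

The resulting dictionary of counts is the kind of input a bar-chart widget typically expects, so the same function could feed a graphical chart without changes.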
To evaluate the performance of our sentiment analysis model, we used the Naive Bayes
classifier, which was trained on a sample dataset consisting of text data labeled as either
positive or negative. Before training, the text was preprocessed by converting it to
lowercase, tokenizing it, and removing common stopwords. The data was then
transformed into numerical vectors using the TF-IDF (Term Frequency-Inverse
Document Frequency) technique, which measures the importance of words in relation to
the entire dataset.
We split the dataset into two subsets: 80% of the data was used for training the model,
and the remaining 20% was held out for testing. This split was done using the
train_test_split function from scikit-learn, which shuffles the data before dividing it
into training and testing sets; class proportions are preserved only when its stratify
argument is supplied.
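A minimal sketch of an 80/20 split with scikit-learn's `train_test_split`. The placeholder texts, the `random_state` value, and the use of `stratify` are assumptions made for illustration; the report does not state which arguments were passed.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples, 5 per class (invented for illustration).
texts = [f"sample text {i}" for i in range(10)]
labels = ["positive", "negative"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # hold out 20% for testing
    random_state=42,    # fixed seed so the split is reproducible
    stratify=labels,    # keep the class ratio the same in both subsets
)
```

Without `stratify`, the split is still random, but a small or imbalanced dataset can end up with a test set dominated by one class, which distorts per-class precision and recall.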
Accuracy, precision, recall, and F1-score are critical for understanding how well the model generalizes to unseen
data and handles real-world sentiment classification tasks.
5.1 Results
After training the Naive Bayes model on the 80% training set, we evaluated its
performance using the remaining 20% of the test data. The model achieved an accuracy
of 77%, meaning that it correctly predicted the sentiment of 77% of the test instances.
● Precision: 80% (for positive sentiment) and 75% (for negative sentiment)
● Recall: 73% (positive) and 82% (negative)
● F1-Score: 76% (positive) and 78% (negative)
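As a consistency check, the F1-score is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the reported precision and recall. The helper name `f1_from` is hypothetical; the input numbers are the ones listed above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, as reported in the results above.
reported = {"positive": (0.80, 0.73), "negative": (0.75, 0.82)}
f1 = {label: round(f1_from(p, r), 2) for label, (p, r) in reported.items()}
```

The recomputed values are 0.76 for the positive class and 0.78 for the negative class, which agree with the reported F1-scores.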
These results were summarized in a classification report, which breaks down the
performance of the model for each class (positive and negative). Overall, the Naive
Bayes classifier demonstrated satisfactory performance, especially in distinguishing
between clear-cut positive and negative sentiment examples.
In addition to these numerical metrics, we visualized the model’s predictions using key
visualizations:
1. Bar Chart: A comparison of the number of positive vs. negative predictions. This
chart highlights any bias the model may have in predicting one sentiment over the
other, which is important for maintaining balance in the predictions.
5.2 Discussion
The results of our evaluation indicate that the Naive Bayes classifier is effective for basic
sentiment classification tasks. The 77% accuracy, combined with good precision and
recall scores, suggests that the model can reliably predict positive or negative sentiment
for straightforward text inputs. This is expected, as the Naive Bayes algorithm is known
for its performance in text classification tasks due to its simplicity and scalability.
However, the model’s limitations become apparent when handling more complex inputs.
For instance, text containing sarcasm, irony, or mixed sentiments often confuses the
classifier. Sarcasm is particularly challenging because the literal meaning of words may
not reflect the actual sentiment of the sentence, leading to misclassification. Similarly,
inputs that contain both positive and negative sentiments in a single sentence can be
difficult for the model to categorize correctly.
While the TF-IDF vectorization provides a solid numerical representation of text data, it
does not capture more nuanced elements of language, such as context, tone, or sentiment
flow within a paragraph. As a result, future work might involve using more advanced
methods such as word embeddings (e.g., Word2Vec or BERT) or incorporating sentiment
lexicons that better handle complex text features.
In conclusion, while the Naive Bayes classifier provides a strong baseline for sentiment
analysis, its performance can be enhanced with more sophisticated algorithms or
techniques, particularly for handling intricate forms of language such as sarcasm or
mixed sentiments.
Related Work
Sentiment analysis, a core task in Natural Language Processing (NLP), has evolved
significantly over the years. In its earlier stages, sentiment classification primarily relied
on traditional machine learning techniques such as Naive Bayes, Support Vector
Machines (SVMs), and Logistic Regression. These models, when paired with basic text
vectorization methods like Bag of Words (BoW) and Term Frequency-Inverse
Document Frequency (TF-IDF), formed the foundation of sentiment analysis tasks.
BoW and TF-IDF represent text in a numerical format based on word frequencies and
importance, enabling these algorithms to perform classification by recognizing patterns
in word occurrences.
Naive Bayes, due to its simplicity and efficiency, has been a popular choice for text
classification tasks, including sentiment analysis. Its assumption of feature independence,
while not always true, simplifies computations and often leads to surprisingly robust
performance. Support Vector Machines (SVMs), another traditional method, have also
been widely used for sentiment classification, offering solid performance through margin
maximization and kernel tricks. While these approaches are fast and computationally
efficient, they struggle with capturing complex contextual relationships between words in
sentences.
Despite the effectiveness of more advanced models such as LSTMs and BERT, they come with notable trade-offs.
Deep learning-based approaches require large amounts of labeled training data, extensive
computational resources, and more time to train compared to simpler models. In contrast,
traditional models like Naive Bayes and SVMs are computationally inexpensive and
perform well on straightforward text classification tasks where speed and scalability are
essential. For real-time applications or systems operating under resource constraints,
simpler models may still be the better option.
In this project, we explored the use of the Naive Bayes classifier combined with TF-IDF
vectorization to implement a real-time sentiment analysis system. Despite its simplicity,
the Naive Bayes model demonstrated competitive performance in classifying text into
positive or negative sentiment categories. While it may not achieve the same level of
accuracy as modern deep learning models in more nuanced or context-rich settings, it
provides a practical solution for real-time applications with limited computational power
and data.
Our project highlights that although cutting-edge models like BERT and LSTMs
dominate the field in terms of performance, traditional machine learning models remain
valuable. Especially in situations where interpretability, speed, and low computational
overhead are prioritized, Naive Bayes and similar algorithms offer a solid balance of
performance and practicality. This demonstrates that sentiment analysis, even with
simpler models, can still achieve meaningful results, particularly in basic classification
tasks.
Future Work
There are several avenues for enhancing the current sentiment analysis system. One
promising direction is the integration of advanced models like BERT or other
transformer-based architectures. These models, known for their superior ability to capture
contextual relationships and nuances in language, would significantly improve sentiment
classification accuracy, particularly on more challenging datasets that include sarcasm,
mixed sentiments, or complex sentence structures. Incorporating BERT would enable the
system to handle subtler aspects of language, such as negations or implicit sentiments,
where simpler models like Naive Bayes tend to fall short.
Conclusion
While the results achieved with the Naive Bayes model are promising, they also
illuminate certain limitations, particularly in handling complex linguistic nuances and
varied sentiment expressions. As sentiment analysis often involves subjective
interpretation, the model's performance may degrade when faced with challenging inputs
such as sarcasm or mixed emotions. Recognizing these challenges opens up pathways for
future work aimed at enhancing the application's capabilities.
Looking ahead, integrating more sophisticated models like BERT or other deep learning
architectures will be pivotal in overcoming the current limitations. These models can
capture the intricacies of language far more effectively, allowing for better contextual
understanding and improved accuracy in sentiment classification. Additionally,
expanding the application to support multilingual text and a broader range of sentiment
categories will significantly enhance its usability and relevance in a diverse set of
contexts.
In conclusion, this project serves as a solid foundation for further exploration into
sentiment analysis through machine learning. By addressing the existing limitations and
embracing future improvements, we aim to develop a more robust, versatile, and
user-friendly sentiment analysis tool that can effectively meet the evolving needs of users
and organizations alike. The journey from simple sentiment classification to a
comprehensive analysis platform is not only an exciting challenge but also a valuable
contribution to the field of natural language processing.
References
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research, 12, 2825-2830.
● Documentation for the Scikit-learn library, which was used for implementing
Naive Bayes and TF-IDF transformations.
Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. arXiv preprint
arXiv:1810.04805.
● Explores advanced language models like BERT, which could be considered for
future work in improving sentiment analysis accuracy.
SRM Institute of Science and Technology. (2024). Machine Learning Course Outline.
Department of Computer Science & Engineering, NCR Campus.