FYP 1
Project Advisor:
Syed Zeeshan Ali
Submitted By
Declaration
We have read the project guidelines and we understand the meaning of academic dishonesty,
in particular plagiarism and collusion. We hereby declare that the work we submitted for our final
year project, entitled Tweets Spamming Analysis & Sentiment Analysis, is original work and
has not been printed, published, or submitted before as a final year project, research work,
publication, or any other documentation.
Group Member 1 Name: Bazed Gul
Signature: …………………………
Group Member 2 Name: Muhammad Farooq
Signature: …………………………
Group Member 3 Name: Taimoor Asghar
Signature: …………………………
Statement of Submission
This is to certify that Bazed Gul, Roll No. 7113413, Muhammad Farooq, Roll No. 70110650,
and Taimoor Asghar, Roll No. 70110725, have successfully submitted the final project entitled
Tweets Spamming Analysis & Sentiment Analysis at the Software Engineering Department,
The University of Lahore, Lahore, Pakistan, in partial fulfillment of the requirements for the degree
of BS in Software Engineering.
Signature: …………………………
Date: ………………………
Dedication
This project is dedicated to my father, who taught me that the best kind of knowledge to have is
that which is learned for its own sake. It is also dedicated to my mother, who taught me that
even the largest task can be accomplished if it is done one step at a time.
Acknowledgment
We extend our heartfelt gratitude to our project supervisor, Sir Syed Zeeshan, for his unwavering
support and guidance throughout this journey. His constant motivation was a catalyst for our
project's progress, always pushing us to strive for excellence. His insightful feedback provided
invaluable direction, shaping our work and refining our ideas. Beyond expert guidance, Sir Syed
Zeeshan's approachable nature and willingness to help whenever needed created a
comfortable environment where we could thrive. We are truly grateful for his dedication and his
commitment to our success.
Date:
Abstract
This project tackles the dual challenge of spam classification and sentiment analysis on social
media platforms, specifically focusing on Twitter. We develop a web-based application that
empowers users with valuable insights into the digital atmosphere of Twitter. Employing a
combination of data scraping techniques and Natural Language Processing (NLP) methods, the
application performs both spam classification and sentiment analysis on tweets.
Our project sits at the intersection of Artificial Intelligence and social media analysis: a
web-based application that addresses both spam and sentiment on Twitter. We apply Natural
Language Processing (NLP) with rule-based and machine-learning algorithms to identify and
filter unwanted content, and we use sentiment analysis to surface the emotions and opinions
expressed in tweets. Beyond that, the application includes a friendly, intelligent chatbot that
guides users through the ever-evolving landscape of Twitter trends and insights.
Technologies used
The project uses HTML, CSS, JavaScript, React.js, Python, and MongoDB, among other technologies.
List of Figures
Figure 3: ERD

Table of Contents
Declaration
Statement of Submission
Dedication
Acknowledgment
Abstract
1.1 Introduction
1.2 Purpose
1.3 Objective
2.1 Introduction
2.1.1 Purpose
2.1.2 Scope
2.1.3 Definitions, acronyms, and abbreviations
2.2.4 Constraints
Chapter 4: Design
References
Appendix
1.2 Purpose
Our purpose is to empower users with a web-based application that tackles the twin challenges
of spam classification and sentiment analysis on Twitter. We believe that understanding the
authenticity and emotional undertones of online communication is crucial for navigating the
often chaotic world of social media.
1.3 Objective
Our key objectives are:
● Develop a robust spam detection system: Utilizing a combination of rule-based and
machine-learning algorithms, we aim to identify and filter unwanted content, ensuring
users see genuine voices on Twitter.
● Implement accurate sentiment analysis: By leveraging state-of-the-art NLP techniques,
we will extract the emotional tone from tweets, providing users with valuable insights into
public opinion and audience response.
● Craft a user-friendly chatbot: Integrating a conversational AI companion, we will offer
users a guide through the Twitterverse, allowing them to ask questions, analyze specific
tweets, and engage in interactive dialogue.
2.1 Introduction
This document defines the software requirements for Tweets Spamming Analysis & Sentiment Analysis, a web-based application
designed to combat spam and analyze sentiment on Twitter. It details the software's intended
purpose, scope, and functionalities, providing a clear understanding of its goals and desired
behavior.
2.1.1 Purpose
The purpose of this application is to:
● Detect and filter spam tweets: Identify unwanted content such as advertising bots
and malicious messages, providing users with a cleaner and more authentic Twitter
experience.
● Analyze the sentiment of tweets: Extract the emotional tone (positive, negative, neutral)
behind tweets, enabling users to understand public opinion and gauge audience
response.
● Offer an interactive chatbot: Provide users with a conversational AI companion to
answer questions about trending topics, analyze specific tweets, and engage in dialogue.
2.1.2 Scope
2.1.3 Definitions
● Spam: Unwanted or irrelevant content, including promotional tweets, bots, and malicious
messages.
● Sentiment Analysis: The process of automatically extracting the emotional tone (positive,
negative, neutral) from text.
● Natural Language Processing (NLP): A field of computer science concerned with the
interaction between computers and human (natural) languages.
● Chatbot: A computer program that simulates a conversation with human users.
● API: Application Programming Interface, a set of functions and protocols that allows one
program to communicate with another.
From the user's vantage point, Tweets Spamming Analysis & Sentiment Analysis seamlessly integrates
into their Twitter experience, offering a suite of powerful features without disrupting their usual
interactions.
System Interfaces:
● Seamless integration with Twitter: The application effortlessly connects with the Twitter API,
enabling users to analyze tweets, access their feeds, and interact with Twitter data without
leaving the platform.
● Browser-based accessibility: Users can access the application through any modern web browser,
eliminating the need for specific hardware or software installations.
User Interfaces:
● Intuitive interface: The design prioritizes user-friendliness, with clear navigation, informative
visualizations, and interactive elements.
● Streamlined spam filtering: Users can easily identify and filter spam tweets with a single click or
customize filtering thresholds based on their preferences.
● Visual sentiment analysis: Sentiment scores are presented using color-coding or charts, making
it easy to grasp the emotional tone of tweets and discussions.
● Chatbot integration: The chatbot seamlessly blends into the interface, providing a natural way to
ask questions, analyze sentiment, and engage in conversations, enhancing the user experience.
Hardware Interfaces:
● Minimal hardware requirements: The web-based nature of the application means it runs smoothly
on most devices with a stable internet connection, including smartphones, tablets, and
computers.
Software Interfaces:
● Compatibility with major browsers: The application is compatible with popular browsers like
Chrome, Firefox, Safari, and Edge.
Communications Interfaces:
● Secure communication with Twitter API: The application uses secure protocols (HTTPS) to
protect user data when interacting with the Twitter API.
Memory:
● Optimized for efficient performance: The application is designed to minimize memory usage,
ensuring smooth operation even on devices with limited resources.
Operations:
● Automatic updates: The application stays up-to-date with the latest features and security
enhancements through automatic background updates.
● Data backup and recovery: Options for backing up and restoring user data are available, ensuring
information security.
● Flexible configuration: The application can be customized to align with specific user preferences
or site-specific requirements, allowing for tailored experiences.
By seamlessly integrating these technical elements, Tweets Spamming Analysis & Sentiment Analysis
empowers users to navigate Twitter with clarity and confidence, offering a transformative experience that
filters noise, amplifies authentic voices, and unlocks the emotional landscape of the Twitterverse—all
within a user-friendly and accessible interface.
● Spam Detection:
○ Analyze tweets for spam indicators.
○ Assign spam probability scores.
○ Filter tweets based on spam score thresholds.
○ Flag potential spam for user review.
● Sentiment Analysis:
○ Analyze the sentiment of individual tweets.
○ Aggregate sentiment scores for hashtags, users, and topics.
○ Visualize sentiment analysis results.
● Chatbot Interaction:
○ Understand natural language queries about Twitter trends and user analysis.
○ Access and process relevant Twitter data.
○ Provide informative and engaging responses.
○ Maintain a conversational tone.
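To make the spam-scoring and threshold-filtering behavior described above concrete, the following is a minimal sketch. It assumes a scikit-learn style classifier exposing predict_proba(); names such as spam_model, DEFAULT thresholds, and ScoredTweet are illustrative, not part of the project code.

```python
# Minimal sketch of spam probability scoring and threshold-based filtering.
# spam_model and vectorizer are assumed to be a pre-trained scikit-learn
# classifier and text vectorizer; thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class ScoredTweet:
    text: str
    spam_probability: float

def score_tweets(spam_model, vectorizer, tweets):
    """Assign a spam probability score to each tweet."""
    features = vectorizer.transform(tweets)
    probabilities = spam_model.predict_proba(features)[:, 1]  # P(spam)
    return [ScoredTweet(t, p) for t, p in zip(tweets, probabilities)]

def filter_tweets(scored, spam_threshold=0.8, review_threshold=0.5):
    """Split tweets into ham, flagged-for-user-review, and filtered spam."""
    ham = [s for s in scored if s.spam_probability < review_threshold]
    flagged = [s for s in scored if review_threshold <= s.spam_probability < spam_threshold]
    spam = [s for s in scored if s.spam_probability >= spam_threshold]
    return ham, flagged, spam
```

The two thresholds correspond to the user-configurable filtering options: tweets above the spam threshold are removed, tweets in the middle band are flagged for review, and the rest are shown as genuine content.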
2.2.4 Constraints
● Technical limitations: The chosen technologies, APIs, and platforms impose limits, for
example Twitter API rate limits, computational resource limits for NLP tasks, and browser
compatibility constraints.
● Resource constraints: The project operates within budget limitations, available personnel,
and the development timeline.
● Legal and ethical considerations: Collecting and analyzing Twitter data must comply with
applicable regulations and respect user privacy.
Feature: Spam Detection
Description: Analyze tweets for spam markers, assign scores, provide filtering options
Priority: High
Assigned To: Team A
Estimated Time: 2 weeks
Dependencies: Twitter API, machine learning libraries

Feature: Sentiment Analysis
Description: Analyze the sentiment of tweets, aggregate scores, visualize outputs
Priority: High
Assigned To: Team A
Estimated Time: 3 weeks
Dependencies: NLP libraries, data storage
This section describes the functional and non-functional requirements of the system in
sufficient detail for the designers to design a system that satisfies the user requirements,
and for tests to verify that the system satisfies those requirements.
2.3.1 Functional Requirement
● 2.3.1.1 Spam Detection:
○ The application should analyze tweets for spam indicators such as keywords,
hashtags, suspicious links, and unusual posting patterns.
○ It should assign a spam probability score to each tweet, allowing users to filter or
flag potential spam.
○ Different filtering options should be provided, based on spam probability score
thresholds.
● 2.3.1.2 Sentiment Analysis:
○ The application should analyze the sentiment of tweets using NLP techniques
like lexicon-based analysis or machine learning models.
○ Sentiment analysis should be performed on individual tweets and aggregated for
specific hashtags, users, or topics.
○ The results should be presented visually, using charts or graphs, for easy
interpretation.
● 2.3.1.3 Chatbot:
○ The chatbot should understand natural language queries related to Twitter
trends, user analysis, and general information.
○ It should be able to access and process relevant Twitter data using the Twitter
API.
○ The chatbot should provide informative and engaging responses while
maintaining a conversational tone.
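To make requirement 2.3.1.2 concrete, the following is a minimal lexicon-based sketch using NLTK's VADER analyzer, one of the lexicon-based options the requirement allows. The hashtag-level aggregation and the polarity thresholds are illustrative assumptions, not the project's final implementation.

```python
# Lexicon-based sentiment sketch with per-tweet labels and per-hashtag
# aggregation. Requires: nltk.download("vader_lexicon").
from collections import defaultdict
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_tweet(text, pos=0.05, neg=-0.05):
    """Classify a single tweet as positive, negative, or neutral."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= pos:
        return "positive"
    if compound <= neg:
        return "negative"
    return "neutral"

def aggregate_by_hashtag(tweets):
    """Aggregate sentiment counts per hashtag, ready for charting."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for text in tweets:
        label = label_tweet(text)
        for token in text.split():
            if token.startswith("#"):
                counts[token.lower()][label] += 1
    return dict(counts)
```

The aggregated counts feed directly into the charts and graphs called for in the requirement.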
Use Cases

Use Case: Signup
Secondary Actor: System
Post-Condition: User account is created, and user is logged in.
Basic Flow:
1. User enters username, email, password, and confirmation of password.
2. System validates input and checks for duplicate accounts.
3. System creates account and stores user information.
4. System sends confirmation email (optional).
5. System logs user in and displays user profile page.
Alternate Flow:
1a. Invalid input (e.g., missing field, wrong format): System prompts user to correct errors.
1b. Duplicate username or email: System prompts user to choose a different username or email.
3a. Email delivery failure: System prompts user to verify email address.

Use Case: Login
Secondary Actor: System
Post-Condition: User is authenticated and granted access to the application.
Basic Flow:
1. User enters username or email address and password.
2. System verifies credentials against stored user information.
3. System grants access upon successful authentication and displays the dashboard or homepage.
Alternate Flow:
1a. Invalid credentials: System displays an error message and prompts user to retry.
1b. Account locked: System displays a message indicating the account is locked and provides instructions for unlocking.

Use Case: Chatbot Interaction (partial)
Post-Condition: User receives a response from the chatbot or completes their desired action.
Alternate Flow: (not specified)
Chapter 4: Design
In this section, we provide the design analysis of our modules, including the following diagrams:
1. Architecture Diagram
2. ERD with data dictionary
3. Data Flow Diagram
4. Class Diagram
5. Activity Diagram
6. Sequence Diagram
7. Collaboration Diagram
8. State Transition Diagram
9. Component Diagram
10. Deployment Diagram
Overall Architecture:
The architecture is a pipeline in which data flows through successive stages of processing
and analysis. Tweets are the main input, and the system outputs sentiment analysis results and spam
classification labels.
Key Components:
● Tweepy API: This component interacts with the Twitter API to retrieve tweets based on specific
criteria (e.g., keywords, hashtags).
● Tweet Pre-Processing: This stage cleans and prepares the tweet text for further analysis. It might
involve tasks like:
○ Filtering: Removing irrelevant content like usernames, URLs, and special characters.
○ Tokenization: Breaking down the text into individual words or phrases.
○ Normalization: Converting words to lowercase, stemming/lemmatization (reducing words
to their root form).
● Spam Detection: This stage analyzes the processed tweets to identify potential spam based on
various features and machine learning models.
○ Features: The system might extract features like keywords, hashtags, suspicious links,
unusual posting patterns.
○ Models: Different machine learning models, such as Logistic Regression, Naive Bayes, or
Support Vector Machines, could be used to classify tweets as spam or ham (non-spam).
● Sentiment Analysis: This stage analyzes the processed tweets to determine their emotional tone
(positive, negative, or neutral).
○ Techniques: This could involve lexicon-based analysis using sentiment dictionaries or
supervised machine learning models trained on labeled sentiment data.
● Testing Classifiers: This stage evaluates the performance of the spam detection and sentiment
analysis models using separate testing datasets. Different classifiers are compared to choose the
most accurate ones for deployment.
● Classifier with Highest Accuracy: The chosen classifiers for both spam detection and sentiment
analysis are used to process incoming tweets in the main pipeline.
● Classifying Given Tweet: This final stage applies the chosen classifiers to the processed tweet,
giving it a spam probability score and a sentiment label.
Figure 1 Architecture Diagram
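The following sketch shows how the pre-processing, "Testing Classifiers", and "Classifier with Highest Accuracy" stages from the component breakdown above could fit together. It assumes a labeled dataset of (tweet_text, is_spam) pairs and scikit-learn; the actual models, features, and parameters in the project may differ.

```python
# Train several candidate spam classifiers on TF-IDF features and keep the
# one with the highest test accuracy, mirroring the pipeline stages above.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def preprocess(text):
    """Filtering and normalization: strip usernames and URLs, lowercase."""
    text = re.sub(r"@\w+|https?://\S+", " ", text)
    return re.sub(r"[^a-z\s#]", " ", text.lower())

def best_spam_classifier(texts, labels):
    """Compare candidate classifiers and return the most accurate one."""
    cleaned = [preprocess(t) for t in texts]
    X_train, X_test, y_train, y_test = train_test_split(
        cleaned, labels, test_size=0.2, random_state=42
    )
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
        "linear_svm": LinearSVC(),
    }
    best_name, best_model, best_acc = None, None, 0.0
    for name, model in candidates.items():
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(X_train, y_train)
        acc = accuracy_score(y_test, pipeline.predict(X_test))
        if acc > best_acc:
            best_name, best_model, best_acc = name, pipeline, acc
    return best_name, best_model, best_acc
```

The returned pipeline plays the role of the "Classifier with Highest Accuracy" component and is what the deployed system would apply to incoming tweets.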
Relationships:
The ERD shows a one-to-many relationship between User and Tweet. This means that
one user can have many tweets, but each tweet belongs to only one user. This
relationship is enforced by the foreign key (userID) in the Tweet table, which references
the primary key (userID) in the User table.
Additional Notes:
● The ERD also specifies the data types for each attribute. For example, userID
and tweetId are integers, while username, email, password, and tweetText are
strings.
● The ERD does not show any constraints on the length of the attributes. However,
there are likely constraints on the lengths of username, email, password, and
tweet text.
Figure 3 ERD
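For illustration, the one-to-many relationship in the ERD can be expressed as a relational schema; the sketch below uses sqlite3 purely to keep the example self-contained, whereas the project's actual storage layer (MongoDB, per the technology list) would model this differently.

```python
# Relational sketch of the ERD: one User owns many Tweets via the userID
# foreign key. Table and column names follow the ERD description above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    userID   INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    email    TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL
);
CREATE TABLE tweet (
    tweetId   INTEGER PRIMARY KEY,
    tweetText TEXT NOT NULL,
    userID    INTEGER NOT NULL REFERENCES user(userID)  -- one user, many tweets
);
""")
```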
Overall Functionality:
● The diagram depicts a system that interacts with Twitter to facilitate the sending
and receiving of tweets.
● Users can either enter a query to search for tweets or compose a new tweet to
be posted.
● The system communicates with the Twitter API to retrieve or send tweets as
needed.
● The retrieved tweets or confirmation of a posted tweet are then displayed to the
user.
Key Points:
External Entity:
● User: This represents the person interacting with the system, likely entering
queries or composing tweets.
● Twitter API: This is the external service that provides access to Twitter data, such
as retrieving tweets based on specific criteria.
Processes:
● System: This encompasses the main functionalities of the system, further broken
down into subprocesses:
○ Validate User Input: Ensures the user's query or tweet adheres to
formatting requirements and is suitable for processing.
○ Construct API Request: Builds the appropriate request to send to the
Twitter API based on the validated user input.
○ Send Request to Twitter API: Transmits the constructed request to the
Twitter API.
○ Receive Response from Twitter API: Obtains the response containing
tweets from the Twitter API.
○ Preprocess Tweets: Cleans and prepares the received tweets for further
analysis, likely involving removing irrelevant information and tokenizing the
text.
○ Extract Features: Identifies relevant features from the preprocessed
tweets that can be used for analysis, such as n-grams or sentiment-
related word frequencies.
○ Spam Detection: Analyzes the extracted features to classify the tweets as
either spam or non-spam (ham). This might involve machine learning
algorithms trained on labeled data.
○ Sentiment Analysis: Analyzes the extracted features to classify the tweets
as positive, negative, or neutral sentiment. This could also involve
machine learning algorithms trained on labeled data.
○ Store Results: Saves the analyzed tweets and their associated labels
(spam/ham and sentiment) in a persistent storage system for future
reference or analysis.
○ Display Results: Presents the processed and analyzed tweets to the user,
potentially highlighting spam or sentiment classifications.
Data Flows:
● Enter query or tweet (User -> Validate User Input): The user's input is sent for
validation.
● Validated query or tweet (Validate User Input -> Construct API Request): The
validated input is used to create the API request.
● API request (Construct API Request -> Send Request to Twitter API): The
constructed request is sent to the Twitter API.
● Tweets (Twitter API -> Receive Response from Twitter API): The Twitter API
provides tweets in response to the request.
● Received tweets (Receive Response from Twitter API -> Preprocess Tweets):
The received tweets are sent for preprocessing.
● Preprocessed tweets (Preprocess Tweets -> Extract Features): The cleaned
tweets are used for feature extraction.
● Extracted features (Extract Features -> Spam Detection, Sentiment Analysis):
The identified features are used for both spam and sentiment analysis.
● Spam/Sentiment labels (Spam Detection, Sentiment Analysis -> Store Results):
The classification results are stored along with the tweets.
● Analyzed tweets with labels (Store Results -> Display Results): The final
processed and analyzed tweets are presented to the user.
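A minimal sketch of the first half of this data flow (Validate User Input, Construct API Request, Send Request, Receive Response) is shown below, assuming Tweepy v4 and a valid bearer token; the query-length limit, error handling, and pagination details are simplified assumptions.

```python
# Validate the user's query, construct and send the Twitter API request via
# Tweepy, and return the received tweet texts for preprocessing.
import tweepy

MAX_QUERY_LENGTH = 512  # assumed limit for recent-search queries on standard access

def validate_query(query: str) -> str:
    """Validate User Input: reject empty or oversized queries."""
    query = query.strip()
    if not query or len(query) > MAX_QUERY_LENGTH:
        raise ValueError("Query must be non-empty and within the allowed length")
    return query

def fetch_tweets(bearer_token: str, query: str, limit: int = 50):
    """Construct the API request, send it, and return the received tweets."""
    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(
        query=validate_query(query), max_results=min(limit, 100)
    )
    return [tweet.text for tweet in (response.data or [])]
```

The returned texts then enter the Preprocess Tweets and Extract Features processes described above.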
Data Stores:
Classes:
● User:
○ Attributes:
■ userID (int): A unique identifier for each user.
■ username (String): The user's chosen username.
■ email (String): The user's email address.
■ password (String): The user's password.
○ Methods:
■ createAccount(): Allows a user to create a new account.
■ login(): Allows a user to log in to their existing account.
■ updateProfile(): Allows a user to update their profile information.
● Tweet:
○ Attributes:
■ tweetID (int): A unique identifier for each tweet (assumed, as not
explicitly shown in the diagram).
■ tweetText (String): The text content of the tweet.
■ userID (int): The identifier of the user who posted the tweet (foreign
key).
○ Methods: (Not explicitly shown in the diagram, but would likely include
methods for creating and managing tweets)
● Chatbot:
○ Attributes: (Not explicitly shown in the diagram, but would likely include
attributes related to its conversational abilities and state)
○ Methods: (Not explicitly shown in the diagram, but would likely include
methods for interacting with users and generating responses)
Relationships:
● User - Tweet (One-to-Many): A single user can post multiple tweets, but each
tweet belongs to only one user. This is indicated by the 1...* multiplicity on the
User side of the relationship.
● User - Chatbot (Interacts With): This association indicates that users can interact
with the chatbot, but the specific nature of the interaction isn't explicitly defined in
the diagram.
Key Points:
● The diagram focuses on the core entities involved in a system that likely involves
user interactions, tweet management, and a chatbot component.
● It highlights the basic attributes and methods of each class, but doesn't provide
details about the chatbot's functionality or the specific interactions between users
and the chatbot.
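A skeletal Python rendering of the class diagram is given below; method bodies are placeholders because the diagram only names the operations, and the Chatbot's respond method is an assumption standing in for the unspecified "interacts with" association.

```python
# Skeleton classes mirroring the class diagram: User (1) -- (*) Tweet,
# plus a Chatbot that users interact with.
class User:
    def __init__(self, user_id: int, username: str, email: str, password: str):
        self.user_id = user_id
        self.username = username
        self.email = email
        self.password = password
        self.tweets: list["Tweet"] = []  # one-to-many: a user owns many tweets

    def create_account(self): ...
    def login(self): ...
    def update_profile(self): ...


class Tweet:
    def __init__(self, tweet_id: int, tweet_text: str, user_id: int):
        self.tweet_id = tweet_id
        self.tweet_text = tweet_text
        self.user_id = user_id  # foreign key back to the owning User


class Chatbot:
    def respond(self, user: User, message: str) -> str:
        """Placeholder for the 'interacts with' association in the diagram."""
        ...
```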
1. Retrieve tweets from Twitter API: This is the first step, where the system uses the
Twitter API to fetch tweets based on a specified query or criteria.
2. Preprocess tweets (clean, tokenize): The retrieved tweets are then cleaned and
preprocessed. This may involve removing irrelevant information such as stop
words, punctuation, and URLs, as well as tokenizing the text into individual words
or phrases.
3. Analyze sentiment using a trained model: The preprocessed tweets are then fed
into a sentiment analysis model, which classifies them as positive, negative, or
neutral based on their emotional tone.
4. Classify tweets as spam or ham: The preprocessed tweets are then classified as
either spam or ham (non-spam) by the spam detection model.
5. Filter out spam tweets: If a tweet is classified as spam, it is filtered out and not
displayed.
6. Display tweet as ham: If a tweet is classified as ham, it is displayed.
The activity diagram also shows two alternative paths for displaying tweets:
● Display sentiment analysis results: If the user wants to see the sentiment
analysis results for a particular tweet, they can click on a button to display them.
● Display tweet as ham: If the user does not want to see the sentiment analysis
results, they can simply click on the tweet to display it.
Overall, this activity diagram provides a good overview of the process of using
sentiment analysis to filter out spam tweets from Twitter.
Figure 7 Activity Diagram Create Account
Participants:
● User: A person who interacts with the system to search for tweets.
● System: The main system that handles the retrieval and display of tweets.
● Chatbot: A component within the system that interacts with the user and
potentially assists with tweet retrieval.
Interactions:
Key Points:
● The diagram highlights the primary flow of interactions for a tweet search
scenario.
● It suggests potential chatbot involvement in tweet processing, but leaves the
exact nature of that involvement open for interpretation.
1. Tweet retrieval: The process starts with the retrieval of tweets, either through a
user query or a continuous stream. This could involve interacting with the Twitter
API to fetch relevant tweets based on specific criteria.
2. Preprocessing: The retrieved tweets are then preprocessed to prepare them for
further analysis. This might involve cleaning the text by removing unnecessary
characters, punctuation, and stop words. Additionally, tokenization might occur,
where the tweet is broken down into individual words or phrases.
3. Sentiment analysis: The preprocessed tweets are then sent to the sentiment
analyzer. This component analyzes the emotional tone of the text and classifies it
as positive, negative, or neutral.
4. Spam detection: Simultaneously, the tweets are also passed to the spam
detector. This component utilizes various techniques to identify tweets that are
likely to be spam, such as analyzing the content for suspicious keywords,
patterns, or links.
5. Collaboration and decision: The sentiment analysis results and the spam
detection outcome are then combined to make a final decision about the tweet.
This could involve:
○ Displaying the tweet: If the tweet is classified as non-spam and has a
neutral or positive sentiment, it might be directly displayed to the user.
○ Flagging or filtering: If the tweet is classified as spam or has a negative
sentiment, it might be flagged for further review or filtered out from the
results.
○ Chatbot intervention: Depending on the specific system design, the
chatbot might intervene in certain cases. For example, it could interact
with the user to clarify the intent of a negative tweet or provide additional
information about a flagged tweet.
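The "collaboration and decision" step can be summarized as a small decision function; the sketch below combines the spam detector's score with the sentiment label, with thresholds and action names chosen for illustration rather than taken from the project code.

```python
# Combine spam detection and sentiment analysis outcomes into one decision.
def decide_action(spam_probability: float, sentiment: str,
                  spam_threshold: float = 0.8) -> str:
    """Return 'display', 'flag', or 'filter' for a processed tweet."""
    if spam_probability >= spam_threshold:
        return "filter"   # likely spam: remove from results
    if sentiment == "negative":
        return "flag"     # keep, but mark for review or chatbot follow-up
    return "display"      # ham with neutral or positive sentiment
```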
Components:
● Twitter API: This component interacts with the Twitter API to retrieve tweets
based on a specified query or criteria.
● Preprocessor: This component cleans and preprocesses the retrieved tweets.
This might involve removing irrelevant information such as stop words,
punctuation, URLs, and usernames, as well as tokenizing the text into individual
words or phrases.
● Feature Extractor: This component extracts relevant features from the
preprocessed tweets. These features could be linguistic features like n-grams or
sentiment-related features like word frequency of positive and negative words.
● Spam Detector: This component uses the extracted features to classify tweets as
spam or ham (non-spam). It might employ machine learning algorithms like Naive
Bayes or Support Vector Machines trained on labeled data to make these
predictions.
● Sentiment Analyzer: This component analyzes the sentiment of the tweets,
classifying them as positive, negative, or neutral. It could also use machine
learning algorithms trained on labeled data to perform this task.
● Persistence Store: This component stores the analyzed tweets and their
associated labels (spam/ham and sentiment) for future use or analysis.
● Visualization Tool: This component displays the results of the analysis in a user-
friendly format, such as charts or graphs. This could allow users to see trends in
spam and sentiment over time or for specific topics.
Interactions:
1. Tweets retrieved: The Twitter API retrieves tweets based on the user's query or
criteria.
2. Preprocessing and feature extraction: The tweets are preprocessed and relevant
features are extracted.
3. Spam detection: The features are used by the spam detector to classify the
tweets as spam or ham.
4. Sentiment analysis: The features are also used by the sentiment analyzer to
classify the tweets as positive, negative, or neutral.
5. Persistence: The analyzed tweets and their labels are stored in the persistence
store.
6. Visualization: The visualization tool retrieves the stored data and displays the
results to the user.
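An illustrative sketch of the Persistence Store and Visualization Tool interactions follows, assuming MongoDB (via pymongo) for storage and matplotlib for charts; the database, collection, and field names are hypothetical.

```python
# Persist analyzed tweets with their labels, then chart sentiment counts.
from pymongo import MongoClient
import matplotlib.pyplot as plt

def store_results(records, uri="mongodb://localhost:27017"):
    """Persist analyzed tweets with their spam and sentiment labels."""
    collection = MongoClient(uri)["tweet_analysis"]["results"]
    collection.insert_many(records)
    return collection

def plot_sentiment_counts(collection):
    """Visualize how many stored tweets fall into each sentiment class."""
    labels = ["positive", "neutral", "negative"]
    counts = [collection.count_documents({"sentiment": label}) for label in labels]
    plt.bar(labels, counts)
    plt.title("Sentiment distribution of analyzed tweets")
    plt.ylabel("Number of tweets")
    plt.show()
```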
Components:
Interactions:
1. Tweets retrieved: The Twitter Streaming API retrieves tweets based on the user's
query or criteria.
2. Streaming to Kafka: The tweets are streamed to Kafka, which buffers and
distributes them to the Spark Streaming component for real-time processing.
3. Real-time analysis: Spark Streaming performs real-time analysis on the tweets,
including preprocessing, feature extraction, and sentiment analysis.
4. Spam detection and Sentiment analysis: The extracted features are used by the
Spam Detector and Sentiment Analyzer to classify the tweets as spam/ham and
positive/negative/neutral, respectively.
5. Persistence and visualization: The analyzed tweets and their labels are stored in
Elasticsearch and visualized using Kibana.
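On the producer side of this deployment, the "Streaming to Kafka" step could look like the minimal sketch below; it assumes the kafka-python client and a local broker, with the topic name and JSON serialization chosen for illustration. Spark Streaming would consume the same topic downstream.

```python
# Publish incoming tweets to a Kafka topic for real-time processing.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish_tweet(tweet_text: str, topic: str = "raw-tweets"):
    """Buffer an incoming tweet on Kafka for Spark Streaming to analyze."""
    producer.send(topic, {"text": tweet_text})
    producer.flush()
```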
References
ChatGPT
Hugging Face
Kaggle
Appendix
A. Dataset Description:
The project utilized a diverse dataset comprising tweets sourced from the Twitter API.
The dataset included a mix of spam and legitimate tweets to ensure a representative
training and testing set for the developed algorithms.
B. Machine Learning Models:
A detailed overview of the machine learning models employed for spam detection,
including but not limited to Naive Bayes, Support Vector Machines (SVM), and neural
networks.
The rationale behind the selection of each model, hyperparameter tuning, and validation
strategies are discussed.
C. NLP Techniques for Sentiment Analysis:
The NLP techniques applied in sentiment analysis, such as sentiment lexicons, word
embeddings, and deep learning architectures.
An exploration of how these techniques were adapted to handle the nuances of social
media language, slang, and emojis.
D. Feature Engineering:
Explanation of the key features used for spam detection, such as word frequency, user
engagement metrics, and time-based features.
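A hedged sketch of these feature groups is shown below, covering word-frequency and time-based features; engagement metrics would come from the Twitter API payload and are omitted. The keyword lexicon and field names are illustrative assumptions.

```python
# Extract simple word-frequency and time-based features from one tweet.
from collections import Counter
from datetime import datetime

SPAM_KEYWORDS = {"free", "win", "click", "offer"}  # example lexicon, not the project's

def extract_features(tweet_text: str, created_at: datetime) -> dict:
    tokens = tweet_text.lower().split()
    counts = Counter(tokens)
    return {
        "length": len(tokens),                                    # word-frequency features
        "spam_keyword_hits": sum(counts[w] for w in SPAM_KEYWORDS),
        "url_count": sum(t.startswith("http") for t in tokens),
        "posted_hour": created_at.hour,                           # time-based features
        "posted_on_weekend": created_at.weekday() >= 5,
    }
```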
E. System Architecture:
Overview of the system architecture, including the data flow, processing pipeline, and
integration points with external APIs or tools.
Details on the technology stack used, highlighting any specific frameworks, libraries, or
platforms that played a pivotal role in the project.
F. User Interface Design:
Description of the user interface components and functionalities, emphasizing the user-
friendly aspects and design considerations.
G. Evaluation Metrics:
A comprehensive list of metrics used to evaluate the performance of the spam detection
and sentiment analysis models.
Discussion on the choice of metrics and their relevance in the context of social media
analysis.
H. Ethical Considerations:
Reflection on the ethical considerations taken into account during the project, particularly
regarding privacy, bias, and the responsible use of user-generated content.