Fake and Automated Account - Report (Sathiyabama)
By
SCHOOL OF COMPUTING
SATHYABAMA
INSTITUTE OF SCIENCE AND
TECHNOLOGY (DEEMED TO BE
UNIVERSITY)
Accredited with Grade “A” by NAAC | 12B Status by UGC | Approved by
AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600119
APRIL - 2023
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Bhavya k (Reg.No -
39110454) and Nikhitha k (Reg.No - 39110443) who carried out the Project Phase-2
entitled “DETECTING FAKE ACCOUNTS ON SOCIAL MEDIA -
INSTAGRAM” under my supervision from January 2023 to April 2023.
Internal Guide
I, Bhavya k (Reg. No. 39110454), hereby declare that the Project Phase-2 Report
entitled “DETECTING FAKE ACCOUNTS ON SOCIAL MEDIA - INSTAGRAM”
done by me under the guidance of Dr. D. Usha Nandini M.E., Ph.D., is
submitted in partial fulfillment of the requirements for the award of Bachelor of
Engineering degree in Computer Science and Engineering.
DATE: 20.04.2023
PLACE: Chennai SIGNATURE OF THE CANDIDATE
ABSTRACT
The proliferation of fake accounts on social media platforms such as Instagram
has become a significant challenge for both users and platform administrators.
Fake accounts are created for a variety of purposes, including spreading spam,
disseminating false information, and engaging in fraudulent activity. These
accounts can also negatively impact the user experience by generating fake
engagement, flooding comment sections with spam, and diluting the quality of
content on the platform. It has therefore become crucial to develop effective
methods for detecting and removing fake accounts from Instagram.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 LITERATURE SURVEY
2.1 Inferences from Literature Survey
3 REQUIREMENTS ANALYSIS
3.1 Feasibility / Risk Analysis of the Project
3.1.1 Feasibility Studies
3.1.2 Risk Analysis
3.2 Software Requirements Specification Document
3.2.1 Hardware Requirements
3.2.2 Software Requirements
3.3 System Use Case
4 DESCRIPTION OF PROPOSED SYSTEM
4.1 Methodology or Process Model
4.2 Architecture of the Proposed System
4.3 Description of Software for Implementation and Testing Plan of the Proposed Model/System
4.3.1 System
4.3.2 User
4.4 Project Management Plan
5 IMPLEMENTATION DETAILS
5.1 Development and Deployment Setup
5.1.1 PyCharm
5.1.2 Python
5.1.3 Flask Framework
5.1.4 Libraries Used
5.2 Algorithms
5.3 Testing
5.3.1 Unit Testing
6 RESULTS AND DISCUSSION
7 CONCLUSION
7.1 Conclusion
7.2 Future Work
7.3 Research Issues
7.4 Implementation Issues
REFERENCES
APPENDIX
A. Source Code
B. Screenshots
C. Research Paper

LIST OF FIGURES

6.1 Accuracy
6.5 Prediction
CHAPTER 1
INTRODUCTION
In recent times, some of the most well-known social network sites, such as
Instagram, Facebook, and Twitter, have become integral parts of daily life.
People use social network sites for e-commerce, entertainment, sharing
information and ideas, keeping in contact with long-lost friends, and finding new
acquaintances. Users of online social networks (OSNs) can post pictures and
videos, leave comments, and like the pictures that others have shared.
Fraudulent accounts are one of the downsides of online social networks. Whether
an account is fake can be checked from parameters such as the number of
followers and posts, or characteristics of the posted content such as the number
of likes, views, comments, and shares. Exploiting this, many fake accounts are
created that purchase comments and inflate their follower counts to appear
popular. Some are created specifically to commit fraudulent acts such as
disseminating misleading information and spreading malware, while others are
created to gain followers and popularity or to spam comments and likes. On
social media, one individual may create many profiles with distinct identities,
email addresses, and phone numbers; the majority of these are likely fake
accounts. There are two sorts of fake accounts: duplicate accounts and
fraudulent accounts. Users create multiple accounts as secondary accounts to
promote their e-businesses, and people and influencers establish duplicate
accounts to enhance their businesses and spread useful information. This type
of account is not harmful and does not violate social media terms and
conditions. Users create fraudulent accounts mainly to spread negativity, false
news, and hate speech, and to impersonate others; these accounts can be
considered dangerous and are called false accounts.
In recent years, many celebrities and businesses have created accounts on
Instagram, which they use to grow their businesses and fan bases. Furthermore,
many of them and other famous users use it as a platform for advertising. When
someone's follower count grows past a hundred thousand or into the millions, it
is no surprise that the account becomes a lucrative earner. Instagram is widely
used for sharing photos and videos and is profitable for celebrities, businesses,
and people with a considerable number of followers. At the same time, this high
profit has made the platform a potential venue for malicious activities.
Such versatility and reach have enabled the proliferation of abnormal accounts,
which behave in unusual ways. Academic researchers have mostly focused on
spammers and on accounts that put their efforts into spreading advertising,
spam, malware, and other suspicious activity. These malicious accounts usually
use automatic programs to improve their performance, hide their real identity,
and look like real users. In past years, the media have reported that the
accounts of celebrities, politicians, and some popular businesses have shown
suspicious inflation of followers. Fake Instagram accounts are especially used to
increase the number of followers of a target account.
CHAPTER 2
LITERATURE SURVEY
Detection of fake accounts is one of the major issues that should be solved as
soon as possible. Although various methods are already making an impact, none
of them is completely accurate. Drawing on various research papers on
detection, this literature review discusses the methodologies and approaches
that have been applied.
Ersahin et al. [1] provided a classification technique to find phony accounts on
Twitter. They pre-processed their dataset by applying the supervised Entropy
Minimization Discretization (EMD) method to numerical features and then
examined the output of the Naive Bayes algorithm. Merely by pre-processing
their dataset with this discretization strategy on chosen features, they improved
the accuracy of Naive Bayes from 85.55% to 90.41% for identifying bogus
accounts on Twitter.
Aditi Gupta et al. [2] used the Facebook Graph API to collect Facebook user
feeds. The intricate privacy controls on Facebook made it extremely difficult to
gather data. On a minimal dataset that included their own node, their
acquaintances in their social neighbourhood, and a collection of manually
identified spam accounts, they applied the most widely used supervised machine
learning classification approaches. They assessed these classifiers' performance
to identify the ones that produced high detection rates and the best capacity to
identify phony Facebook accounts.
Yeh-Cheng Chen et al. [3] introduced a novel, efficient technique for detecting
bogus profiles using machine learning. Their method, based on
account-by-account behavioural analysis, accurately evaluates whether an
account is genuinely fake, as opposed to graph-based approaches or manually
engineered features that cover only basic facts. Their detection models work
admirably on real-world data, and their findings indicate that the models are not
overfitting.
Estee van der Walt et al. [4] employed artificial intelligence to identify phony
identities. The authors clarified how to spot false profiles made by both bots and
humans, comparing the fraudulent accounts created on social networks by
people with those created by bots. They concluded that while the features
employed to detect the bots were successful, they were not entirely effective.
Lu Zhang et al. [5] suggested PSGD, a partially supervised model, to identify
spammer groups. In PSGD, a classifier is learned as a spammer-group detector
using PU-Learning. Tests on a real-world dataset from Amazon.cn show that the
proposed PSGD is effective and surpasses cutting-edge spammer detection
techniques, which they believe to be a significant and extremely promising
outcome.
Sarah Khaled et al. [6] investigated effective methods for spotting bots and
phony accounts on the Twitter network. They proposed an innovative technique,
SVM-NN, to provide effective detection of such profiles. Even though the
approach uses fewer features, it can still classify accounts with around
ninety-eight percent accuracy. Compared with the other two classifiers, the
newly suggested method performs better in terms of accuracy across all feature
sets, and its accuracy on the correlation feature set's records is astounding.
Zulfikar Alom et al. [7] investigated the nature of spam users on the Twitter
platform to improve on existing spam-detection mechanisms. The paper designs
a new mechanism and a more robust set of features to detect spammers on
Twitter. A Random Forest classifier is used, which gives better results than the
other methodologies. Building on this, they plan to develop a more effective
model that classifies various types of spammers across different social
networks.
Naman Singh et al. [8] observed that more and more bogus accounts are being
created on online networking platforms for nefarious purposes, and created a
forged human profile in order to successfully detect, identify, and remove fake
profiles. For bots, their ML models employed a variety of factors, such as how
many followers an account has on each platform's friend list. Since it is not
possible to distinguish phony profiles made by humans from those made by
cyborgs, they used a dataset containing phony profiles, classifying them as fake
or real in order to contrast fake and real accounts.
Shivangi Singhal et al. [9] introduced SpotFake, a multi-modal platform for
identifying bogus news. Their approach detects bogus news without considering
any other subtasks, making use of both the textual and the visual components of
an article. Text features were learned using language models (such as BERT),
and image features were learned using VGG-19 pre-trained on the ImageNet
dataset. All of their experiments used two publicly accessible datasets, Twitter
and Weibo.
Sowmya P et al. [10] addressed the serious issue of duplicate profiles generated
from the data of existing users. A collection of rules that can differentiate real
profiles from fake ones is used to detect phony profiles, and a detection method
that can find fraudulent or cloned profiles on online networking platforms such
as Twitter is presented.
Zeinab Shahbazi et al. [11] proposed an integrated system combining multiple
blockchains and NLP functions, leveraging ML models to find false news and
better anticipate bogus user accounts and posts. A reinforcement learning
methodology is applied to this workflow. The decentralized blockchain
architecture, which offers a framework for proof of digital content authorization,
improves the security of the platform. The idea behind this approach is to
provide a safe platform that can forecast and spot fake news on social media
networks.
Latha P et al. [12] noted that different classification techniques have been used
in the past to identify phony accounts on online networking media, but that their
ability to spot phony accounts on these websites must improve. To improve the
accuracy of identifying bogus accounts, their study uses ML technologies and
NLP; the Random Forest classification algorithm was chosen.
Michael Jonathan Ekosputra et al. [13] make it evident in their study's findings
that machine learning may be used to detect fake profiles; the models examined
include Logistic Regression, Bernoulli Naive Bayes, Random Forest, SVM, and
ANN. Any change or addition to the features will have an impact on each model's
accuracy. According to the research, adding parameters to a model will almost
certainly make it more accurate than a model with no parameters or in its default
form. Based on the results of the research's initial trial, the Random Forest
algorithm achieves the best outcome, with a startling accuracy of 0.92.
Karishma Anklesaria et al. [14] used different classifiers to classify profiles as
fake or real using a database of user accounts. This is accomplished through a
methodical approach that includes cleaning, pre-processing, feature selection,
and model training. All of the employed algorithms (Random Forest, AdaBoost,
MLP, SGD, and Artificial Neural Network) are then compared on the basis of
different evaluation parameters. According to the study, the Random Forest
classifier performs best.
Md Mahadi Hassan Sohan et al. [15] showed the importance of comments and
how they affect almost every aspect of social media data. People's perspectives
are significantly influenced by reviews; hence, identifying fake reviews is an
exciting study area. The study presented a method for spotting fake reviews
using machine learning that detects both review characteristics and reviewer
behaviour. The proposed method, which is semi-supervised and uses several
classifiers, was evaluated on a food review dataset. In the fake review detection
procedure, the findings reveal that the Random Forest classifier beats the other
classifiers. Moreover, the findings imply that including reviewers' behavioural
traits raises the F-score by 55.5% and the overall accuracy to 97.7%.
2.1 INFERENCES FROM LITERATURE SURVEY
The existing systems use very few factors to detect fake accounts. Prediction
becomes more accurate when a larger number of effective parameters is used.
In the previously used algorithms, if some of the inputs are not appropriate, the
algorithm cannot produce accurate results. Drawbacks of the existing systems
include:
● Low accuracy.
● High complexity.
● High inefficiency.
● Dependence on skilled persons.
CHAPTER 3
REQUIREMENTS ANALYSIS
3.1 FEASIBILITY / RISK ANALYSIS OF THE PROJECT
The feasibility of the project is analysed in this phase, and a business proposal
is put forth with a very general plan for the project and some cost estimates.
During system analysis, the feasibility study of the proposed system is carried
out to ensure that the proposed system is not a burden to the company. For
feasibility analysis, some understanding of the major requirements of the
system is essential.
Data quality risks: The quality of the data used to train the machine learning
model can affect the accuracy of the model. The data used to train the model
should be representative of the population of interest and should be free from
bias. The data should also be of high quality, with minimal missing values and
errors.
Model accuracy risks: The accuracy of the model depends on the selection of
the right machine learning algorithms, feature engineering, and parameter tuning.
The machine learning algorithms used should be appropriate for the task of fake
account detection, and should be able to handle large amounts of data. Feature
engineering should be carefully performed to identify the most relevant features
for the task. Parameter tuning should be performed to optimize the performance
of the model.
Privacy and ethical risks: The collection and processing of personal data raises
ethical and privacy concerns. The project should adhere to ethical guidelines,
particularly those related to the collection and processing of personal data. The
project should also ensure that appropriate privacy policies are in place to protect
the privacy of individuals whose data is collected.
3.2 SOFTWARE REQUIREMENTS SPECIFICATION DOCUMENT
5. View data
6. Select model
7. Output prediction
CHAPTER 4
DESCRIPTION OF PROPOSED SYSTEM
4.1 METHODOLOGY OR PROCESS MODEL
In the application, the user first uploads a CSV file of the dataset; the system
then takes that file and performs preprocessing. This means importing libraries
such as Pandas and NumPy, importing the dataset, cleaning null values,
converting categorical values into numerical values, and removing unwanted
columns. The null values are replaced using an imputation technique. The
dataset is then split into training and testing sets, with both the independent
variables (features) and the dependent (target) variable divided accordingly.
Next is model building, which means choosing a suitable algorithm and training
it as per the requirements. Different models, namely Decision Tree, Random
Forest, and Logistic Regression, are considered; the corresponding modules are
imported from sklearn, and the models are trained on the training dataset with
the fit() function. After model building, the model is applied to make predictions
and predict the response for the test dataset. The model's performance is then
evaluated by importing different metrics from sklearn.metrics. Random Forest
obtained the highest accuracy, 90%. This is how the system performs
preprocessing, and the user can view the dataset.
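The following is a minimal sketch of this workflow. It assumes the Kaggle dataset is saved as data.csv with the class label in the last column (as in the appendix code); dropping nulls here is a simplification of the imputation described above.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv('data.csv')
df = df.dropna()  # simplified; an imputer could replace nulls instead

x = df.iloc[:, :-1]  # independent variables (features)
y = df.iloc[:, -1]   # dependent variable (target)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(x_train, y_train)                   # train on the training set
    y_pred = model.predict(x_test)                # predict the test set
    print(name, accuracy_score(y_test, y_pred))   # evaluate with sklearn metrics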
Fig. 4.1 Flow chart
4.2 ARCHITECTURE / OVERALL DESIGN OF PROPOSED SYSTEM
4.3.1 System
1. Store Dataset:
● The system stores the dataset given by the user, where the dataset is
downloaded from Kaggle by the user.
2. Model Training:
● The system takes the data from the user and feeds that data to the
selected model.
4.3.2 User
1. Load Dataset
● The user can load the dataset he/she wants to work on.
2. View Dataset
● The dataset contains 697 records with 12 attributes, such as whether
there is a profile picture, the length of the username, the number of words
in the full name, the length of the full name, the description provided by
the user (known as the bio), any external URLs provided, whether the
account is private or public, the number of posts, the number of followers,
and the number of accounts the user follows. The user can view the
dataset.
3. Select Model:
● The user can apply one of the models (Random Forest, Decision Tree, or
Logistic Regression) to the dataset to obtain its accuracy; Random Forest
achieved the highest accuracy of 91.8%.
4. Prediction
● Passing parameters to predict the output.
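A minimal sketch of this prediction step is shown below. It assumes model is the trained Random Forest from the previous step, that the input values are supplied in the same order as the training columns, and that the label convention is 1 = fake, 0 = real; the values themselves are placeholders, not real account data.

# hypothetical user input: one value per training feature, in column order
user_input = [[1, 10, 2, 12, 0, 53, 0, 0, 150, 800, 400]]  # placeholder values
prediction = model.predict(user_input)[0]
# assumed label convention: 1 = fake, 0 = real
print('fake account' if prediction == 1 else 'real account')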
Instagram is an easy platform for spreading rumours, stealing personal
information, and similar abuse, and these kinds of malicious activities are
carried out using fake accounts. A user should therefore be aware of the
accounts he/she is interacting with. With the help of this project, anyone can
detect fake accounts on Instagram. To make detection easily available, we also
implemented an application, carefully taking all the existing systems into
consideration.

CHAPTER 5
IMPLEMENTATION DETAILS
5.1 DEVELOPMENT AND DEPLOYMENT SETUP
In this project, the whole process is carried out on the PyCharm platform. Three
well-known models were compared, and the Flask framework was used to
create an application. The dataset, which contains the Instagram information of
different users, was collected from Kaggle.
5.1.1 PyCharm
PyCharm is a popular integrated development environment (IDE) for the Python
programming language. Developed by JetBrains, PyCharm is designed to provide
developers with a wide range of tools and features to help them write, debug, and
maintain Python code more efficiently and effectively. In this section, we take a
more in-depth look at the key features and functionalities of PyCharm.
User Interface
PyCharm has a clean and intuitive user interface that is designed to be user-
friendly and customizable. The IDE supports a range of themes and color
schemes, which can be customized to suit individual preferences. PyCharm also
supports multiple windows and tabbed interfaces, which allows developers to
work with multiple files and projects simultaneously.
Code Completion and Suggestion
One of the most powerful features of PyCharm is its intelligent code completion
and suggestion engine. The IDE can suggest method names, class names, and
variable names based on the context of the code. PyCharm can also provide
suggestions for import statements and function arguments. This feature can help
developers write code faster and with fewer errors.
Debugging Tools
PyCharm comes with a powerful debugger that allows developers to step through
their code, set breakpoints, and inspect variables in real time. The debugger also
supports remote debugging, which can be useful for debugging code running on
a remote server. PyCharm also provides support for Django, Flask, and Pyramid
frameworks, which allows developers to debug web applications directly from the
IDE.
Code Analysis and Error Highlighting
PyCharm has a built-in code analysis engine that can identify potential errors in
the code and highlight them for the developer. The IDE can flag issues like
undefined variables, unused imports, and syntax errors, which can help catch
bugs early and improve code quality. PyCharm also provides support for code
inspections, which can help identify issues like code smells, design problems,
and potential performance issues.
Version Control Integration
PyCharm has built-in integration with popular version control systems like Git,
SVN, and Mercurial. This allows developers to manage their code changes, track
changes over time, and collaborate with others on their projects. PyCharm
provides support for features like code merging, conflict resolution, and commit
history visualization.
Refactoring Tools
PyCharm provides a range of tools to help developers refactor their code. These
tools include renaming variables, extracting code into functions, and moving code
between files. PyCharm also provides support for safe delete, which allows
developers to delete code and its references without introducing errors in the
codebase.
Code Templates and Generation
PyCharm has a library of code templates that can be used to quickly generate
common code patterns, such as loops, if statements, and function definitions.
The IDE can also generate boilerplate code for popular frameworks like Flask and
Django. PyCharm also provides support for code generation from UML diagrams
and database schemas.
Testing Tools
PyCharm has built-in support for unit testing, including test runners, code
coverage analysis, and debugging tools for tests. The IDE can also generate test
stubs and test code automatically, based on the code being tested. PyCharm
also provides support for running tests remotely, which can be useful for testing
code running on a remote server.
Integration with Other Tools
PyCharm can be integrated with other tools in the development workflow, such as
code linters, task managers, and build tools. PyCharm provides support for
popular Python linters like Flake8, Pylint, and Pyflakes. The IDE can also be
integrated with build tools like setuptools, distutils, and pip.
In addition to these features, PyCharm also provides support for a range of other
functionalities, such as code documentation, code formatting, code search, and
code snippets. PyCharm is available in both a free, open-source community
edition and a paid professional edition with additional features and support. It
runs on Windows, macOS, and Linux, and it can be used for a wide range of
Python projects, from small scripts to large web applications.
Step 1
Note that the professional package includes all the advanced features and
comes with a free trial for a few days; beyond the trial period, the user has to
buy a licensed key for activation. The community package is free and can be
downloaded and installed as and when required, and it includes all the basic
features needed.
Step 2
Download the community package (executable file) onto your system and
mention a destination folder for the installation.
5.1.2 Python
Installing Python
1. Download the latest Python installer for your operating system from
python.org.
2. Once the download is complete, run the exe to install Python. Now click on
Install Now.
3. Wait for the installation to complete.
4. When it finishes, you can see a screen that says the Setup was successful.
Now click on "Close".
Here are some key features of Python:
Readable and easy to learn: Python has a simple syntax and is easy to learn for
beginners. Its code is also highly readable and expressive, making it easy to
understand and maintain.
Interpreted language: Python is an interpreted language, meaning that code is
executed directly by the interpreter, without the need for compilation.
Cross-platform: Python code can run on a wide range of platforms, including
Windows, Mac, Linux, and mobile devices.
Dynamic typing: Python is a dynamically typed language, meaning that variable
types are inferred at runtime rather than being explicitly declared.
Strong standard library: Python has a large standard library that provides a wide
range of functionality out of the box, including modules for string processing,
regular expressions, network programming, and more.
Object-oriented: Python is an object-oriented language, meaning that everything
in Python is an object, including functions and modules.
High-level: Python is a high-level language, meaning that it provides abstractions
that make it easier to express complex concepts without worrying about low-level
details.
Versatile: Python can be used for a wide range of applications, including web
development, scientific computing, data analysis, artificial intelligence, and more.
Large community: Python has a large and active community of developers, with
a wealth of resources and libraries available.
Open source: Python is an open-source language, meaning that its source code
is freely available and can be modified and distributed by anyone.
5.1.3 Flask Framework
Flask is a web application framework for Python, designed to make building web
applications easy and fast. It is a lightweight framework that is easy to learn and
use, making it a popular choice for small to medium-sized web applications.
Flask is built on top of the Werkzeug WSGI toolkit and the Jinja2 template engine,
which provide a powerful set of tools for handling requests and responses, as
well as rendering HTML templates.
Installation:
Install the Flask library in PyCharm.
● Windows users can install Flask via the pip command:
pip install flask
One of the main features of Flask is its simplicity. Flask's API is very simple and
easy to use, with minimal boilerplate code required to get started. Flask is also
highly modular, allowing developers to easily add or remove components as
needed.
Flask provides a set of built-in tools for handling routing, input validation, form
processing, and more. It also supports a wide range of plugins and extensions,
making it easy to add additional functionality to your web application.
Flask uses a concept called "views" to handle requests from the user. A view is a
Python function that is decorated with the @app.route decorator, which specifies
the URL route that the function will handle. Views can return a variety of
responses, including HTML templates, JSON data, and more.
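As a minimal sketch of this idea, the following standalone view (with a hypothetical /hello route) returns JSON when its URL is requested:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/hello')  # binds the URL /hello to the view function below
def hello():
    return jsonify(message='Hello from Flask')  # views may also return HTML templates

if __name__ == '__main__':
    app.run(debug=True)  # start the development server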
Flask also provides built-in support for handling authentication and security, with
support for OAuth, token-based authentication, and more. It also supports a
variety of databases, including SQLite, MySQL, and PostgreSQL, making it easy
to integrate your web application with a database.
Another feature of Flask is its support for testing. Flask provides a built-in testing
framework that allows developers to write automated tests for their web
application, ensuring that it is functioning correctly and reliably.
Overall, Flask is a powerful and flexible web application framework that is well-
suited for small to medium-sized web applications. Its simplicity, modularity, and
flexibility make it easy to learn and use, while its built-in tools and support for
plugins and extensions make it easy to add additional functionality as needed. If
you're looking to build a web application with Python, Flask is definitely worth
considering.
5.1.4 Libraries Used
1. Matplotlib
Matplotlib is a widely used Python library for creating static, animated, and
interactive visualizations such as charts and plots. It integrates closely with
NumPy and Pandas, and it is the foundation on which higher-level plotting
libraries such as Seaborn are built.
2. Scikit-learn
Scikit-learn (also known as sklearn) is an open-source machine learning library for
the Python programming language. It provides a wide range of tools for building
and applying machine learning models, including supervised and unsupervised
learning, dimensionality reduction, feature selection, and model selection and
evaluation.
Scikit-learn includes implementations of many popular machine learning
algorithms, such as linear regression, logistic regression, decision trees, random
forests, support vector machines (SVMs), k-nearest neighbors (KNN), and
clustering algorithms (k-means, hierarchical clustering, etc.). It also includes tools
for preprocessing data (e.g., scaling, normalization, encoding categorical
variables, etc.) and for evaluating the performance of machine learning models
(e.g., cross-validation, metrics for classification and regression, etc.).
Scikit-learn is designed to be easy to use and provides a consistent interface for
all the algorithms it implements. It is built on top of other popular scientific
computing libraries in Python, such as NumPy, SciPy, and matplotlib, and is
widely used in both academia and industry for a variety of machine learning tasks.
To use the scikit-learn library in your Python code, you can import it using the
following command:
import sklearn
This will allow you to use all the functions and tools provided by scikit-learn.
Scikit-learn is an open-source data analysis library and the Python ecosystem's
gold standard for Machine Learning (ML). Its key algorithmic decision-making
techniques include:
● Classification: identifying and categorizing data based on patterns.
● Regression: predicting or projecting continuous data values from existing
data.
● Clustering: the automatic grouping of similar data into datasets.
Installation:
Install the scikit-learn library in PyCharm.
● Users can install scikit-learn via the pip command:
pip install scikit-learn
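As a small illustration of these preprocessing, classification, and evaluation tools, the sketch below uses scikit-learn's built-in breast cancer dataset rather than the project data, so it is self-contained:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(),                        # preprocessing: scaling
                     RandomForestClassifier(random_state=0))  # classification
scores = cross_val_score(pipe, X, y, cv=5)  # evaluation: 5-fold cross-validation
print('mean accuracy:', scores.mean())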
3. Pandas
Pandas is an open-source library designed primarily for working quickly and
intuitively with relational or labelled data. It offers a range of data structures and
operations for working with time-series data and quantitative information, and
the NumPy library serves as its foundation. Pandas is quick and offers its users
exceptional performance and productivity. The first step in using Pandas is to
check whether it is installed in the Python environment; if not, install it with the
pip command. Open a command prompt (cmd), use the cd command to move to
the location where the python-pip file is installed, and enter: pip install pandas,
and then use import pandas as pd in your code. In our project, the Pandas
library helps us work with the CSV-format dataset we have acquired.
Pandas is an open-source data manipulation and analysis library for the Python
programming language. It provides data structures for efficiently storing and
manipulating large datasets, as well as tools for data cleaning, merging,
reshaping, and aggregation.
The two primary data structures provided by Pandas are Series and DataFrame. A
Series is a one-dimensional labeled array that can hold any data type (integers,
floating-point numbers, strings, Python objects, etc.). A DataFrame is a two-
dimensional labeled data structure with columns of potentially different types. It is
similar to a spreadsheet or a SQL table.
Pandas provides a wide range of functions for working with data, including
filtering, sorting, grouping, joining, and reshaping. It also includes powerful tools
for data visualization, time series analysis, and data input/output.
To use the Pandas library in your Python code, you can import it using the
following command:
import pandas as pd
This will allow you to use all the functions and data structures provided by Pandas.
Installation:
Install the Pandas library in PyCharm.
● Users can install Pandas via the pip command:
pip install pandas
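A short sketch of the two primary data structures and of basic filtering; the column names and values here are hypothetical:

import pandas as pd

s = pd.Series([120, 3, 4500], name='followers')  # one-dimensional labeled array

df = pd.DataFrame({'username_length': [8, 15, 6],  # two-dimensional labeled table
                   'followers': [120, 3, 4500],
                   'is_private': [0, 1, 0]})
popular = df[df['followers'] > 100]  # filtering rows by a condition
print(df.describe())                 # summary statistics per column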
5. NumPy
NumPy (short for Numerical Python) is an open-source Python library that is used
for scientific computing and numerical analysis. It provides powerful tools for
working with arrays and matrices of numerical data, as well as a wide range of
mathematical functions for performing complex calculations.
To use the NumPy library in your Python code, you can import it using the
following command:
import numpy as np
This will allow you to use all the functions and data structures provided by
NumPy. It is open-source software and contains various important features,
including a powerful N-dimensional array object, sophisticated broadcasting
functions, and useful linear algebra, Fourier transform, and random number
capabilities.
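A brief sketch of these array features:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D N-dimensional array
print(a.shape)          # (2, 3)
print(a.mean(axis=0))   # column-wise mean
print(np.sqrt(a))       # element-wise mathematical function
print(a * 10)           # broadcasting a scalar across the array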
6. Seaborn
Seaborn is a Python data visualization library based on Matplotlib that provides
a high-level interface for drawing statistical graphics. Here are some of the
specific plot types and functions that Seaborn provides:
Line plots: Seaborn provides several functions for creating line plots, including
lineplot, relplot, and tsplot. These functions can be used to visualize trends in
your data over time or across different variables.
Scatter plots: Seaborn's scatterplot function can be used to create scatter plots
with one or more variables. You can customize the size and color of the points
based on additional variables to create multi-dimensional scatter plots.
Bar plots: Seaborn's barplot function can be used to create bar plots with one or
more variables. You can customize the colors and orientation of the bars to create
different types of bar plots.
Heatmaps: Seaborn's heatmap function can be used to create heatmaps to
visualize relationships between two variables. You can customize the color map
and annotations to create informative heatmaps.
Pair plots: Seaborn's pairplot function can be used to create scatter plots
between all pairs of variables in your data set. This can be a useful way to explore
the relationships between multiple variables.
Joint plot: A joint plot is used to visualize the relationship between two
continuous variables and the distribution of each variable.
Histogram: A histogram is used to visualize the distribution of a single continuous
variable.
Box plot: A box plot is used to visualize the distribution of a single continuous
variable and to identify outliers.
Violin plot: A violin plot is similar to a box plot, but it provides a more detailed
view of the distribution of the data.
Regression plots: Seaborn's regplot and lmplot functions can be used to create
regression plots to visualize the relationship between two variables.
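A minimal sketch using two of these functions on hypothetical toy data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'posts': [40, 1, 310, 15, 120],
                   'followers': [120, 3, 4500, 80, 950],
                   'fake': [0, 1, 0, 1, 0]})  # hypothetical values

sns.scatterplot(data=df, x='posts', y='followers', hue='fake')  # scatter plot
plt.show()

sns.heatmap(df.corr(), annot=True)  # heatmap of pairwise correlations
plt.show()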
5.2 ALGORITHMS
A tree has many analogies in real life, and it turns out to have influenced a wide
area of machine learning, covering both classification and regression. In decision
analysis, a decision tree can be used to visually and explicitly represent decisions
and decision making. As the name suggests, it uses a tree-like model of
decisions, and it is a commonly used tool in data mining for deriving a strategy to
reach a particular goal.
A decision tree is drawn upside down, with its root at the top. In the
accompanying figure, the bold text in black represents a condition (internal
node), based on which the tree splits into branches (edges). The end of a branch
that doesn't split anymore is the decision (leaf); in this case, whether the
passenger died or survived, represented as red and green text respectively.
A real dataset will have many more features, and this would just be one branch in
a much bigger tree, but the simplicity of this algorithm cannot be ignored. The
feature importance is clear, and relations can be viewed easily. This methodology
is known as learning a decision tree from data; the tree above is called a
classification tree, as the target is to classify a passenger as survived or died.
Regression trees are represented in the same manner, except that they predict
continuous values such as the price of a house. In general, decision tree
algorithms are referred to as CART (Classification and Regression Trees).
So, what is actually going on in the background? Growing a tree involves deciding
which features to choose and what conditions to use for splitting, along with
knowing when to stop. As a tree generally grows arbitrarily, you will need to trim it
down for it to look beautiful. Let's start with a common technique used for
splitting.
Decision trees are the building blocks of a random forest algorithm. A decision
tree is a decision support technique that forms a tree-like structure. An overview
of decision trees will help us understand how random forest algorithms work.
A decision tree consists of three components: decision nodes, leaf nodes, and a
root node. A decision tree algorithm divides a training dataset into branches,
which further segregate into other branches. This sequence continues until a leaf
node is attained. The leaf node cannot be segregated further.
The nodes in the decision tree represent attributes that are used for predicting the
outcome. Decision nodes provide a link to the leaves. The following diagram
shows the three types of nodes in a decision tree.
Information theory can provide more insight into how decision trees work.
Entropy and information gain are the building blocks of decision trees. An
overview of these fundamental concepts will improve our understanding of how
decision trees are built.
Entropy is a metric for calculating uncertainty. Information gain is a measure of
how uncertainty in the target variable is reduced, given a set of independent
variables.
The information gain concept involves using independent variables (features) to
gain information about a target variable (class). The entropy of the target variable
(Y) and the conditional entropy of Y (given X) are used to estimate the information
gain. In this case, the conditional entropy is subtracted from the entropy of Y.
Information gain is used in the training of decision trees. It helps in reducing
uncertainty in these trees. A high information gain means that a high degree of
uncertainty (information entropy) has been removed. Entropy and information gain
are important in splitting branches, which is an important activity in the
construction of decision trees.
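A small worked computation of these quantities; the labels below are illustrative, with 1 = fake and 0 = real:

import numpy as np

def entropy(labels):
    # H(Y) = -sum over classes of p * log2(p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # target variable (class)
x = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # one binary feature

h_y = entropy(y)  # entropy of the target
# conditional entropy H(Y|X): branch entropies weighted by branch size
h_y_given_x = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
print('information gain:', h_y - h_y_given_x)  # H(Y) - H(Y|X)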
Let's take a simple example of how a decision tree works: predicting whether a
customer will purchase a mobile phone. The features of the phone form the
basis of the decision, and the analysis can be presented in a decision tree
diagram. The root node and decision nodes of the tree represent the features of
the phone mentioned above, while the leaf node represents the final output,
either buying or not buying. The main features that determine the choice include
the price, internal storage, and Random Access Memory (RAM). The decision tree
will appear as follows.
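A minimal sketch of this example with hypothetical data, fitted with scikit-learn and printed as text so the learned splits are visible:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical phones: price (USD), storage (GB), RAM (GB) -> buy (1) / not buy (0)
data = pd.DataFrame({'price':   [200, 900, 450, 1100, 300, 750],
                     'storage': [64, 256, 128, 512, 64, 128],
                     'ram':     [4, 12, 6, 16, 4, 8],
                     'buy':     [1, 0, 1, 0, 1, 0]})

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)  # split by information gain
tree.fit(data[['price', 'storage', 'ram']], data['buy'])
print(export_text(tree, feature_names=['price', 'storage', 'ram']))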
5.3 TESTING
5.3.1 Unit Testing
Define the Input and Output: The first step is to define the input and output of
the function or method being tested. For example, the input could be a set of
features extracted from an Instagram account, and the output could be a binary
classification (real or fake).
Create Test Cases: Once you have defined the input and output, you can create
test cases to verify that the function or method works correctly. Test cases should
cover both positive and negative scenarios, including cases where the input is
valid and invalid.
Set Up Test Environment: You will need to set up a test environment to run the
test cases. This includes creating a testing database or file with sample data,
setting up any necessary dependencies, and configuring any environment
variables needed to run the tests.
Run Test Cases: Then run the test cases to check whether the function or method
is working as expected. The tests should validate that the model accurately
detects real and fake accounts based on the provided features.
Evaluate Results: After running the tests, evaluate the results to determine
whether the model is correctly identifying real and fake accounts. If any test
cases fail, you will need to debug and fix the code to address the issue.
Repeat the Process: Unit testing is an iterative process, and you need to repeat
the above steps as you make changes to the model or function.
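A minimal unittest sketch of these steps; predict_account is a hypothetical stand-in for the project's prediction function (a real test would load the trained model instead):

import unittest

def predict_account(features):
    # hypothetical rule standing in for model.predict(features)
    return 'fake' if features['followers'] < 10 and features['posts'] == 0 else 'real'

class TestFakeAccountDetection(unittest.TestCase):
    def test_detects_fake_account(self):
        self.assertEqual(predict_account({'followers': 2, 'posts': 0}), 'fake')

    def test_detects_real_account(self):
        self.assertEqual(predict_account({'followers': 500, 'posts': 80}), 'real')

if __name__ == '__main__':
    unittest.main()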
CHAPTER 6
RESULTS AND DISCUSSION
The user can select any of the three models, and the system will generate the
accuracy for the selected model. Users can also view the generated accuracy;
this is done on the model page. The model with the highest accuracy is then
selected. Logistic Regression gave 89.9% accuracy, Decision Tree gave 89.5%
accuracy, and Random Forest gave the highest accuracy of 91.8%.
Home Page
In the application, the home page is a welcome page showing information about
detecting fake accounts.
Fig. 6.1 Home
About Page
View Page
In the application, the view page shows information about the dataset with 12
attributes, which is downloaded from the Kaggle website.
Prediction
In the application, after choosing the algorithm that gave the highest accuracy,
the user can predict whether an account is fake or not from the user input.
Feature Analysis
Accuracy
Accuracy is the fraction of predictions the model gets right: Accuracy = (number
of correct predictions) / (total number of predictions). To illustrate this formula,
consider a binary classification problem with a test set of 100 instances, of
which 60 are labeled as class A and 40 are labeled as class B. We train a binary
classification model on the training set and then evaluate it on the test set; if
the model classifies 90 of the 100 test instances correctly, its accuracy is
90/100 = 90%.
Table of Accuracy

Model                  Accuracy
Logistic Regression    89.9%
Decision Tree          89.5%
Random Forest          91.8%

After comparing all three machine learning models (Logistic Regression, Random
Forest, and Decision Tree), Random Forest obtained the highest accuracy, 91.8%.
CHAPTER 7
CONCLUSION
7.1 CONCLUSION
Feature selection and extraction: Identifying the most relevant features for fake
account detection is a crucial step in developing an effective machine learning
model. Research in this area could focus on identifying new features that can help
distinguish between real and fake accounts, as well as exploring the best
methods for feature extraction and selection.
Ethics and privacy: Developing a fake account detection model raises important
ethical and privacy concerns, particularly when it comes to user data collection
and sharing. Research in this area could explore the best practices for balancing
the need for data privacy with the need for data quality, as well as the ethical
implications of using machine learning to detect and remove fake accounts on
social media platforms.
Feature engineering: Identifying the most relevant features for fake account
detection is another important step in developing an effective machine learning
model. However, feature engineering can be a complex process that requires
careful consideration of both the features themselves and the algorithms used to
extract them. Implementing an effective feature engineering process will be key to
developing an accurate and efficient model.
Model selection and tuning: Choosing the most appropriate machine learning
algorithm and tuning its parameters is another important step in developing an
effective fake account detection model. Different algorithms and parameter
settings can have a significant impact on the performance of the model, and the
process of selecting and tuning the model can be time-consuming and require a
deep understanding of machine learning techniques.
Deployment and scalability: Once a fake account detection model has been
developed, deploying it to a production environment can be a complex process.
The model may need to be integrated with existing systems and software, and it
may need to be tested and optimized for scalability and performance.
Implementing an effective deployment and scalability plan will be important for
ensuring that the model can be used effectively in a real-world setting.
A. SOURCE CODE
Home.html
<!DOCTYPE html>
<html>
<head>
<!-- Basic -->
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<!-- Mobile Metas -->
<meta name="viewport" content="width=device-width, initial-scale=1,
shrink-to-fit=no" />
<!-- Site Metas -->
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="" />
<title>Fake Account</title>
<link rel="icon" href="static/images/1.png" type="image/icon type">
<body>
<div class="hero_area">
<!-- header section starts -->
<header class="header_section">
<div class="container">
<nav class="navbar navbar-expand-lg custom_nav-container pt-3">
<a class="navbar-brand" href="{{url_for('index')}}">
<img src="static/images/1.png" alt="" style="width: 200px;" /><span>
</span>
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-
target="#navbarSupportedContent"
aria-controls="navbarSupportedContent" aria-expanded="false" aria-
label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
Index.html
<!DOCTYPE html>
<html>
<head>
<!-- Basic -->
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<!-- Mobile Metas -->
<meta name="viewport" content="width=device-width, initial-scale=1,
shrink-to-fit=no" />
<!-- Site Metas -->
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="" />
<title>Fake Account</title>
<link rel="icon" href="static/images/1.png" type="image/icon type">
<!-- slider stylesheet -->
<link rel="stylesheet" type="text/css"
href="https://cdnjs.cloudflare.com/ajax/libs/OwlCarousel2/2.1.3/assets/ow
l.carousel.min.css" />
<body>
<div class="hero_area">
<!-- header section starts -->
<header class="header_section">
<div class="container">
<nav class="navbar navbar-expand-lg custom_nav-container pt-3">
<a class="navbar-brand" href="{{url_for('index')}}">
<img src="static/images/1.png" alt="" style="width: 200px;" /><span>
</span>
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-
target="#navbarSupportedContent"
aria-controls="navbarSupportedContent" aria-expanded="false" aria-
label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
Model.html
<!DOCTYPE html>
<html>
<head>
<!-- Basic -->
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<!-- Mobile Metas -->
<meta name="viewport" content="width=device-width, initial-scale=1,
shrink-to-fit=no" />
<!-- Site Metas -->
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="" />
<title>Fake Account</title>
<link rel="icon" href="static/images/1.png" type="image/icon type">
<!-- slider stylesheet -->
<link rel="stylesheet"
type="text/css"href="https://cdnjs.cloudflare.com/ajax/libs/OwlCarousel2/
2.1.3/assets/owl.carousel.min.css" />
<!-- bootstrap core css -->
<link rel="stylesheet" type="text/css" href="static/css/bootstrap.css" />
<!-- fonts style -->
<link
href="https://fonts.googleapis.com/css?family=Poppins:400,700&display=swap" rel="stylesheet" />
<!-- Custom styles for this template -->
<link href="static/css/style.css" rel="stylesheet" />
<!-- responsive style -->
<link href="static/css/responsive.css" rel="stylesheet" />
</head>
<body>
<div class="hero_area">
<!-- header section starts -->
<header class="header_section">
<div class="container">
<nav class="navbar navbar-expand-lg custom_nav-container pt-3">
<a class="navbar-brand" href="{{url_for('index')}}">
<img src="static/images/1.png" alt="" style="width: 200px;" /><span>
</span>
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-
target="#navbarSupportedContent"
aria-controls="navbarSupportedContent" aria-expanded="false" aria-
label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<div class="d-f avbar-nav ">
<li class="nav-item active">
<a class="nav-link" href="{{url_for('index')}}">Home <span class="sr-
only">(current)</span></a>
</li>
<li class="nav-item">
<a class="nav-link" href="{{url_for('about')}}">About</a>
</li>
<li class="nav-item">
<a class="nav-link" href="{{url_for('view')}}">View </a>
</li
Prediction.html
<!DOCTYPE html>
<html>
<head>
<!-- Basic -->
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<!-- Mobile Metas -->
<meta name="viewport" content="width=device-width, initial-scale=1,
shrink-to-fit=no" />
<!-- Site Metas -->
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="" />
<title>Fake Account</title>
<link rel="icon" href="static/images/1.png" type="image/icon type">
<!-- slider stylesheet -->
<link rel="stylesheet"
type="text/css"href="https://cdnjs.cloudflare.com/ajax/libs/OwlCarousel2/
2.1.3/assets/owl.carousel.min.css" />
<!-- bootstrap core css -->
<link rel="stylesheet" type="text/css" href="static/css/bootstrap.css" />
<!-- fonts style -->
<link
href="https://fonts.googleapis.com/css?family=Poppins:400,700&display=swap" rel="stylesheet" />
<!-- Custom styles for this template -->
<link href="static/css/style.css" rel="stylesheet" />
<!-- responsive style -->
<link href="static/css/responsive.css" rel="stylesheet" />
</head>
<body>
<div class="hero_area">
<!-- header section starts -->
<header class="header_section">
<div class="container">
<nav class="navbar navbar-expand-lg custom_nav-container pt-3">
<a class="navbar-brand" href="{{url_for('index')}}">
<img src="static/images/1.png" alt="" style="width: 200px;" /><span>
</span>
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-
target="#navbarSupportedContent"
aria-controls="navbarSupportedContent" aria-expanded="false" aria-
label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<div class="d-flex ml-auto flex-column flex-lg-row align-items-center">
<ul class="navbar-nav ">
<li class="nav-item active">
<a class="nav-link" href="{{url_for('index')}}">Home <span class="sr-
App.py
import pandas as pd
from flask import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/about')
def about():
    return render_template('about.html')

# @app.route('/load', methods=["GET", "POST"])
# def load():
#     global df, dataset
#     if request.method == "POST":
#         data = request.files['data']
#         df = pd.read_csv(data)
#         dataset = df.head(100)
#         msg = 'Data Loaded Successfully'
#         return render_template('load.html', msg=msg)
#     return render_template('load.html')

@app.route('/view')
def view():
    global df, dataset
    df = pd.read_csv('data.csv')
    dataset = df.head(100)
    return render_template('view.html', columns=dataset.columns.values,
                           rows=dataset.values.tolist())

@app.route('/model', methods=['POST', 'GET'])
def model():
    if request.method == "POST":
        data = pd.read_csv('data.csv')
        x = data.iloc[:, :-1]  # independent variables (features)
        y = data.iloc[:, -1]   # dependent variable (target)
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.3, stratify=y, random_state=42)
        s = int(request.form['algo'])
        if s == 0:
            return render_template('model.html',
                                   msg='Please Choose an Algorithm to Train')
        elif s == 1:
            lr = LogisticRegression()
            lr = lr.fit(x_train, y_train)
            y_pred = lr.predict(x_test)
            acc_lr = accuracy_score(y_test, y_pred) * 100
            msg = 'The accuracy obtained by Logistic Regression is ' + \
                  str(acc_lr) + '%'
B. SCREENSHOTS
Ipynb.py
C. RESEARCH PAPER