Conference Paper · April 2023


DOI: 10.35629/5252-050413881392

International Journal of Advances in Engineering and Management (IJAEM)
Volume 5, Issue 4 April 2023, pp: 1388-1392 www.ijaem.net ISSN: 2395-5252

“Exploratory Data Analysis”


Mohammed Salmanuddin [1], Rushikesh Kulkarni [2], Atharva Mohite [3], Prof. Mahendra Patil [4]
[1],[2],[3] Students, Department of Computer Engineering, Atharva College of Engineering, Mumbai, Maharashtra, India
[4] Professor, Department of Computer Engineering, Atharva College of Engineering, Mumbai, Maharashtra, India
---------------------------------------------------------------------------------------------------------------------------------------
Date of Submission: 15-04-2023 Date of Acceptance: 25-04-2023
---------------------------------------------------------------------------------------------------------------------------------------
ABSTRACT – This project aims to help incoming students find suitable accommodation by using the K-Means and DBSCAN clustering algorithms. The analysis is based on students' preferences for amenities, budget, and proximity to the location. The data consists of accommodation details in various neighborhoods of the city.

The study used exploratory data analysis techniques, such as descriptive statistics, univariate visualization, and multivariate visualization, to gain insights into the dataset. K-Means and DBSCAN were then applied to group the accommodation into clusters based on the preferences of the students. The results showed that both algorithms successfully grouped the accommodation into clusters, with K-Means providing a more structured clustering and DBSCAN being more flexible and able to detect outliers and noise.

In conclusion, the project successfully applied K-Means and DBSCAN to assist students in finding the best accommodation in a new city. The study provided valuable insights into the preferences of students and how they influence the choice of accommodation. The findings can assist incoming students in finding the most suitable accommodation based on their preferences.

Keywords: Machine Learning, Data Visualization, Data Cleaning, Student Accommodation, Geolocation, Geographic Information Systems, Evaluation.

I. INTRODUCTION
Exploratory Data Analysis (EDA) is an approach to data analysis that uses a variety of techniques to summarize the main characteristics of a data set, often with visual methods. EDA is useful for a range of purposes, such as maximizing insight into a data set, mapping out the underlying structure of the data, identifying useful variables, detecting outliers and anomalies, and testing a hypothesis. EDA is about getting to know and understanding your data before making any assumptions about it. The techniques used for analysis of the data are outlined below:
1) Clustering and dimension reduction: Creates graphical displays of high-dimensional data with many variables.
2) Univariate visualization: Looks at one variable of interest at a time.
3) Multivariate visualization: Analyzes multiple variables at the same time.
4) K-Means clustering.
5) Predictive models.

II. K-MEANS CLUSTERING
We are given a data set of items, each described by certain features and the values of those features (like a vector). The task is to categorize those items into groups. To achieve this, we use the K-Means algorithm, an unsupervised learning algorithm. The 'K' in the name represents the number of groups/clusters into which we want to classify our items.

The algorithm categorizes the items into k groups or clusters of similarity. To measure that similarity, we use the Euclidean distance. The algorithm works as follows:
1. First, we initialize k points, called means or cluster centroids, randomly.
2. We assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to that cluster so far.
3. We repeat the process for a given number of iterations and, at the end, we have our clusters.

This project uses K-Means clustering to find the best accommodation for students in a city by classifying accommodation for incoming students on the basis of their preferences for amenities, budget, and proximity to the location.

Implementing the project will take you through the daily life of a data science engineer, from data preparation on real-life datasets to visualizing the data and running machine learning algorithms, to presenting the results.
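
The three K-Means steps listed in this section can be sketched in plain Python. This is a minimal illustration and not the project's actual implementation: the function name `k_means`, the `seed` parameter, and the sample rent/distance values are assumptions made for the example, and the means are recomputed once per pass over the data (a batch update) instead of after every single item.

```python
import math
import random


def k_means(items, k, iterations=10, seed=0):
    """Cluster `items` (tuples of numbers) into k groups; returns (means, assignment)."""
    rng = random.Random(seed)
    # Step 1: initialize the k means randomly, here by picking k distinct items.
    means = [list(p) for p in rng.sample(items, k)]
    assignment = [0] * len(items)

    for _ in range(iterations):  # Step 3: repeat for a fixed number of iterations
        # Step 2a: assign each item to its closest mean (Euclidean distance).
        for i, item in enumerate(items):
            assignment[i] = min(range(k), key=lambda c: math.dist(item, means[c]))
        # Step 2b: move each mean to the average of the items assigned to it.
        for c in range(k):
            members = [p for p, a in zip(items, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignment


# Made-up listings: (monthly rent in thousands, distance to campus in km).
listings = [(8, 1.0), (9, 1.5), (10, 0.8), (25, 6.0), (27, 5.5), (30, 7.0)]
means, assignment = k_means(listings, k=2)
for listing, cluster in zip(listings, assignment):
    print(listing, "-> cluster", cluster)
```

Because the starting means are random, different seeds can produce different clusters. In practice the features would also be scaled first so that one of them (here, rent) does not dominate the distance.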

III. DBSCAN CLUSTERING
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used in machine learning to group similar data points based on their spatial proximity and density. Unlike clustering algorithms that rely on a predetermined number of clusters, DBSCAN is capable of finding clusters of arbitrary shapes and sizes, making it a flexible and versatile tool for clustering data.

The DBSCAN algorithm starts by selecting an unvisited data point and examining its neighborhood, defined by the eps parameter. If there are at least minPts points in the neighborhood, the point is considered a core point and a cluster is formed around it. The algorithm then expands the cluster by adding the neighboring points and recursively continuing the expansion from every neighbor that also has at least minPts points in its own neighborhood. Points that cannot be reached from any core point are treated as noise.

The result of DBSCAN is a set of clusters, each containing a group of data points that are closely packed together and separated from other clusters by areas of lower density. The algorithm is capable of detecting clusters of arbitrary shapes and sizes, and it can handle noisy and sparse datasets.

By analyzing the data using DBSCAN, it is possible to identify clusters of students who have similar preferences and needs. This information can be used to make better decisions about the design and location of student accommodation facilities.

IV. OBJECTIVE
People migrate to a new city for various purposes, such as education or a job, and must then deal with issues like finding a house or a place to stay, food, the local environment, and many others.

Finding a place to stay is a basic necessity when migrating to a new city. If properly analyzed data on rental houses, food preferences, and preferred locations is available, newcomers no longer need to search for a rental house manually by visiting place after place, and their difficulties are reduced. This need led us to the idea of providing such analyzed, clustered data for a given location to help people who are looking for a place to stay.

We use a clustering method to organize this unanalysed data and present it to the client. In this analysis, the main problem is clustering the available data properly and then plotting the clustered data on a geolocational map for a better understanding.

The objective is to use the K-Means and DBSCAN algorithms, as both are unsupervised machine learning methods. K-Means is relatively simple to implement and understand, is guaranteed to converge, and can be generalized to clusters of different shapes and sizes.

V. PROPOSED SOLUTION
The existing system lists hostels and apartments for rent and also offers buy and sell options. It does not recommend accommodation within our budget, and it rarely offers rental houses that match our preferences. It also does not recommend restaurants, gyms, etc. based on users' preferences. Previous research lacks accuracy in its recommendations.

The proposed system recommends hostels, apartments, and houses, and it displays the details of each. It recommends accommodation within our budget and based on the preferences given, and it offers many houses within our budget. It also recommends restaurants, gyms, etc. based on users' budgets. It provides accurate recommendations with few shortcomings.

We use the K-Means algorithm in this project, but it has a drawback when two circular clusters centered at the same mean have different radii: K-Means uses mean values to define the cluster center and cannot differentiate between the two clusters. It also fails when the clusters are non-circular. To overcome this drawback, we use the DBSCAN algorithm along with K-Means.

By using both K-Means and DBSCAN, we can take advantage of the strengths of both algorithms. K-Means can be used to identify initial clusters, which can then be refined using DBSCAN. This hybrid approach can help to overcome the limitations of K-Means while still maintaining its efficiency, as K-Means can be computationally faster than DBSCAN.

Overall, combining K-Means and DBSCAN can lead to more accurate and robust clustering results, especially when dealing with complex and non-circular clusters.

1. Get datasets from the pertinent locations. (Data Collection)
2. Clean the datasets to prepare them for analysis. (Data Cleaning via Pandas)
3. Visualize the data using boxplots. (Using Matplotlib/Seaborn/Pandas)
4. Fetch geolocational data. (Foursquare API, REST APIs)
5. Use K-Means clustering to cluster the locations.
6. Discover the locations on the map. (Using Folium/Seaborn)
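
The proposed solution brings in DBSCAN for the case K-Means handles poorly: clusters that share a center but differ in radius. The DBSCAN procedure from Section III can be sketched in plain Python on exactly that layout. The sketch is illustrative only: the function names, the brute-force neighbor search, the synthetic points, and the chosen `eps` and `min_pts` values are assumptions, not the project's code or settings.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point too: keep expanding
                queue.extend(j_neighbors)
    return labels


# Synthetic layout: a small blob at the origin, a ring of radius 5 around it,
# and one far-away point.
inner = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
ring = [(5 * math.cos(2 * math.pi * t / 24), 5 * math.sin(2 * math.pi * t / 24))
        for t in range(24)]
outlier = [(20.0, 20.0)]
labels = dbscan(inner + ring + outlier, eps=1.5, min_pts=3)
print(labels)
```

On this layout the inner blob and the surrounding ring come out as two separate clusters and the far-away point is labeled as noise, even though the blob and the ring have the same center.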

VI. TOOLS / TECHNOLOGIES USED


1. Python :- Programming language used for the code implementation of the exploratory data analysis.
2. VS Code :- An integrated development environment used for implementing the entire project.
3. Foursquare API :- An API used to fetch geolocational data.
4. Seaborn :- Used to visualize the data using boxplots.
5. Folium :- Used for plotting locations on the map.
6. Pandas :- Used for data cleaning to prepare the data for further analysis.
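
The tools list assigns data cleaning to Pandas and boxplots to Seaborn. To show what those two steps compute, the sketch below redoes them with only the Python standard library. This is a deliberate substitution for illustration and not the project's code: the records, field layout, and variable names are invented, and the 1.5 x IQR fence is the common boxplot whisker convention, not a setting taken from the paper.

```python
import statistics

# Hypothetical raw records: (listing name, monthly rent); None marks a missing value.
raw = [("A", 8000), ("B", None), ("C", 9500), ("D", 9000), ("E", 8500),
       ("F", 60000), ("G", 10000), ("A", 8000)]

# Cleaning: drop rows with a missing rent and exact duplicate rows, keeping order.
seen = set()
clean = []
for row in raw:
    if row[1] is None or row in seen:
        continue
    seen.add(row)
    clean.append(row)

# Boxplot statistics: quartiles, interquartile range, and outlier fences.
rents = sorted(rent for _, rent in clean)
q1, median, q3 = statistics.quantiles(rents, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in rents if r < low_fence or r > high_fence]

print("rows kept:", len(clean))
print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
```

On this sample the cleaning step leaves six listings, and the 60000 rent falls above the upper fence, so a boxplot would draw it as an individual outlier point.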

VII. RESULT AND ANALYSIS


[Figure 1: BoxPlot Cleaned Data]
[Figure 2: BoxPlot By K-Means]
[Figure 3: Cleaned API Data]
[Figure 4: Cleaned Data]
[Figure 5: Clustered API Data (K-Means)]
[Figure 6: Clustered Locations of Student Accommodations (K-Means)]


[Figure 7: Clustered API Data (DBSCAN)]
[Figure 8: Clustered Locations of Student Accommodations (DBSCAN)]

VIII. APPLICATIONS
The project model could help students and workers identify areas with a high concentration of accommodations that fit their budget and preferences, allowing them to make more informed decisions about where to live.

The project model can be used to analyze and predict the demand for accommodation in a specific location, which can be useful for businesses in the hospitality industry.

The clustering algorithms used in the model can also be applied to other datasets with similar features, such as restaurant or retail store locations.

IX. FUTURE SCOPE
The project model can be further refined and expanded by incorporating additional features, such as pricing data or customer reviews.

The model can be integrated with existing booking platforms to provide real-time recommendations for users based on their preferences and location.

The project can be extended to include predictive analytics for seasonal fluctuations in demand, which can help businesses optimize pricing and inventory management.

X. CONCLUSION
In conclusion, the project model aimed to develop a clustered map model that would assist immigrant students and workers in finding suitable accommodations in a new place. The project utilized several techniques and methodologies, such as data mining, clustering algorithms, and Gantt charts, to implement the solution effectively. The results showed that the application was successful in clustering similar accommodations based on location, price, and amenities, and it provided accurate recommendations to the users.

The project has great potential for future applications and improvements. The use of machine learning algorithms could enhance the accuracy of recommendations, and the inclusion of a feedback system could further improve the user experience. Additionally, the project could be expanded to cover more places and provide information on other aspects such as transportation and local culture. Overall, the project has the potential to greatly benefit international students and workers by providing them with a user-friendly platform to find accommodations and settle into a new environment with ease.

ACKNOWLEDGEMENT
We owe sincere thanks to our college, Atharva College of Engineering, for giving us a platform to prepare a project on the topic "Exploratory Analysis on Data", and we would like to thank our Principal, Dr. Ramesh Kulkarni, for instigating within us the need for this research and giving us the opportunities and time to conduct and present research on the topic.

We are sincerely grateful for having Prof. Mahendra Patil as our guide and Prof. Suvarna Pansambal, Head of the Computer Engineering Department, for their encouragement, constant support and valuable suggestions. Moreover, the completion of this research would have been impossible without the cooperation, suggestions and help of our friends and family.