"Exploratory Data Analysis"

Conference Paper · April 2023

DOI: 10.35629/5252-050413881392

All content following this page was uploaded by Mahendra Patil on 12 May 2024.

International Journal of Advances in Engineering and Management (IJAEM)
Volume 5, Issue 4 April 2023, pp: 1388-1392 www.ijaem.net ISSN: 2395-5252

“Exploratory Data Analysis”

Mohammed Salmanuddin [1], Rushikesh Kulkarni [2], Atharva Mohite [3], Prof. Mahendra Patil [4]
[1],[2],[3] Students, Department of Computer Engineering, Atharva College of Engineering, Mumbai, Maharashtra, India
[4] Professor, Department of Computer Engineering, Atharva College of Engineering, Mumbai, Maharashtra, India
----------------------------------------------------------------------------------------------------------------------------- ---------
Date of Submission: 15-04-2023 Date of Acceptance: 25-04-2023
---------------------------------------------------------------------------------------------------------------------------------------
ABSTRACT – This project aims to help incoming students find suitable accommodation by using the K-Means and DBSCAN clustering algorithms. The analysis is based on students' preferences for amenities, budget, and proximity to the location. The data consists of accommodation details in various neighborhoods of the city.
The study used exploratory data analysis techniques, such as descriptive statistics, univariate visualization, and multivariate visualization, to gain insights into the dataset. The K-Means and DBSCAN clustering algorithms were applied to group the accommodation into clusters based on the preferences of the students. The results showed that both algorithms successfully grouped the accommodation into clusters, with K-Means providing a more structured clustering and DBSCAN being more flexible and able to detect outliers and noise.
In conclusion, the project successfully applied the K-Means and DBSCAN clustering algorithms to assist students in finding the best accommodation in a new city. The study provided valuable insights into the preferences of students and how they influence the choice of accommodation. The findings of the study can assist incoming students in finding the most suitable accommodation based on their preferences.
Keywords: Machine Learning, Data Visualization, Data Cleaning, Student Accommodation, Geolocation, Geographic Information Systems, Evaluation.

I. INTRODUCTION
Exploratory Data Analysis (EDA) is an approach to data analysis that uses a variety of techniques to summarize the main characteristics of a data set, often with visual methods. EDA is useful for a range of purposes, such as maximizing insight into a data set, mapping out the underlying structure of the data, identifying useful variables, detecting outliers and anomalies, and testing a hypothesis. EDA is about getting to know and understanding your data before making any assumptions about it. The techniques used for analysis of the data are outlined below:
1) Clustering and dimension reduction: Creates graphical displays of high-dimensional data with many variables.
2) Univariate visualization: Looks at one variable of interest at a time.
3) Multivariate visualization: Analyzes multiple variables at the same time.
4) K-Means clustering.
5) Predictive models.

II. K-MEANS CLUSTERING
We are given a data set of items, each described by certain features and values for those features (like a vector). The task is to categorize those items into groups. To achieve this, we use the K-Means algorithm, an unsupervised learning algorithm. The 'K' in the name of the algorithm represents the number of groups/clusters into which we want to classify our items.
The algorithm categorizes the items into k groups or clusters of similarity. To calculate that similarity, we use the Euclidean distance as the measure. The algorithm works as follows:
1. First, we initialize k points, called means or cluster centroids, randomly.
2. We assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to that cluster so far.
3. We repeat the process for a given number of iterations, and at the end we have our clusters.
This project uses K-Means clustering to find the best accommodation for students in a city by classifying accommodation for incoming students on the basis of their preferences for amenities, budget, and proximity to the location.
Implementing the project takes you through the daily life of a data science engineer, from data preparation on real-life datasets, to visualizing the data and running machine learning algorithms, to presenting the results.

DOI: 10.35629/5252-050413881392 | Impact Factor value 6.18 | ISO 9001: 2008 Certified Journal Page 1388

III. DBSCAN CLUSTERING
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used in machine learning to group similar data points based on their spatial proximity and density. Unlike clustering algorithms that rely on a predetermined number of clusters, DBSCAN is capable of finding clusters of arbitrary shapes and sizes, making it a flexible and versatile tool for clustering data.
The DBSCAN algorithm starts by selecting an unvisited data point and examining its neighborhood, defined by the eps parameter. If there are at least minPts points in the neighborhood, the point is considered a core point and a cluster is formed around it. The algorithm then expands the cluster by adding the points in that neighborhood and, recursively, the neighborhoods of those neighbors that also have at least minPts points in their own neighborhood. Points that end up in no cluster are treated as noise.
The result of DBSCAN is a set of clusters, each containing a group of data points that are closely packed together and separated from other clusters by areas of lower density. The algorithm is capable of detecting clusters of arbitrary shapes and sizes, and it can handle noisy and sparse datasets.
By analyzing the data using DBSCAN, it is possible to identify clusters of students who have similar preferences and needs. This information can be used to make better decisions about the design and location of student accommodation facilities.

IV. OBJECTIVE
When people migrate to a new city for purposes such as education or employment, they need to handle issues such as finding a house or a place to stay, food in that location, the environment, and many others.
If properly analyzed data about rental houses and food preferences for a preferred location is available, newcomers can avoid searching for a rental house manually by visiting place after place. This reduces their difficulties, since accommodation is a basic necessity when migrating to a new city. This need led us to the idea of providing such properly analyzed, clustered data for a given location, which can be helpful while looking for a place to stay.
We decided to use a specific clustering method to cluster this unanalysed data properly and present it to the client. In this analysis, the main problem is the proper clustering of the available data and the use of that clustered data to plot the data on the geolocational map according to the clusters, for a better understanding.
The objective is to use the K-Means and DBSCAN algorithms, as they are unsupervised machine learning methods. K-Means is relatively simple to implement and understand and is guaranteed to converge, although possibly only to a local optimum, while DBSCAN generalizes to clusters of different shapes and sizes.

V. PROPOSED SOLUTION
The existing system lists hostels and apartments for rent, and it has buy and sell options. It does not recommend accommodation within the user's budget, and it rarely offers rental houses that match the user's preferences. It also does not recommend restaurants, gyms, etc., based on users' preferences. Previous research lacks accuracy in its recommendations.
The proposed system recommends hostels, apartments, and houses, and it also displays the details of those houses, apartments, and hostels. It recommends accommodation within the user's budget and based on the preferences given. It offers a large number of houses within the user's budget. It also recommends restaurants, gyms, etc., based on users' budgets. It provides accurate recommendations.
We use the K-Means algorithm in this project, but it has a drawback when two circular clusters centered at the same mean have different radii: K-Means uses mean values to define the cluster center and cannot differentiate between the two clusters. It also fails when the clusters are non-circular. To overcome this drawback, we use the DBSCAN algorithm along with K-Means.
By using both K-Means and DBSCAN, we can take advantage of the strengths of both algorithms. K-Means can be used to identify initial clusters, which can then be refined using DBSCAN. This hybrid approach can help to overcome the limitations of K-Means while still maintaining its efficiency, as K-Means can be computationally faster than DBSCAN.
Overall, combining K-Means and DBSCAN can lead to more accurate and robust clustering results, especially when dealing with complex and non-circular clusters.
The proposed workflow consists of the following steps:
1. Get datasets from the pertinent locations. (Data Collection)
2. Clean the datasets to prepare them for analysis. (Data Cleaning via Pandas)
3. Visualize the data using boxplots. (Using Matplotlib/Seaborn/Pandas)
4. Fetch geolocational data. (Foursquare API, REST APIs)
5. Use K-Means clustering to cluster the locations.
6. Discover the locations on the map. (Using Folium/Seaborn)

VI. TOOLS / TECHNOLOGIES USED
1. Python: Programming language used for the code implementation of the exploratory data analysis.
2. VS Code: An integrated development environment used for implementing the entire project.
3. Foursquare API: An API used to fetch geolocational data.
4. Seaborn: Used to visualize the data using boxplots.
5. Folium: Used for plotting locations on the map.
6. Pandas: Used for data cleaning to prepare the data for further analysis.

VII. RESULT AND ANALYSIS
Figure 1. BoxPlot Cleaned Data
Figure 2. BoxPlot By K-Means
Figure 3. Cleaned API Data
Figure 4. Cleaned Data
Figure 5. Clustered API Data (K-Means)
Figure 6. Clustered Locations of Student Accommodations (K-Means)
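
The boxplots listed among the results summarize each variable through its quartiles. As a minimal, library-free sketch of what such a plot encodes (the project itself names Seaborn, Matplotlib and Pandas for this step), the following computes the quartiles and the conventional 1.5 x IQR whisker fences for one variable. The function name `boxplot_summary` and the sample rents are hypothetical illustrations and are not data from the study.

```python
import statistics


def boxplot_summary(values):
    """Return the quartiles, whisker fences and outliers of one variable."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Conventional boxplot rule: points beyond 1.5 * IQR are drawn as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical monthly rents with one extreme listing (illustration only).
rents = [8000, 9000, 9500, 10000, 11000, 12000, 45000]
summary = boxplot_summary(rents)
print(summary)
```

Values flagged as outliers in this way are candidates for inspection during data cleaning; whether to drop or keep them is a judgment call that depends on the dataset.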

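The K-Means procedure described in Section II (random initial means, assignment of each item to its closest mean by Euclidean distance, mean update, repetition for a fixed number of iterations) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: it uses the common batch variant, in which all items are assigned before the means are recomputed, and the function name `k_means`, the `seed` parameter and the sample coordinates are hypothetical.

```python
import math
import random


def k_means(points, k, iterations=10, seed=0):
    """Cluster points into k groups; returns (means, assignments)."""
    rng = random.Random(seed)
    # Step 1: initialise the k means, here as k distinct items chosen at random.
    means = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    # Step 3: repeat for a given number of iterations.
    for _ in range(iterations):
        # Step 2a: assign each item to its closest mean (Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: math.dist(p, means[c]))
        # Step 2b: move each mean to the average of the items assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the previous mean if a cluster is empty
                means[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return means, assignments


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, assignments = k_means(points, k=2)
print(means)
print(assignments)
```

The cluster numbers themselves are arbitrary and depend on the random initialisation; only the grouping of the items is meaningful.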
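The DBSCAN procedure described in Section III can likewise be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the project's code: the names `dbscan`, `region_query` and `NOISE`, the parameter values and the sample coordinates are hypothetical, `eps` and `min_pts` correspond to the eps and minPts parameters in the text, and a point's neighborhood is taken to include the point itself.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        # points[i] is a core point: start a new cluster around it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Hypothetical coordinates: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.3),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1),
    (9.0, 1.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike K-Means, no number of clusters is passed in; the isolated point is reported as noise instead of being forced into a cluster.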

Figure 7. Clustered API Data (DBSCAN)
Figure 8. Clustered Locations of Student Accommodations (DBSCAN)

VIII. APPLICATIONS
The project model could help students and workers identify areas with a high concentration of accommodations that fit their budget and preferences, allowing them to make more informed decisions about where to live.
The project model can be used to analyze and predict the demand for accommodation in a specific location, which can be useful for businesses in the hospitality industry.
The clustering algorithms used in the model can also be applied to other datasets with similar features, such as restaurant or retail store locations.

IX. FUTURE SCOPE
The project model can be further refined and expanded by incorporating additional features, such as pricing data or customer reviews.
The model can be integrated with existing booking platforms to provide real-time recommendations for users based on their preferences and location.
The project can be extended to include predictive analytics for seasonal fluctuations in demand, which can help businesses optimize pricing and inventory management.

X. CONCLUSION
In conclusion, the project aimed to develop a clustered map model that would assist immigrant students and workers in finding suitable accommodation in a new place. The project used several techniques and methodologies, such as data mining, clustering algorithms, and Gantt charts, to implement the solution effectively. The results showed that the application was successful in clustering similar accommodations based on location, price, and amenities, and it provided accurate recommendations to the users.
The project has great potential for future applications and improvements. The use of further machine learning algorithms could enhance the accuracy of recommendations, and the inclusion of a feedback system could further improve the user experience. Additionally, the project could be expanded to cover more places and provide information on other aspects such as transportation and local culture. Overall, the project has the potential to greatly benefit international students and workers by providing them with a user-friendly platform to find accommodation and settle into a new environment with ease.

ACKNOWLEDGEMENT
We owe sincere thanks to our college, Atharva College of Engineering, for giving us a platform to prepare a project on the topic "Exploratory Analysis on Data", and we would like to thank our Principal, Dr. Ramesh Kulkarni, for instilling in us the need for this research and giving us the opportunities and time to conduct and present research on the topic.
We are sincerely grateful to have Prof. Mahendra Patil as our guide, and to him and Prof. Suvarna Pansambal, Head of the Computer Engineering Department, for their encouragement, constant support and valuable suggestions. Moreover, the completion of this research would have been impossible without the cooperation, suggestions and help of our friends and family.

REFERENCES
[1]. Exploratory Data Analysis Using Dimension Reduction [Tejas Nanaware, Prashant Mahajan, Ravi Chandak, Pratik Deshpande, Prof. Mahendra Patil]
[2]. Automating Exploratory Data Analysis via Machine Learning [Tova Milo, Amit Somech]
[3]. Visualization Methods for Exploratory Data Analysis [IEEE A. Nasser, D. Hamad, C.Sar]

[4]. Exploratory Analysis of Geo-Locational Data - Accommodation Recommendation [M. Sumithra, A. Sai Pavithra, L. Sowmiya]
[5]. Clustering Evaluation by Davies-Bouldin Index (DBI) in Cereal data using K-Means [Akhilesh Kumar Singh, Shantanu Mittal, Prashant Malhotra]
[6]. Exploratory Data Analysis using Artificial Neural Networks [Sriram D, Kalaivani K, Ulaga Priya K, Saritha A, Sajeevram A]
[7]. Exploratory analysis of the fire statistics using automatic time series decomposition [M.M. Tatur, A.G. Ivanitskiy]
