EDA - Exploratory Data Analysis in Python
Last Updated: 10 May, 2025
Exploratory Data Analysis (EDA) is an important step in data analysis that focuses on understanding patterns, trends and relationships through statistical tools and visualizations. Python offers libraries such as pandas, NumPy, Matplotlib, seaborn and Plotly that enable effective exploration and insight generation to support further modeling and analysis. In this article, we will see how to perform EDA using Python.
Key Steps for Exploratory Data Analysis (EDA)
Let's look at the various steps involved in Exploratory Data Analysis:
Step 1: Importing Required Libraries
We need the Pandas, NumPy, Matplotlib and Seaborn libraries installed in Python to proceed further.
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')
Step 2: Reading Dataset
Download the WineQT.csv dataset and let's read it using pandas.
Python
df = pd.read_csv("/content/WineQT.csv")
print(df.head())
Output:
First 5 rows
Step 3: Analyzing the Data
1. df.shape: This attribute gives the number of rows (observations) and columns (features) in the dataset, which provides an overview of the dataset's size and structure.
Python
print(df.shape)
Output:
(1143, 13)
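As a small illustration of the point above, the sketch below builds a tiny stand-in DataFrame and unpacks its shape into a row count and a column count. The name `sample` and its values are made up for this example and are not taken from WineQT.csv.

```python
import pandas as pd

# Hypothetical stand-in data; these are not rows from WineQT.csv
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple this way is handy when the row count is needed later, for example to turn counts into percentages.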
2. df.info(): This method helps us understand the dataset by showing the number of non-null entries in each column, the data type of each column and how much memory the dataset uses. Comparing the non-null counts with the total number of rows reveals whether any values are missing.
Python
df.info()
Output:
info()
3. df.describe(): This method gives a statistical summary of the DataFrame showing values like count, mean, standard deviation, minimum, maximum and quartiles for each numerical column. It helps in summarizing the central tendency and spread of the data.
Python
print(df.describe())
Output:
describe()
4. df.columns.tolist(): This converts the column names of the DataFrame into a Python list, making it easy to access and manipulate the column names.
Python
print(df.columns.tolist())
Output:
column names
Step 4: Checking Missing Values
df.isnull().sum(): This checks for missing values in each column and returns the total number of null values per column helping us to identify any gaps in our data.
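To make the idea concrete, here is a minimal sketch on a made-up DataFrame showing the per-column null count together with the share of missing values in each column. The name `sample` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a few gaps; not rows from WineQT.csv
sample = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5, 11.2],
    "pH": [3.51, 3.20, np.nan, np.nan],
    "quality": [5, 5, 6, 7],
})

# isnull() yields a boolean frame; sum() counts the True values per column
missing_count = sample.isnull().sum()
# mean() of the same boolean frame gives the fraction missing per column
missing_pct = sample.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Reporting the percentage next to the raw count can make it easier to judge whether a column is worth keeping or imputing.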
Python
print(df.isnull().sum())
Output:
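Missing values in each column
Step 5: Checking for Duplicate Values
df.nunique(): This function tells us how many unique values exist in each column, which provides insight into the variety of data in each feature. A column with far fewer unique values than rows contains many repeated values. Note that this counts distinct values per column rather than duplicated rows.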
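Counting distinct values per column is not quite the same as finding repeated rows. One common way to check for duplicate rows is `DataFrame.duplicated()`. The sketch below shows both side by side on a made-up DataFrame; `sample` and its values are illustrative assumptions, not data from the article.

```python
import pandas as pd

# Hypothetical stand-in data; the third row repeats the first one
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 11.2],
    "quality": [5, 5, 5, 7],
})

# Distinct values in each column
unique_per_column = sample.nunique()

# duplicated() flags rows that are identical to an earlier row
duplicate_rows = sample.duplicated().sum()

# drop_duplicates() returns a copy with the repeated rows removed
deduplicated = sample.drop_duplicates()

print(unique_per_column)
print("Duplicate rows:", duplicate_rows)
print("Rows after removing duplicates:", len(deduplicated))
```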
Python
print(df.nunique())
Output:
nunique()
Step 6: Univariate Analysis
Univariate analysis examines one variable at a time. Plotting the right chart for each feature helps us understand its distribution.
1. Bar Plot for counting the wines at each quality rating.
Python
quality_counts = df['quality'].value_counts()
plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
Output:
Bar Plot
Here, the bar plot shows the number of wines at each quality rating.
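The counts behind such a bar plot can also be inspected directly. As a rough sketch on made-up ratings (the `quality` Series here is a hypothetical stand-in, not the dataset column), `value_counts()` can be sorted by rating and normalized to show relative frequencies:

```python
import pandas as pd

# Hypothetical quality ratings; not values from WineQT.csv
quality = pd.Series([5, 6, 5, 7, 5, 6, 4], name="quality")

# value_counts() orders by frequency; sort_index() reorders by rating
counts = quality.value_counts().sort_index()

# normalize=True returns shares that sum to 1 instead of raw counts
shares = quality.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(2))
```

Sorting by index keeps the ratings in ascending order, which tends to read more naturally for an ordinal variable such as quality.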
2. Kernel density plot for understanding variance in the dataset
Python
sns.set_style("darkgrid")
# Select only the numerical columns
numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns
plt.figure(figsize=(14, len(numerical_columns) * 3))
# Draw one histogram with a KDE curve per feature, titled with its skewness
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")
plt.tight_layout()
plt.show()
Output:
Kernel density plot
A feature with a skewness close to 0 has a roughly symmetrical distribution. A skewness of 1 or above suggests a strongly positively skewed (right-skewed) distribution. In a right-skewed distribution the tail extends further to the right, which indicates the presence of extremely high values.
3. Swarm Plot for showing the outliers in the data
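To see how the skewness value behaves, the following sketch compares a symmetric column with a right-skewed one. The DataFrame `sample`, its column names and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one symmetric column and one with a single large value
sample = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 10],
})

# skew() returns the sample skewness of each numerical column
skewness = sample.skew()
print(skewness.round(2))
```

The symmetric column comes out at zero, while the single large value in the second column pulls its skewness well above 1.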
Python
plt.figure(figsize=(10, 8))
sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')
plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
Output:
Swarm Plot
This graph shows the swarm plot for the 'quality' and 'alcohol' columns. The higher point density in certain areas shows where most of the data points are concentrated. Points that are isolated and far from these clusters represent outliers, highlighting uneven values in the dataset.
Step 7: Bivariate Analysis
In bivariate analysis two variables are analyzed together to identify patterns, dependencies or interactions between them. This method helps in understanding how changes in one variable might affect another.
Let's visualize these relationships with several plots that show how the variables interact with each other.
1. Pair Plot for showing the distribution of the individual variables
Python
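sns.set_palette("Pastel1")
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df)
# y > 1 places the title above the grid instead of over the top row
plt.suptitle('Pair Plot for DataFrame', y=1.02)
plt.show()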
Output:
Pair Plot
- The plots on the diagonal are histograms (or kernel density plots) showing the distribution of each individual variable.
- The scatter plots in the lower triangle display the relationship between each pair of variables.
- The scatter plots above and below the diagonal mirror each other, since each pair of variables appears twice with its axes swapped.
- The peaks of the histograms show where the values of each variable are concentrated.
- Skewness can be judged by observing whether a histogram is symmetrical or skewed to the left or right.
2. Violin Plot for examining the relationship between alcohol and Quality.
Python
# Plot from a copy (plot_df) so the 'quality' column in df stays numeric for later steps
plot_df = df.assign(quality=df['quality'].astype(str))
plt.figure(figsize=(10, 8))
sns.violinplot(x="quality", y="alcohol", data=plot_df, palette={
    '3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen',
    '6': 'gold', '7': 'lightskyblue', '8': 'lightpink'}, alpha=0.7)
plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
Output:
Violin Plot
For interpreting the Violin Plot:
- A wider section shows higher density, meaning more data points at that value.
- A symmetrical shape shows a balanced distribution.
- A peak or bulge in the violin represents the most common values in the distribution.
- Longer tails show greater variability.
- The median marker inside the violin shows the middle value and helps in understanding the central tendency.
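The same comparison can be read off numerically. As a minimal sketch, assuming a made-up DataFrame named `sample` with `quality` and `alcohol` columns (the values are invented), a grouped summary gives the count, median and range of alcohol for each quality level:

```python
import pandas as pd

# Hypothetical stand-in data; not rows from WineQT.csv
sample = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 7],
    "alcohol": [9.4, 9.8, 9.6, 10.5, 11.0, 12.0],
})

# One row per quality level with summary statistics of alcohol
summary = sample.groupby("quality")["alcohol"].agg(["count", "median", "min", "max"])
print(summary)
```

A table like this complements the violin plot: the plot shows the shape of each distribution, while the table gives exact values for the center and range.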
3. Box Plot for examining the relationship between alcohol and Quality
Python
sns.boxplot(x='quality', y='alcohol', data=df)
Output:
Box Plot
- The box represents the interquartile range (IQR), i.e. the longer the box, the greater the variability.
- The median line in the box shows the central tendency.
- Whiskers extend from the box to the smallest and largest values within a specified range (by default, 1.5 × IQR beyond the box edges).
- Individual points beyond the whiskers represent outliers.
- A compact box shows low variability while a stretched box shows higher variability.
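The quantities a box plot draws can be computed by hand. The sketch below applies the 1.5 × IQR rule commonly used for box plot whiskers to a made-up Series (`alcohol` here is hypothetical data, not the dataset column) and lists the points that fall outside the fences.

```python
import pandas as pd

# Hypothetical alcohol values with one unusually high reading
alcohol = pd.Series([9.0, 9.5, 10.0, 10.5, 11.0, 15.0], name="alcohol")

# Quartiles and interquartile range (the height of the box)
q1 = alcohol.quantile(0.25)
q3 = alcohol.quantile(0.75)
iqr = q3 - q1

# Fences: values outside this range are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = alcohol[(alcohol < lower_fence) | (alcohol > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers.tolist())
```

The factor 1.5 is a convention rather than a fixed rule; a larger factor flags fewer points as outliers.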
Step 8: Multivariate Analysis
Multivariate analysis involves examining the interactions between three or more variables in a dataset at the same time. This approach aims to identify complex patterns and relationships, showing how multiple variables collectively behave and influence each other.
Here, we are going to show the multivariate analysis using a correlation matrix plot.
Python
plt.figure(figsize=(15, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)
plt.title('Correlation Heatmap')
plt.show()
Output:
Correlation Matrix
- Values close to +1 show a strong positive correlation, values close to -1 show a strong negative correlation and values near 0 suggest no linear correlation.
- With a sequential or diverging colormap, more intense colors signify stronger correlations while lighter colors represent weaker ones; the annotated values give the exact coefficients.
- Positively correlated variables move in the same direction. As one increases, the other also increases.
- Negatively correlated variables move in opposite directions. An increase in one variable is associated with a decrease in the other.
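The coefficients shown in a heatmap can also be ranked in code. As a rough sketch on invented data (the DataFrame `sample` and its values are assumptions, chosen so that one pair is perfectly negatively correlated), the correlation of each feature with `quality` can be sorted by strength:

```python
import pandas as pd

# Hypothetical stand-in data; density falls exactly as alcohol rises
sample = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "density": [0.999, 0.998, 0.997, 0.996, 0.995],
    "quality": [5, 5, 6, 6, 7],
})

# Pearson correlation matrix of all numerical columns
corr = sample.corr()

# Correlation of every other feature with quality, strongest first
target_corr = corr["quality"].drop("quality").sort_values(key=abs, ascending=False)

print(corr.round(2))
print(target_corr.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top, since they can be as informative as strong positive ones.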
With these insights from the EDA, we now understand the data better and are ready to explore more advanced modeling techniques.