DL_EDA_process
Exploratory Data Analysis (EDA) in machine learning is the process of analyzing and
visualizing datasets to understand their main characteristics, identify patterns, detect anomalies,
and check assumptions. EDA helps prepare data for modeling by revealing insights and guiding
feature selection, data transformation, and pre-processing steps. Here’s an overview of EDA’s
purpose and components:
● Data Summary: Get a high-level overview of the data, including the number of rows,
columns, and data types.
● Data Types and Format: Identify the data types (e.g., numerical, categorical) to
determine which statistical or visualization techniques to apply.
● Null Values: Check for missing values, for example by chaining .isna() and .sum() on a
pandas DataFrame to count them per column, and decide on strategies for handling them
(imputation, deletion, etc.).
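The three checks above can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed workflow: the DataFrame `df` and its column names (`age`, `income`, `city`) are hypothetical toy data made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [48000.0, 61000.0, 55000.0, np.nan, 52000.0],
    "city": ["Pune", "Delhi", "Delhi", None, "Mumbai"],
})

# Data summary: number of rows and columns.
n_rows, n_cols = df.shape

# Data types: one dtype per column.
dtypes = df.dtypes

# Null values: .isna() marks missing cells, .sum() counts them per column.
missing_per_column = df.isna().sum()

# Split columns by broad type to decide which techniques apply to each.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(dtypes)
print(missing_per_column)
print(numeric_cols, categorical_cols)
```

On a real dataset, `df.info()` and `df.head()` give a similar overview in one call each.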
2. Descriptive Statistics
● Central Tendency: Examine measures like mean, median, and mode to understand the
central values of each feature.
● Spread and Range: Check variance, standard deviation, and range to understand how
data points are spread out.
● Distribution: Visualize distributions using histograms, box plots, or density plots to spot
skewness, heavy or light tails (kurtosis), and outliers.
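As a rough sketch of these measures, the snippet below computes them for a single numeric feature with pandas. The series `values` is hypothetical sample data, chosen so that one large value produces a visible right skew.

```python
import pandas as pd

# Hypothetical sample; the single large value (40) creates a right skew.
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 40], name="feature")

# Central tendency.
mean = values.mean()
median = values.median()
mode = values.mode().iloc[0]   # mode() returns a Series; take the first entry

# Spread and range. pandas uses the sample formulas (ddof=1) by default.
variance = values.var()
std_dev = values.std()
value_range = values.max() - values.min()

# Shape of the distribution.
skewness = values.skew()       # a positive value suggests a longer right tail
kurt = values.kurt()           # excess kurtosis (0 for a normal distribution)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} std={std_dev:.2f} range={value_range}")
print(f"skewness={skewness:.2f} kurtosis={kurt:.2f}")
```

Here the mean sits well above the median, which is the usual numeric hint of right skew. If matplotlib is installed, `values.plot(kind="hist")` or `values.plot(kind="box")` shows the same thing visually.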
3. Data Relationships
● Outlier Detection: Use box plots, scatter plots, and z-scores to detect unusual data
points that may need addressing.
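One way to flag such points numerically is sketched below. The z-score rule comes from the text above. The interquartile-range (IQR) rule is added here as the numeric counterpart of a box plot's whiskers. The data and both cutoffs (3 standard deviations, 1.5 × IQR) are assumptions for illustration, not fixed rules.

```python
import pandas as pd

# Hypothetical measurements with one obviously unusual value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Z-score rule: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]          # assumed cutoff of 3

# IQR rule: the same fences a box plot uses for its whiskers.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr                     # assumed multiplier of 1.5
upper_fence = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]

print(z_outliers.tolist())
print(iqr_outliers.tolist())
```

Both rules flag only the value 95 in this sample. The z-score rule is fragile on small datasets: with 12 points the largest attainable z-score is about 3.18, so the cutoff of 3 only just triggers. The IQR rule does not depend on the mean or standard deviation, which the outlier itself inflates.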
5. Feature Engineering Insights
6. Data Cleaning
● Address missing values, incorrect or inconsistent data, and outliers based on insights
gained from EDA.
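A minimal cleaning pass might look like the following. The strategies shown (median and mode imputation, text normalization, capping at the IQR fences) are only one reasonable set of choices, and the DataFrame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing age, an implausible age (300),
# a missing city, and inconsistent spelling of the same city.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 300.0],
    "city": ["Delhi", "delhi ", "Mumbai", None, "Mumbai", "Delhi"],
})

cleaned = df.copy()  # keep the raw data untouched

# Inconsistent data: normalize whitespace and capitalization.
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Missing values: median for the numeric column, most frequent value
# for the categorical column.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna(cleaned["city"].mode().iloc[0])

# Outliers: cap values at the IQR fences instead of deleting rows.
q1, q3 = cleaned["age"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned["age"] = cleaned["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(cleaned)
```

Capping keeps every row, which matters on small datasets. Dropping rows with `dropna()` or filtering out the outliers is the alternative when the affected records are clearly erroneous.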
Overall, EDA is a foundational step in the machine learning pipeline, setting the stage for more
reliable, accurate models.