data_preprocess_steps

The document outlines a step-by-step process for data preprocessing and exploratory analysis, starting with reading data and cleaning it by removing unnecessary columns and checking data types. It emphasizes the importance of transforming categorical and numerical data appropriately, handling duplicates and missing values, and conducting univariate and bivariate analyses to explore relationships between variables. Finally, it discusses feature selection methods to evaluate the importance of individual and multiple features in relation to the target variable.


Below are notes I have collected from various sources, organized as a step-by-step

summary:

1. Read the data; check its shape and a sample of rows; remove unnecessary
columns (e.g., ID, Name); run describe(), info(), and nunique() (see the sketch
below).
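
A minimal pandas sketch of step 1; the file name data.csv and the ID and Name
columns are hypothetical stand-ins:

    import pandas as pd

    df = pd.read_csv('data.csv')          # hypothetical file name

    print(df.shape)                       # (rows, columns)
    print(df.sample(5))                   # a few random rows

    # drop identifier-like columns that carry no predictive signal
    df = df.drop(columns=['ID', 'Name'])  # hypothetical column names

    print(df.describe())                  # summary statistics for numeric columns
    df.info()                             # dtype and non-null count per column
    print(df.nunique())                   # number of unique values per column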
2. Check the datatype of every column against a sample value from that column.
    * numerical columns should be int64 / float64 and categorical columns
should be category dtype.
    * if the datatype is object for a numerical column, convert it with
pd.to_numeric(errors='coerce'), which replaces the non-numeric values with NaN.
    * categorical columns in X (the independent variables) should be converted
to numbers using one-hot encoding, and categorical values in y (the target
variable) should also be converted to numeric using manual replacement or label
encoding. Note, however, that label encoding assigns an arbitrary order to the
classes; a model that treats those codes as ordered gives some classes more
weight than others, biasing the predictions. (These conversions are sketched
after the definitions below.)

    * Data type ---> best transformation (and why):
        Ordinal data (e.g., Education Level: Low < Medium < High) --->
LabelEncoder or .astype('category'): keeps the order intact.
        Nominal data (e.g., Origin: USA, Europe, Asia) ---> one-hot encoding
(pd.get_dummies()): prevents false ordinal relations.
        Tree models (e.g., Random Forest, XGBoost) ---> leave as is (or
LabelEncoder): tree models handle categories naturally.

    Nominal Data – Nominal data is a basic data type that categorizes data by
labeling or naming values, such as gender, hair color, or type of animal. It
does not have any hierarchy.
    Ordinal Data – Ordinal data involves classifying data based on rank, such
as social status in categories like 'wealthy', 'middle income', or 'poor'.
However, there are no set intervals between these categories.
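
A minimal sketch of these conversions; the column names horsepower, origin, and
target are hypothetical stand-ins:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # object column that should be numeric: non-numeric entries become NaN
    df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')

    # nominal feature in X: one-hot encode to avoid false ordinal relations
    X = pd.get_dummies(df.drop(columns=['target']), columns=['origin'])

    # target y: manual replace keeps the class-to-number mapping explicit ...
    y = df['target'].replace({'no': 0, 'yes': 1})

    # ... while LabelEncoder assigns integer codes in arbitrary (alphabetical) order
    y_encoded = LabelEncoder().fit_transform(df['target'])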

    When you encounter a column in a DataFrame with numerical values like 1, 2,
and 3, how you handle it depends on whether these values represent categorical
or continuous data:

    ✅ 1. If the values represent categories (e.g., Education Level: 1 = High
School, 2 = Bachelor's, 3 = Master's):
        In this case, the numbers are simply labels for different groups, and
treating them as continuous variables would be incorrect. You should convert
them into categorical variables, as in the sketch below:
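
A minimal sketch, using a hypothetical education_level column coded 1/2/3:

    # treat the integer codes as categories, not as a continuous quantity
    df['education_level'] = df['education_level'].astype('category')

    # optionally attach readable labels so plots and group-bys explain themselves
    df['education_level'] = df['education_level'].cat.rename_categories(
        {1: 'High School', 2: "Bachelor's", 3: "Master's"})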
3. Check for duplicates (duplicates = df[df.duplicated()]) and drop them with
df.drop_duplicates(keep='first', ignore_index=True, inplace=True).
4. Check for missing values and handle them with SimpleImputer: use the median
for continuous values when outliers are present (to spot outliers, compare the
mean and the 50% row in df.describe(); a large gap suggests outliers) and the
mode for categorical values.
    * sometimes missing values come in the form of zeros (e.g., blood pressure
can't be zero for a person); handle these with SimpleImputer(missing_values=0),
as in the sketch below.
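
A minimal SimpleImputer sketch under those rules; the columns income, gender,
and blood_pressure are hypothetical:

    from sklearn.impute import SimpleImputer

    # median is robust to outliers in a continuous column
    df[['income']] = SimpleImputer(strategy='median').fit_transform(df[['income']])

    # mode (most frequent value) for a categorical column
    df[['gender']] = SimpleImputer(strategy='most_frequent').fit_transform(df[['gender']])

    # zeros that really mean "missing" (blood pressure can't be zero)
    df[['blood_pressure']] = SimpleImputer(
        missing_values=0, strategy='median').fit_transform(df[['blood_pressure']])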
5. Exploratory Analysis:
    Univariate Analysis:
        Examine individual variables using descriptive statistics (mean,
median, mode, standard deviation) and visualizations (for continuous:
histograms, box plots, violin plots; for categorical/discrete: count plots,
bar plots, pie charts).
Descriptive statistics: Calculating measures of central tendency (e.g.,
mean, median) and measures of dispersion (e.g., standard deviation, interquartile
range) to summarize the distribution of a variable.
Frequency distributions: Visualizing the frequency or proportion of
each unique value or category within a variable using histograms, bar plots, or pie
charts.
Outlier detection: Identifying extreme or unusual values that deviate
significantly from the rest of the data using various statistical methods, such as
box plots, scatter plots, or Z-scores.
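
A minimal seaborn sketch of the univariate plots above, assuming a continuous
mpg column and a categorical origin column (both hypothetical):

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.histplot(df['mpg'])              # distribution of a continuous variable
    plt.show()

    sns.boxplot(x=df['mpg'])             # median, quartiles, and outliers at a glance
    plt.show()

    sns.countplot(x='origin', data=df)   # frequency of each category
    plt.show()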

    Bivariate Analysis & Multivariate Analysis:
        categorical v/s categorical ---> heatmap, stacked bar plot, grouped bar plot
        categorical v/s continuous ---> box plot
        continuous v/s continuous ---> pair plot, scatter plot
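
A minimal sketch showing one plot for each pairing above (column names
hypothetical):

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # categorical v/s categorical: heatmap of a crosstab
    sns.heatmap(pd.crosstab(df['origin'], df['education_level']), annot=True)
    plt.show()

    # categorical v/s continuous: box plot per category
    sns.boxplot(x='origin', y='mpg', data=df)
    plt.show()

    # continuous v/s continuous: scatter plot
    sns.scatterplot(x='weight', y='mpg', data=df)
    plt.show()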

        Explore relationships among multiple variables using techniques such
as seaborn's pair plot and heatmap (see the sketch after this list).
        Correlation analysis: Measuring the strength and direction of the
relationship between pairs of variables using correlation coefficients
(Pearson's correlation for linear relationships, Spearman's rank correlation
for monotonic ones).
Scatter plots and pair plots: Visualizing the relationships between
pairs of variables using scatter plots or pair plots, which can help identify
patterns, clusters, or potential multicollinearity issues.
Dimensionality reduction: Techniques like Principal Component Analysis
(PCA) or t-SNE (t-Distributed Stochastic Neighbor Embedding) can be used to reduce
the dimensionality of the data and visualize the relationships between multiple
variables in a lower-dimensional space.
Clustering analysis: Identifying natural groupings or clusters within
the data based on similarities or dissimilarities between observations, which can
reveal underlying patterns or structures.
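
A minimal sketch of the correlation and dimensionality-reduction steps above,
run on whatever numeric columns df happens to have:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    num = df.select_dtypes(include='number').dropna()

    # Pearson by default; pass method='spearman' for rank correlation
    sns.heatmap(num.corr(), annot=True, cmap='coolwarm')
    plt.show()

    # standardize, then project onto the first two principal components
    scaled = StandardScaler().fit_transform(num)
    components = PCA(n_components=2).fit_transform(scaled)
    plt.scatter(components[:, 0], components[:, 1])
    plt.show()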

6. Feature Selection and Importance


    Univariate feature selection: Evaluating the relationship between each
individual feature and the target variable using statistical tests (e.g.,
chi-square, ANOVA) or measures of correlation or mutual information (see the
sketch after this list).
Multivariate feature selection: Considering the joint effects of multiple
features on the target variable using techniques like recursive feature
elimination, regularization methods (e.g., Lasso, Ridge), or tree-based importance
measures.
Feature importance visualization: Creating plots or charts that visualize the
relative importance or contribution of each feature to the target variable or model
performance.
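
A minimal scikit-learn sketch of the three approaches above, assuming X is an
all-numeric feature DataFrame and y is the encoded target:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    # univariate: score each feature against the target with the ANOVA F-test
    selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
    print(pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False))

    # multivariate: recursive feature elimination with a linear model
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
    print(X.columns[rfe.support_])

    # tree-based importances, visualized as a bar chart
    importances = pd.Series(
        RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
        index=X.columns)
    importances.sort_values().plot(kind='barh')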
