data_preprocess_steps
1. Read the data, check its shape, view a sample, and remove unnecessary columns
(e.g. ID, Name). Then summarize it with describe(), info(), and nunique().
2. Check the datatype of every column against a sample value from that column.
* Numerical columns should be int64 / float64, and categorical columns should be
a categorical type (category, or object before conversion).
* If a numerical column has datatype object, convert it with
pd.to_numeric(col, errors="coerce"), which replaces non-numerical values with NaN.
* Categorical columns in X (independent variables) should be converted to numbers
using one-hot encoding. Categorical values in Y (target variable) should also be
converted to numeric, using a manual replace or label encoding. Note that label
encoding assigns arbitrary integers, which implies a hierarchy among the classes
that does not exist. A model may then give more importance to some classes and
produce biased predictions.
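Steps 1 and 2 can be sketched roughly as follows with pandas. This is a minimal illustration, not a fixed recipe: the toy DataFrame, its column names (ID, Name, age, city, bought), and the yes/no mapping are all assumptions made up for the example. In practice the data would come from a file read such as pd.read_csv.

```python
import pandas as pd

# Hypothetical toy data standing in for a real file read.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "age": ["25", "31", "unknown", "40"],  # numerical column stored as object
    "city": ["Pune", "Delhi", "Pune", "Goa"],
    "bought": ["yes", "no", "yes", "no"],  # target variable (Y)
})

# Step 1: inspect the data and drop identifier columns.
print(df.shape)
print(df.sample(2, random_state=0))
df = df.drop(columns=["ID", "Name"])
df.info()
print(df.describe(include="all"))
print(df.nunique())

# Step 2: compare each column's dtype with a sample value.
print(df.dtypes)
print(df.iloc[0])

# Convert the object-typed numerical column; errors="coerce" is what
# turns non-numerical values such as "unknown" into NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Split features and target, then encode each.
X = df.drop(columns=["bought"])
y = df["bought"].map({"no": 0, "yes": 1})  # manual replace for the target

# One-hot encode the categorical feature column in X.
X = pd.get_dummies(X, columns=["city"], dtype=int)
print(X)
print(y)
```

Without errors="coerce", pd.to_numeric raises an error on values it cannot parse, so the NaN behaviour described above depends on that argument. The NaN values it produces still need to be handled afterwards.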
Nominal Data – Nominal data is a basic data type that categorizes data by
labeling or naming values, such as gender, hair color, or type of animal. It does
not have any hierarchy.
Ordinal Data – Ordinal data classifies data by rank, such as
social status in categories like ‘wealthy’, ‘middle income’, or ‘poor’. However,
there are no set intervals between these categories.
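The nominal/ordinal distinction affects how a column is encoded. The sketch below is one possible approach, assuming hypothetical columns hair_color (nominal) and income (ordinal): the ordinal column gets integer codes in an explicitly stated rank order, while the nominal column is one-hot encoded so that no order is implied.

```python
import pandas as pd

# Hypothetical example columns, invented for illustration.
df = pd.DataFrame({
    "hair_color": ["black", "brown", "black", "red"],        # nominal
    "income": ["poor", "wealthy", "middle income", "poor"],  # ordinal
})

# Ordinal: state the rank order explicitly instead of letting an
# encoder assign integers arbitrarily.
income_order = ["poor", "middle income", "wealthy"]
df["income"] = pd.Categorical(df["income"], categories=income_order, ordered=True)
df["income_code"] = df["income"].cat.codes  # poor=0, middle income=1, wealthy=2

# Nominal: one-hot encode, so no category ranks above another.
encoded = pd.get_dummies(df, columns=["hair_color"], dtype=int)
print(encoded)
```

The integer codes preserve rank only. Since ordinal data has no set intervals, the equal spacing between 0, 1, and 2 is an artifact of the encoding and not a property of the data.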