Machine Learning / Deep Learning in Practice: the Kaggle House Price Prediction Competition (Data Preprocessing)

留小星

Last modified 2023-08-09 21:56:17

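License: CC 4.0 BY-SA

Category: Dive into Deep Learning: PyTorch. Tags: data preprocessing, data analysis, deep learning, house price prediction, sklearn

First published 2021-08-04 11:18:24

Article link: https://blog.csdn.net/jerry_liufeng/article/details/119379598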

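This article walks through the data preprocessing steps for the house price prediction project: feature selection, filling missing values in numerical and non-numerical data, feature conversion and encoding, and merging of rare feature categories, ending with the split into training and test sets.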

Contents

- 2. Data preprocessing (feature encoding)

Data download: Baidu Netdisk, extraction code: w2t6

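2. Data preprocessing (feature encoding)

2.1 Dropping features

1) Id
2) Features with too many missing values
3) Features that are mostly 0, null, or some other single value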

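# Id is not a feature, so drop it; the training and test data are combined so that they are preprocessed together
# Label is a column added earlier to tell training rows from test rows; it is not needed for training, so drop it too
# Features such as Alley and Fence fall under the cases above: too many missing values, or mostly 0/null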
drop_columns = ["Id", "Alley", "Fence", "LotFrontage", "FireplaceQu", "PoolArea", "LowQualFinSF", "3SsnPorch", "MiscVal", 'RoofMatl','Street','Condition2','Utilities','Heating','Label']
print("Number of columns before dropping : ",len(combined_df.columns))
print("Number of dropping columns : ",len(drop_columns))
combined_df.drop(columns=drop_columns, inplace=True, errors='ignore')
print("Number of columns after dropping : ",len(combined_df.columns))

Number of columns before dropping :  82
Number of dropping columns :  15
Number of columns after dropping :  67
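
The three criteria above can also be checked programmatically instead of by eye. The following is a minimal sketch, not the method used in this article: the helper name `find_drop_candidates`, the thresholds `max_missing` and `max_dominant`, and the toy DataFrame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def find_drop_candidates(df, max_missing=0.5, max_dominant=0.9):
    """Return columns that are mostly missing or dominated by a single value.

    Hypothetical helper: both thresholds are illustrative assumptions.
    """
    candidates = []
    for col in df.columns:
        # Fraction of rows where the column is missing
        missing_ratio = df[col].isnull().mean()
        # Share of rows taken by the most frequent value (NaN counted as a value)
        dominant_ratio = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if missing_ratio > max_missing or dominant_ratio > max_dominant:
            candidates.append(col)
    return candidates


# Toy data, not the real competition data
toy_df = pd.DataFrame({
    "Alley": [np.nan] * 18 + ["Grvl", "Pave"],  # 90% missing
    "PoolArea": [0] * 19 + [512],               # 95% zeros
    "LotArea": list(range(8000, 8020)),         # every value differs
})

print(find_drop_candidates(toy_df))
```

A list produced this way is only a starting point; whether a flagged column is really uninformative still has to be judged from the data analysis.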

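2.2 Modifying time-related features (reducing the magnitude of the feature values)

(Subtract the year built, year remodeled, and year the garage was built from the year sold, so that each feature becomes an age in years.)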

for feature in ['YearBuilt','YearRemodAdd','GarageYrBlt']:
    combined_df[feature]=combined_df['YrSold']-combined_df[feature]
combined_df[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()

	YearBuilt	YearRemodAdd	GarageYrBlt
0	5	5	5.00
1	31	31	31.00
2	7	6	7.00
3	91	36	8.00
4	8	8	8.00
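
Two edge cases of this subtraction are worth checking on a small example: a missing year stays missing, and a year recorded later than the sale year gives a negative age. The sketch below uses made-up rows (the values are assumptions, not taken from the competition data).

```python
import numpy as np
import pandas as pd

# Toy rows: the second house has no garage year, the third has a garage
# year recorded after the sale year
toy_df = pd.DataFrame({
    "YrSold": [2008, 2007, 2006],
    "YearBuilt": [2003, 1976, 2006],
    "GarageYrBlt": [2003.0, np.nan, 2007.0],
})

for feature in ["YearBuilt", "GarageYrBlt"]:
    toy_df[feature] = toy_df["YrSold"] - toy_df[feature]

print(toy_df)
```

The missing garage age is handled later by the missing-value filling step; a negative age usually points to a data-entry problem and may deserve a separate look.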

2.3 Filling missing values

2.3.1 Filling numerical data

There are several ways to fill missing values: zeros, the median, the mean, and so on.

for col in null_features_numerical:
    if col not in drop_columns:
#         combined_df[col] = combined_df[col].fillna(combined_df[col].mean()) # fill with the mean
#         combined_df[col] = combined_df[col].fillna(combined_df[col].median()) # fill with the median
        combined_df[col] = combined_df[col].fillna(0.0) # simply fill with 0
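
To see how the three strategies differ, here is a small sketch on a made-up column (the values and the name `MasVnrArea` are only used for illustration). With skewed data the mean and the median can be far apart, so the choice matters.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry
s = pd.Series([0.0, 0.0, 100.0, np.nan, 300.0], name="MasVnrArea")

filled_zero = s.fillna(0.0)            # missing -> 0
filled_median = s.fillna(s.median())   # missing -> median of the observed values
filled_mean = s.fillna(s.mean())       # missing -> mean of the observed values

print(pd.DataFrame({"zero": filled_zero, "median": filled_median, "mean": filled_mean}))
```

Filling with 0 is a reasonable reading when a missing value means "this house does not have the item" (for example no basement or no garage); for measurements that exist but were not recorded, the median is often the safer choice.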

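2.3.2 Filling non-numerical data

There are two ways to fill (chosen mainly on the basis of the earlier data analysis):

1) Fill with 'NA' directly, which is equivalent to adding a new category
2) Fill with the most frequent category, which assigns the missing entries to the main category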

null_features_categorical = [col for col in combined_df.columns if combined_df[col].isnull().sum() > 0 and col in categorical_features]

# Fill these features with their most frequent category
cat_feature_mode = ["SaleType", "Exterior1st", "Exterior2nd", "KitchenQual", "Electrical", "Functional"]

for col in null_features_categorical:
    if col != 'MSZoning' and col not in cat_feature_mode:
        combined_df[col] = combined_df[col].fillna('NA')
    else:
        combined_df[col] = combined_df[col].fillna(combined_df[col].mode()[0])
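
The two strategies can be compared on a toy frame. This is a sketch under the assumption that a missing `GarageType` means "no garage" (so it gets its own category), while a missing `KitchenQual` is a recording gap (so it takes the most frequent value); the rows are made up.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "KitchenQual": ["TA", "Gd", np.nan, "TA"],
})

# Strategy 1: treat "missing" as a category of its own
toy_df["GarageType"] = toy_df["GarageType"].fillna("NA")

# Strategy 2: assign missing entries to the most frequent category
toy_df["KitchenQual"] = toy_df["KitchenQual"].fillna(toy_df["KitchenQual"].mode()[0])

print(toy_df)
```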

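2.4 Converting some numerical features to non-numerical features

Although MSSubClass is stored as a number, it takes only a small number of distinct values, so it is converted to a categorical feature.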

# Convert "numerical" feature to categorical
convert_list = ['MSSubClass']
for col in convert_list:
    combined_df[col] = combined_df[col].astype('str')
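
The practical effect of this conversion shows up at the encoding step: `pd.get_dummies` leaves numeric columns untouched but one-hot encodes string columns. A minimal sketch on made-up rows:

```python
import pandas as pd

toy_df = pd.DataFrame({"MSSubClass": [20, 60, 20], "LotArea": [8450, 9600, 11250]})

# As a number, MSSubClass passes through get_dummies unchanged
as_numeric = pd.get_dummies(toy_df)

# As a string, each distinct code becomes its own indicator column
toy_df["MSSubClass"] = toy_df["MSSubClass"].astype("str")
as_category = pd.get_dummies(toy_df)

print(list(as_numeric.columns))
print(list(as_category.columns))
```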

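2.5 Applying PowerTransformer to some continuous features to make them more Gaussian

Some of the continuous features are clearly nonlinear (skewed), so they need to be transformed.
First, check the skewness of these continuous features: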

# skew is provided by scipy.stats
from scipy.stats import skew

# get the features except object types
numeric_features = combined_df.dtypes[combined_df.dtypes != 'object'].index

# check the skewness of all numerical features
skewed_features = combined_df[numeric_features].apply(lambda x:skew(x.dropna())).sort_values(ascending=False)

print('\n Skew in numerical features: \n')
skewness_df = pd.DataFrame({'Skew' : skewed_features})
print(skewness_df.head(10))

 Skew in numerical features: 

               Skew
LotArea       12.82
KitchenAbvGr   4.30
BsmtFinSF2     4.15
EnclosedPorch  4.00
ScreenPorch    3.95
BsmtHalfBath   3.93
MasVnrArea     2.61
OpenPorchSF    2.54
WoodDeckSF     1.84
1stFlrSF       1.47

from sklearn.preprocessing import PowerTransformer

# Apply PowerTransformer to the selected columns
log_list = ['BsmtUnfSF', 'LotArea', '1stFlrSF', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']

for col in log_list:
    power = PowerTransformer(method='yeo-johnson', standardize=True)
    combined_df[[col]] = power.fit_transform(combined_df[[col]]) # fit with combined_data to avoid overfitting with training data?

print('Number of skewed numerical features got transform : ', len(log_list))

Number of skewed numerical features got transform :  6
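
The comment in the code above asks whether the transformer should be fitted on the combined data. A common alternative is to fit it on the training rows only and reuse the fitted transformer on the test rows, so that no test-set statistics enter the preprocessing. The sketch below shows this on synthetic, right-skewed data; the lognormal sample and the 800/200 split are assumptions for illustration, not the competition data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (stand-in for an area-like column)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=0.8, size=(1000, 1))
train, test = values[:800], values[800:]

power = PowerTransformer(method="yeo-johnson", standardize=True)
train_t = power.fit_transform(train)  # learn lambda, mean and std on the training rows
test_t = power.transform(test)        # reuse the fitted parameters on the test rows

print("skew before:", skew(train.ravel()))
print("skew after :", skew(train_t.ravel()))
```

Because `standardize=True`, the transformed training column also has zero mean and unit variance, so no separate standardization is needed for these columns.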

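2.6 Merging rare categories within non-numerical features

In some object-type features, a few categories make up a very small share compared with the main ones, so it is worth merging these minor categories. For example, the Fa and Po categories of HeatingQC are very rare, so both are relabeled as 'Other'.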

# Regroup features
# The categories listed below make up a very small share of their feature, so they can be relabeled as 'Other'
regroup_dict = {
#     'LotConfig': ['FR2','FR3'],
#     'LandSlope':['Mod','Sev'],
#     'BldgType':['2FmCon','Duplex'],
#     'RoofStyle':['Mansard','Flat','Gambrel'],
#     'Electrical':['FuseF','FuseP','FuseA','Mix'],
#     'SaleCondition':['Abnorml','AdjLand','Alloca','Family'],
#     'BsmtExposure':['Min','Av'],
#     'Functional':['Min1','Maj1','Min2','Mod','Maj2','Sev'],
#     'LotShape':['IR2','IR3'],
    'HeatingQC':['Fa','Po'],
    # 'FireplaceQu':['Fa','Po'],
    'GarageQual':['Fa','Po'],
    'GarageCond':['Fa','Po'],
}
 

for col, regroup_value in regroup_dict.items():
    mask = combined_df[col].isin(regroup_value)
    # use .loc: chained assignment (combined_df[col][mask] = ...) may not modify combined_df
    combined_df.loc[mask, col] = 'Other'
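
Instead of listing the rare categories by hand, they can also be picked by a frequency threshold. This is a sketch of that alternative, not the approach used in this article: the helper `regroup_rare`, the 5% threshold, and the toy counts are assumptions.

```python
import pandas as pd


def regroup_rare(series, min_share=0.05, other_label="Other"):
    """Replace categories whose share is below min_share with other_label.

    Hypothetical helper for illustration.
    """
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; replace the rest with the label
    return series.where(~series.isin(rare), other_label)


# Toy column: Fa and Po together make up 3% of the rows
heating_qc = pd.Series(["Ex"] * 50 + ["TA"] * 30 + ["Gd"] * 17 + ["Fa"] * 2 + ["Po"] * 1)
regrouped = regroup_rare(heating_qc)

print(regrouped.value_counts())
```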

2.7 Encoding non-numerical features

# Generate one-hot dummy columns
combined_df = pd.get_dummies(combined_df).reset_index(drop=True)
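
Encoding the combined frame, rather than the training and test sets separately, guarantees that both end up with the same columns even when a category appears in only one of them. A minimal sketch with made-up rows:

```python
import pandas as pd

train = pd.DataFrame({"HeatingQC": ["Ex", "TA", "Gd"], "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"HeatingQC": ["Ex", "Other"], "LotArea": [11622, 14267]})

# Encoded separately, the two sets get different dummy columns
separate_train_cols = set(pd.get_dummies(train).columns)
separate_test_cols = set(pd.get_dummies(test).columns)

# Encoded together and split afterwards, the columns match
combined = pd.get_dummies(pd.concat([train, test])).reset_index(drop=True)
enc_train = combined.iloc[:len(train)]
enc_test = combined.iloc[len(train):]

print(sorted(separate_train_cols), sorted(separate_test_cols))
print(list(combined.columns))
```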

2.8 Splitting the training and test data

new_train_data = combined_df.iloc[:len(train_data), :]
new_test_data = combined_df.iloc[len(train_data):, :]
X_train = new_train_data.drop('SalePrice', axis=1)
y_train = np.log1p(new_train_data['SalePrice'].values.ravel())
X_test = new_test_data.drop('SalePrice', axis=1)

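from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Scale the features with sklearn's RobustScaler, which uses statistics that are robust to outliers (the median and the interquartile range)
preprocessing_pipeline = make_pipeline(RobustScaler(),
                                       # VarianceThreshold(0.001),
                                       )

X_train = preprocessing_pipeline.fit_transform(X_train)
X_test = preprocessing_pipeline.transform(X_test)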

print(X_train.shape)
print(X_test.shape)
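
Two details of this last step can be checked on small arrays. With its default settings, RobustScaler computes (x - median) / IQR per column, so a single outlier barely affects how the other values are scaled. And because the target was transformed with `np.log1p`, predictions made on that scale have to be mapped back with `np.expm1`. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with an obvious outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = RobustScaler().fit_transform(x)

# The same computation by hand: subtract the median, divide by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# Target transform and its inverse
prices = np.array([208500.0, 181500.0])
y = np.log1p(prices)      # scale used for training
recovered = np.expm1(y)   # back to the price scale

print(scaled.ravel())
print(recovered)
```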