Predicting House Prices with Regression Algorithms - Kaggle
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
The feature engineering is rather parsimonious (at least compared to some other great scripts). It boils down to the following (a combined sketch of these steps follows the list):
- Imputing missing values by proceeding sequentially through the data
- Transforming some numerical variables that are really categorical
- Label-encoding some categorical variables whose ordering may carry information
- **Box-Cox transformation of skewed features (instead of a log transformation)**: this gave a slightly better result on both the leaderboard and in cross-validation.
- Getting dummy variables for categorical features.
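A minimal pandas/sklearn sketch of these steps. The file paths, the example columns, the skew threshold of 0.75, and the Box-Cox lambda of 0.15 are illustrative assumptions, not necessarily the kernel's exact choices:

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew
from sklearn.preprocessing import LabelEncoder

# Assumed paths to the competition files; train and test are combined so
# every preprocessing step is applied consistently to both.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
all_data = pd.concat((train.drop("SalePrice", axis=1), test)).reset_index(drop=True)

# 1. Impute missing values column by column: "None" where absence is
#    meaningful, a per-neighborhood median where a number is expected.
for col in ("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"):
    all_data[col] = all_data[col].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

# 2. Recast numeric columns that are really categorical codes.
all_data["MSSubClass"] = all_data["MSSubClass"].astype(str)

# 3. Assign integer codes to ordinal categoricals (LabelEncoder's codes are
#    alphabetical, a rough proxy for the quality ordering).
for col in ("ExterQual", "ExterCond", "KitchenQual"):
    all_data[col] = LabelEncoder().fit_transform(all_data[col].astype(str))

# 4. Box-Cox transform heavily skewed numeric features (boxcox1p handles zeros).
numeric = all_data.select_dtypes(include=[np.number]).columns
skewness = all_data[numeric].apply(lambda x: skew(x.dropna()))
for col in skewness[skewness.abs() > 0.75].index:
    all_data[col] = boxcox1p(all_data[col], 0.15)

# 5. One-hot encode the remaining categoricals.
all_data = pd.get_dummies(all_data)
```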
Then we choose many base models (mostly sklearn-based models, plus the sklearn APIs of DMLC's XGBoost and Microsoft's LightGBM), cross-validate them on the data, and then stack/ensemble them. The key here is to make the (linear) models robust to outliers, which improved the result on both the LB and in cross-validation.
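A minimal sketch of this setup, with illustrative (not necessarily the kernel's exact) hyperparameters. Wrapping the linear models in a `RobustScaler` pipeline is one way to blunt the effect of outliers, and the simple averaging class below stands in for the kernel's fuller stacking approach:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

def rmsle_cv(model, X, y, n_folds=5):
    """Cross-validated RMSE; equals RMSLE when y is the log-transformed target."""
    kf = KFold(n_folds, shuffle=True, random_state=42)
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(mse)

# RobustScaler centers/scales with median and IQR, so the regularized linear
# models are less dominated by outlying rows.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
enet = make_pipeline(RobustScaler(),
                     ElasticNet(alpha=0.0005, l1_ratio=0.9, random_state=3))
krr = KernelRidge(alpha=0.6, kernel="polynomial", degree=2, coef0=2.5)

class AveragingModels(BaseEstimator, RegressorMixin):
    """Ensemble by averaging the predictions of independently fitted models."""
    def __init__(self, models):
        self.models = models
    def fit(self, X, y):
        self.models_ = [clone(m) for m in self.models]
        for m in self.models_:
            m.fit(X, y)
        return self
    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models_], axis=0)

# Usage: rmsle_cv(AveragingModels([lasso, enet, krr]), X, y).mean()
```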
To my surprise, this does well on the LB (0.11420, top 4% the last time I tested it: July 2, 2017).
1. Examine the Target Variable
The first step is to examine the distribution of the target variable, SalePrice.
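A minimal sketch of that first look, assuming the training file sits at `train.csv` (SalePrice is the competition's target):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

train = pd.read_csv("train.csv")  # assumed path to the competition training file

# Raw SalePrice is right-skewed; inspect it with a histogram + KDE and a
# fitted normal distribution.
sns.histplot(train["SalePrice"], kde=True)
mu, sigma = stats.norm.fit(train["SalePrice"])
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, skew = {train['SalePrice'].skew():.2f}")

# log1p pulls the distribution much closer to normal, which suits the linear
# models and matches the competition's log-based (RMSLE) metric.
train["SalePrice"] = np.log1p(train["SalePrice"])
sns.histplot(train["SalePrice"], kde=True)
plt.show()
```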