Definition
Feature engineering:
- (From Wikipedia) the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Common feature engineering techniques
Outlier handling
Box plot (or 3-sigma) analysis
Remove outliers using box plot (or 3-sigma) analysis.
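The notes name 3-sigma as an alternative to the box plot but give no code for it. Below is a minimal sketch, assuming the usual rule of flagging values that lie more than three standard deviations from the mean. The function name `three_sigma_mask`, the parameter `n_sigma`, and the sample data are hypothetical and do not come from the original post.

```python
import numpy as np
import pandas as pd


def three_sigma_mask(values: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where a value lies more than
    n_sigma standard deviations away from the mean."""
    mean = values.mean()
    std = values.std()  # pandas default: sample standard deviation (ddof=1)
    return (values - mean).abs() > n_sigma * std


# Hypothetical data: 200 roughly normal points plus one extreme value.
rng = np.random.default_rng(0)
sample = pd.Series(np.append(rng.normal(loc=50.0, scale=5.0, size=200), 500.0))

mask = three_sigma_mask(sample)
cleaned = sample[~mask]  # keep only the values inside the 3-sigma band
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule is less resistant than the quartile-based rule described in the notes that follow.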
Notes:
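- Interquartile range (IQR) = Q3 - Q1
- Upper bound: the maximum of the non-outlier range. Upper bound = Q3 + 1.5 * IQR
- Lower bound: the minimum of the non-outlier range. Lower bound = Q1 - 1.5 * IQR
- Why a box plot can identify outliers:
- The box plot's criterion for outliers is based on the quartiles and the interquartile range.
- Quartiles are fairly resistant: up to 25% of the data can move arbitrarily far away without greatly disturbing them. Outliers therefore do not distort the shape of the box plot, and the outliers it identifies are relatively objective.
- This gives the box plot a certain advantage for identifying outliers.
- The more outliers there are, the heavier the tails. When the data are modeled with a t-distribution, this corresponds to fewer degrees of freedom (the number of quantities that are free to vary).
- Skewness describes the direction of the asymmetry: if the outliers are concentrated on the small-value side, the distribution is left-skewed; if they are concentrated on the large-value side, it is right-skewed.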
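The bounds and the resistance property in the notes above can be checked on a small example. This is a sketch under stated assumptions: the helper `iqr_bounds`, its parameter `k`, and the toy data are hypothetical, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) = (Q1 - k * IQR, Q3 + k * IQR)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy data; 100 is the obvious outlier.
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = iqr_bounds(data)  # Q1 = 3.25, Q3 = 7.75, so (-3.5, 14.5)
outliers = data[(data < lower) | (data > upper)]  # only the value 100

# Resistance: pushing the extreme value much further out leaves the
# quartiles, and therefore both bounds, unchanged.
data_far = data.replace(100, 1_000_000)
bounds_far = iqr_bounds(data_far)

# The single outlier sits on the large-value side, so the sample
# skewness is positive (right-skewed).
skewness = data.skew()
```

With `k=1.5` this matches the bounds given in the notes; a larger `k` widens the non-outlier range and flags fewer points.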
The following code is reposted from the Zhihu user 阿泽: https://www.zhihu.com/people/is-aze
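# A reusable wrapper for outlier removal; call it on any numeric column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column; by default uses the box plot rule with scale=3.
    :param data: a pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the IQR (1.5 is the conventional box plot value)
    :return:
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box plot rule.
        :param data_ser: a pandas.Series
        :param box_scale: multiplier applied to the IQR
        :return: (rule_low, rule_up), (val_low, val_up)
        """
        # Note: despite its name, `iqr` holds the scaled IQR, box_scale * (Q3 - Q1).
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positional indices of the rows that fall outside either bound.
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    # Map positions to index labels so the drop also works on a non-default index.
    data_n = data_n.drop(data_n.index[index])
    data_n.reset_index(drop=True, inplace=True)
    # shape[0] is the number of rows, not columns.
    print("Now row number is: {}".format(data_n.shape[0]))
    # Summarize the values below the lower bound.
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    # Summarize the values above the upper bound.
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    fig, ax = plt.subplots(1, 2, figsize=(10,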