Feature Engineering: Outlier Handling and the Box Plot Method

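Updated 2024-08-28

Feature engineering is a key step in data analysis and machine learning: it transforms and refines raw data to extract features that are useful for model prediction or learning. Task3 focuses on data preprocessing, and on outlier handling in particular, which matters for model robustness and accuracy. Preprocessing in Python typically relies on the sklearn library, which provides a wide range of tools for preparing data.

Outliers are data points that lie far from the rest of the data set. They may come from measurement errors, recording mistakes, or genuine extreme cases, and they can seriously distort model training and performance. In this task, a box plot or violin plot is used to visualize the distribution and identify outliers. Once found, outliers are usually deleted or replaced so that they do not bias the analysis.

The code defines a function named `outliers_proc` for cleaning outliers. It takes a pandas DataFrame, the name of the column to process, and an optional `scale` parameter that defaults to 3. `scale` is the multiplier applied to the interquartile range (IQR), the standard measure a box plot uses to identify outliers. Inside the function, a helper named `box_plot_outliers` takes a pandas Series and computes its quartiles and IQR. From these it derives a lower bound `val_low` (the first quartile minus `scale` times the IQR) and an upper bound `val_up` (the third quartile plus `scale` times the IQR); any value below `val_low` or above `val_up` is treated as an outlier.

`outliers_proc` first makes a copy of the data set, `data_n`, so that the original data is not modified. It then calls `box_plot_outliers` to find the outlier rows, drops them, resets the index, and prints the number of rows deleted and the number of rows remaining to confirm the result.

This stage of feature engineering checks and improves data quality, especially with respect to outliers, which helps build more accurate models and reduces the noise and bias that outliers introduce. In practice, feature engineering also covers standardization, missing value imputation, feature selection, and feature construction, and each of these steps affects the performance of the final model.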

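Task3 Feature Engineering

I. Data Preprocessing

For this part, the most commonly used package is sklearn (sklearn.preprocessing). Data processing mainly includes the following operations:

Outlier handling

After using a box plot (or violin plot) to find outliers, we usually remove them so that they do not interfere with the experimental results: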

# Code adapted from DW阿泽
import matplotlib.pyplot as plt
import seaborn as sns


def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers using the box-plot rule (scale=3 by default).

    :param data: pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the IQR
    :return: copy of data with the outlier rows removed
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box-plot rule.

        :param data_ser: pandas Series
        :param box_scale: multiplier applied to the IQR
        :return: (rule_low, rule_up), (val_low, val_up)
        """
        # Note: iqr here is the IQR already multiplied by box_scale
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = data_ser < val_low
        rule_up = data_ser > val_up
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)

    # Select rows by index label so that drop() works with any index
    index = data_series.index[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))

    outliers = data_series[rule[0]]
    print("Description of data less than the lower bound is:")
    print(outliers.describe())
    outliers = data_series[rule[1]]
    print("Description of data larger than the upper bound is:")
    print(outliers.describe())

    # Box plots before (left) and after (right) cleaning
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], ax=ax[0])
    sns.boxplot(y=data_n[col_name], ax=ax[1])
    return data_n
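To make the box-plot rule concrete, here is a minimal sketch that applies the same bounds (the quartiles widened by scale times the IQR) to a small made-up Series. The helper name `iqr_bounds` and the sample values are assumptions for illustration only; they do not come from the original code.

```python
import pandas as pd


def iqr_bounds(series, scale=3.0):
    """Return (lower, upper) bounds: the quartiles widened by scale * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    margin = scale * (q3 - q1)
    return q1 - margin, q3 + margin


# Hypothetical sample: nine ordinary values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 100])
low, high = iqr_bounds(values, scale=3.0)

# Boolean mask of the values that fall outside the bounds.
is_outlier = (values < low) | (values > high)
cleaned = values[~is_outlier].reset_index(drop=True)

print("bounds:", low, high)
print("removed:", int(is_outlier.sum()), "kept:", len(cleaned))
```

A smaller scale such as 1.5 (the usual whisker length in a box plot) flags more points; the default of 3 used in this task only removes the more extreme values.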

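Missing value handling

The IRIS data set has no missing values, so a new sample is added to it with all 4 features set to NaN to represent missing data;

Fill with the mean, mode, or median;

Fill using a normal distribution;

from sklearn.preprocessing import Imputer: this is the sklearn class for handling missing feature values (in scikit-learn 0.22 and later it is replaced by sklearn.impute.SimpleImputer);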
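The filling strategies listed above can be sketched with pandas and NumPy alone. The tiny DataFrame, the column name `x`, and the fixed random seed are assumptions made for illustration. The normal-distribution fill shown here draws replacements from a normal distribution whose mean and standard deviation are estimated from the observed values, which is only one possible reading of that strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"x": [1.0, 2.0, 2.0, np.nan, 5.0, np.nan]})

# Fill with the mean, median, or mode of the observed values.
filled_mean = df["x"].fillna(df["x"].mean())
filled_median = df["x"].fillna(df["x"].median())
filled_mode = df["x"].fillna(df["x"].mode().iloc[0])

# Fill with draws from a normal distribution fitted to the observed values.
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
missing = df["x"].isna()
draws = rng.normal(df["x"].mean(), df["x"].std(), size=int(missing.sum()))
filled_normal = df["x"].copy()
filled_normal.loc[missing] = draws

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
print(int(filled_normal.isna().sum()))
```

Mean and median fills keep the column's center but shrink its variance, while random draws preserve the spread at the cost of adding noise; which trade-off is acceptable depends on the model being trained.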
