    dtypes_list = data.dtypes.values
    columns_list = data.columns
    for i in range(len(columns_list)):
        if dtypes_list[i] == 'object':
            lb = LabelEncoder()
            lb.fit(data[columns_list[i]])
            data[columns_list[i]] = lb.transform(data[columns_list[i]])
    data.head()

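This Python code uses `LabelEncoder` to encode the categorical (object-dtype) features of a pandas DataFrame. It first reads the column names and the dtype of each column, then loops over the columns and encodes those whose dtype is `object`. For each such column it creates a new `LabelEncoder`, fits it on the column with `fit`, and converts the column to integer codes with `transform`. The encoded values overwrite the original object column, so every categorical feature in the dataset ends up numeric. The final line calls `head`, which shows the first 5 rows of the encoded dataset.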
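To make the mapping concrete, here is a minimal pure-Python sketch of what label encoding does to a string column: distinct values are sorted and numbered from 0. It is an illustration only, not scikit-learn's implementation; the helper names `label_encode` and `encode_object_columns` and the sample table are assumptions made for this example.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


def encode_object_columns(table):
    """Encode every column whose values are all strings; copy other columns unchanged."""
    encoded, mappings = {}, {}
    for name, column in table.items():
        if all(isinstance(v, str) for v in column):
            encoded[name], mappings[name] = label_encode(column)
        else:
            encoded[name] = list(column)
    return encoded, mappings


# Hypothetical sample data: two string columns and one numeric column
table = {
    "city": ["Paris", "Berlin", "Paris", "Rome"],
    "size": ["S", "L", "M", "S"],
    "price": [3.5, 1.0, 2.25, 4.0],
}
encoded, mappings = encode_object_columns(table)
```

Keeping the per-column class lists (`mappings` here) is what allows the integer codes to be translated back to the original labels later. The codes also imply an ordering between categories, which may or may not be appropriate depending on the model that consumes them.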

`data` is a DataFrame.

    def data_processing(data):
        # Fill in missing dates
        data.fillna(method='ffill', inplace=True)
        date_history = pd.DataFrame(data.iloc[:, 0])
        data_history = pd.DataFrame(data.iloc[:, 1])
        date_history = np.array(date_history)
        data_history = [x for item in np.array(data_history).tolist() for x in item]
        # Handle missing values
        history_time_list = []
        for date in date_history:
            date_obj = datetime.datetime.strptime(date[0], '%Y/%m/%d %H:%M')  # convert the string to a datetime object
            history_time_list.append(date_obj)
        start_time = history_time_list[0]  # start time
        end_time = history_time_list[-1]  # end time
        delta = datetime.timedelta(minutes=15)  # 15-minute interval
        time_new_list = []
        current_time = start_time
        while current_time <= end_time:
            time_new_list.append(current_time)
            current_time += delta
        # Record the positions of the missing entries
        code_list = []
        for i in range(len(time_new_list)):
            code_list = code_list
            history_time_list = history_time_list
            while (time_new_list[i] - history_time_list[i]) != datetime.timedelta(minutes=0):
                history_time_list.insert(i, time_new_list[i])
                code_list.append(i)
        for i in code_list:
            data_history.insert(i, data_history[i - 1])
        # Output the completed data
        data = pd.DataFrame({'date': time_new_list, 'load': data_history})
        return data

Optimize this code.

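The following parts can be optimized:

1. Merge the assignments of `date_history` and `data_history` into one line: `date_history, data_history = data.iloc[:, :2].values.T`.
2. The separate `np.array` conversion of `date_history` is unnecessary, because `.values` already returns a NumPy array (`iloc` on its own returns a DataFrame or Series).
3. The missing timestamps can be filled in with pandas' `resample` instead of a hand-written loop.
4. The values for the inserted timestamps can come from `ffill`, which copies the previous value as the original code does, or from `interpolate` if linear interpolation is preferred. Calling `interpolate` after `ffill` has no effect, because no gaps remain.

The optimized code is shown below. It uses `data.ffill()` and the `'15min'` alias, since the `method=` argument of `fillna` and the `'T'` alias are deprecated in recent pandas releases.

```python
import pandas as pd


def data_processing(data):
    # Forward-fill missing values in the raw input
    data = data.ffill()
    date_history, data_history = data.iloc[:, :2].values.T
    # Convert the timestamp strings to datetime objects
    date_history = pd.to_datetime(date_history, format='%Y/%m/%d %H:%M')
    # Index the load by timestamp; .values.T yields object arrays, so restore a numeric dtype
    data = pd.DataFrame({'load': pd.to_numeric(data_history)}, index=date_history)
    # Resample onto a regular 15-minute grid; each missing slot takes the previous value
    data = data.resample('15min').ffill()
    # Return the completed data with 'date' and 'load' columns
    data = data.rename_axis('date').reset_index()
    return data
```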
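The resample-and-forward-fill step can also be sketched with the standard library alone, which makes explicit what happens at each 15-minute slot. This is a sketch under assumptions, not a replacement for the pandas version: the function name `fill_15min_gaps` and the sample rows are hypothetical, and the input timestamps are assumed to be non-empty and already aligned to the 15-minute grid.

```python
from datetime import datetime, timedelta


def fill_15min_gaps(rows, fmt="%Y/%m/%d %H:%M", step=timedelta(minutes=15)):
    """Return (timestamp, load) pairs on a regular grid, forward-filling gaps.

    `rows` is a non-empty list of (timestamp string, load) pairs whose
    timestamps are assumed to lie on the 15-minute grid.
    """
    observed = {datetime.strptime(ts, fmt): load for ts, load in rows}
    current, end = min(observed), max(observed)
    filled, last_load = [], None
    while current <= end:
        # Use the observed value if present, otherwise repeat the previous one
        last_load = observed.get(current, last_load)
        filled.append((current, last_load))
        current += step
    return filled


# Hypothetical sample: the 00:30 and 00:45 readings are missing
rows = [
    ("2023/01/01 00:00", 10.0),
    ("2023/01/01 00:15", 11.0),
    ("2023/01/01 01:00", 14.0),
]
result = fill_15min_gaps(rows)
```

Looking timestamps up in a dictionary avoids the repeated `list.insert` calls of the original loop, each of which shifts every later element.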

Optimize this code:

    def cluster_format(self, start_time, end_time, save_on=True, data_clean=False, data_name=None):
        """
        local format function is to format data from beihang.
        :param start_time:
        :param end_time:
        :return:
        """
        # Household cluster-level data cleaning
        if data_clean:
            unused_index_col = [i for i in self.df.columns if 'Unnamed' in i]
            self.df.drop(columns=unused_index_col, inplace=True)
            self.df.drop_duplicates(inplace=True, ignore_index=True)
            self.df.reset_index(drop=True, inplace=True)
            dupli_header_lines = np.where(self.df['sendtime'] == 'sendtime')[0]
            self.df.drop(index=dupli_header_lines, inplace=True)
            self.df = self.df.apply(pd.to_numeric, errors='ignore')
            self.df['sendtime'] = pd.to_datetime(self.df['sendtime'])
            self.df.sort_values(by='sendtime', inplace=True, ignore_index=True)
            self.df.to_csv(data_name, index=False)
        # Call the base formatting routine
        self.df = super().format(start_time, end_time)
        module_number_register = np.unique(self.df['bat_module_num'])
        # if registered m_num is 0 and not changed, there is no module data
        if not np.any(module_number_register):
            logger.logger.warning("No module data!")
            sys.exit()
        if 'bat_module_voltage_00' in self.df.columns:
            volt_ref = 'bat_module_voltage_00'
        elif 'bat_module_voltage_01' in self.df.columns:
            volt_ref = 'bat_module_voltage_01'
        elif 'bat_module_voltage_02' in self.df.columns:
            volt_ref = 'bat_module_voltage_02'
        else:
            logger.logger.warning("No module data!")
            sys.exit()
        self.df.dropna(axis=0, subset=[volt_ref], inplace=True)
        self.df.reset_index(drop=True, inplace=True)
        self.headers = list(self.df.columns)
        # time duration of a cluster
        self.length = len(self.df)
        if self.length == 0:
            logger.logger.warning("After cluster data clean, no effective data!")
            raise ValueError("No effective data after cluster data clean.")
        self.cluster_stats(save_on)
        for m in range(self.mod_num):
            print(self.clusterid, self.mod_num)
            self.module_list.append(np.unique(self.df[f'bat_module_sn_{str(m).zfill(2)}'].dropna())[0])

Here are some possible optimizations for the given code:

1. The list comprehension that finds the columns with 'Unnamed' in their names can also be written with `filter()` and a lambda. The two forms are equivalent and the comprehension is already idiomatic, so this one is a matter of taste:
   ```
   unused_index_col = list(filter(lambda x: 'Unnamed' in x, self.df.columns))
   ```
2. `drop_duplicates()` is already called with `ignore_index=True`, which renumbers the index, so the `reset_index(drop=True, inplace=True)` call that follows it is redundant and can be removed:
   ```
   self.df.drop_duplicates(inplace=True, ignore_index=True)
   ```
3. Instead of calling `sys.exit()` to terminate the program when there is no module data, raise a `ValueError` with an appropriate message so that the caller can decide how to handle it:
   ```
   raise ValueError("No module data!")
   ```
4. Instead of a chain of `if`/`elif` statements to find the voltage reference column, boolean indexing on the column index selects the first column whose name starts with 'bat_module_voltage'. Note that `[0]` raises an `IndexError` when no column matches, so the "No module data!" case still has to be handled, and the result follows column order rather than the fixed `_00`, `_01`, `_02` preference:
   ```
   volt_ref = self.df.columns[self.df.columns.str.startswith('bat_module_voltage')][0]
   ```
5. The `print` call inside the final loop looks like leftover debugging output and can be dropped, which leaves only the `append()` call in the loop body:
   ```
   self.module_list.append(np.unique(self.df[f'bat_module_sn_{str(m).zfill(2)}'].dropna())[0])
   ```

Applying these changes makes the code more concise and its failure cases easier to handle. None of them is expected to change performance noticeably.
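The voltage-reference lookup and the switch from `sys.exit()` to an exception can be combined in one small helper. The sketch below is one possible shape, assuming the three column names from the question are the only candidates; the function name `pick_voltage_reference` is hypothetical and works on any iterable of column names, so it does not depend on pandas.

```python
def pick_voltage_reference(
    columns,
    candidates=(
        "bat_module_voltage_00",
        "bat_module_voltage_01",
        "bat_module_voltage_02",
    ),
):
    """Return the first candidate present in `columns`, else raise ValueError."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    # Raising lets the caller log, retry, or skip instead of killing the process
    raise ValueError("No module data!")


# Hypothetical column list in which the _00 column is absent
volt_ref = pick_voltage_reference(
    ["sendtime", "bat_module_voltage_02", "bat_module_voltage_01"]
)
```

Unlike a prefix match on the column index, this keeps the original preference order (`_00`, then `_01`, then `_02`) regardless of how the columns are arranged in the DataFrame.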
