    import pandas as pd
    import os

    path = 'path/to/folder'
    excel_files = os.listdir(path)
    df_list = []
    for file in excel_files:
        if file.endswith('.xlsx'):
            df = pd.read_excel(os.path.join(path, file))
            df_list.append(df)
    merged_df = pd.concat(df_list, ignore_index=True)
    deduplicated_df = merged_df.drop_duplicates()
    deduplicated_df.to_excel('path/to/output/file.xlsx', index=False)

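This code reads every Excel file in a folder, merges them into a single pandas DataFrame, removes duplicate rows, and saves the result as a new Excel file. The steps are:

1. Import the `pandas` and `os` modules.
2. Set the variable `path` to the folder to read from.
3. Call `os.listdir(path)` to list every file and subfolder in `path`, and store the names in `excel_files`.
4. Create an empty list, `df_list`, to hold the DataFrame read from each Excel file.
5. Loop over `excel_files`; for each name ending in `.xlsx`, read the file into a DataFrame and append it to `df_list`.
6. Call `pd.concat()` to combine all DataFrames in `df_list` into one, with `ignore_index=True` so that the index is renumbered from 0.
7. Call `drop_duplicates()` on the merged DataFrame to remove every row that matches an earlier row in all columns; the first occurrence is kept.
8. Call `to_excel()` to write the deduplicated DataFrame to a new Excel file, with `index=False` so that the index is not written as a column.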
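The merge and deduplication steps can also be tried without any files on disk. The sketch below is only an illustration: the helper names `list_workbooks` and `merge_and_dedupe` and the sample data are made up, and skipping names that start with `~$` assumes the folder may contain the lock files Excel creates while a workbook is open.

```python
from pathlib import Path

import pandas as pd


def list_workbooks(folder):
    """Return the .xlsx files in `folder`, skipping Excel lock files (~$name.xlsx)."""
    return sorted(
        p for p in Path(folder).glob('*.xlsx') if not p.name.startswith('~$')
    )


def merge_and_dedupe(frames):
    """Concatenate DataFrames and drop rows that repeat an earlier row exactly."""
    frames = list(frames)
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame
        return pd.DataFrame()
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates().reset_index(drop=True)


# Made-up sample data standing in for two workbooks
a = pd.DataFrame({'id': [1, 2], 'name': ['x', 'y']})
b = pd.DataFrame({'id': [2, 3], 'name': ['y', 'z']})
result = merge_and_dedupe([a, b])
print(result)
```

With real files, the same function could be called as `merge_and_dedupe(pd.read_excel(p) for p in list_workbooks('path/to/folder'))`. If only some columns should decide what counts as a duplicate, `drop_duplicates` accepts a `subset` argument.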

    import os
    import re
    import pandas as pd

    # Regular expressions
    regex = re.compile(r'^2-.*\.xlsx$')
    continuation_regex = re.compile(r'^\d+\s.*$')  # assumes the first column of a continuation sheet is digits followed by whitespace

    # Read every matching xlsx file
    files = [f for f in os.listdir('.') if os.path.isfile(f) and regex.match(f)]
    dfs = []
    for file in files:
        df = pd.read_excel(file, sheet_name=None, engine='openpyxl')
        for sheet_name, sheet_df in df.items():
            # Detect a continuation sheet and append it to the previous table
            if sheet_df.iloc[0, 0] and continuation_regex.match(str(sheet_df.iloc[0, 0])):
                dfs[-1] = pd.concat([dfs[-1], sheet_df])
            else:
                dfs.append(sheet_df)

    # Combine all tables
    merged_df = pd.concat(dfs)

    # Merge all tables using the city as the key
    grouped_df = merged_df.groupby('城市').agg(lambda x: x.tolist())

    # Turn every list into a string and convert back to a DataFrame
    grouped_df = grouped_df.applymap(lambda x: ','.join([str(i) for i in x]))
    grouped_df = pd.DataFrame(grouped_df.to_dict())

    # Print the result
    print(grouped_df)

Modify this to iterate over a specified list.

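To make this code scan the xlsx files in a specified folder instead of the current one, replace `os.listdir('.')` with `os.listdir` on that folder, and join the folder onto each file name so that `os.path.isfile` and `pd.read_excel` receive a full path. For example:

```
import os
import re
import pandas as pd

# Regular expressions
regex = re.compile(r'^2-.*\.xlsx$')
continuation_regex = re.compile(r'^\d+\s.*$')  # assumes the first column of a continuation sheet is digits followed by whitespace

# Folder to scan
folder_path = 'path/to/folder'

# Collect every matching xlsx file in the folder
files = [
    os.path.join(folder_path, f)
    for f in os.listdir(folder_path)
    if os.path.isfile(os.path.join(folder_path, f)) and regex.match(f)
]

dfs = []
for file in files:
    # sheet_name=None returns a dict that maps each sheet name to a DataFrame
    sheets = pd.read_excel(file, sheet_name=None, engine='openpyxl')
    for sheet_name, sheet_df in sheets.items():
        if sheet_df.empty:
            continue  # a blank sheet has no first cell to inspect
        # Append a continuation sheet to the previous table, if there is one
        if dfs and continuation_regex.match(str(sheet_df.iloc[0, 0])):
            dfs[-1] = pd.concat([dfs[-1], sheet_df])
        else:
            dfs.append(sheet_df)

# Combine all tables
merged_df = pd.concat(dfs)

# Group by city and join each column's values into one comma-separated string
grouped_df = merged_df.groupby('城市').agg(lambda x: ','.join(str(i) for i in x))

# Print the result
print(grouped_df)
```

Replace `'path/to/folder'` with the actual folder path. Compared with the original, this version also skips blank sheets and no longer fails when the very first sheet looks like a continuation, and it joins the values directly in `agg` instead of going through `applymap`, which newer pandas versions deprecate.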
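The request can also be read literally, as iterating over an explicit list of file names rather than a folder. The sketch below takes that reading and keeps the three pieces of logic (name filtering, continuation stitching, grouping by `城市`) as small functions that can be checked on in-memory data. The function names, the sample file names, and the `sales` column are made up for illustration; the two regular expressions are the ones from the code above.

```python
import os
import re

import pandas as pd

NAME_RE = re.compile(r'^2-.*\.xlsx$')
CONTINUATION_RE = re.compile(r'^\d+\s.*$')


def select_files(candidates):
    """Keep the paths from an explicit list whose file name matches 2-*.xlsx."""
    return [p for p in candidates if NAME_RE.match(os.path.basename(p))]


def stitch_sheets(sheets):
    """Append each continuation sheet to the table that precedes it."""
    tables = []
    for sheet_df in sheets:
        if sheet_df.empty:
            continue  # nothing to inspect in a blank sheet
        first_cell = str(sheet_df.iloc[0, 0])
        if tables and CONTINUATION_RE.match(first_cell):
            tables[-1] = pd.concat([tables[-1], sheet_df], ignore_index=True)
        else:
            tables.append(sheet_df)
    return tables


def join_by_city(df, key='城市'):
    """Group by the key column and join every other column's values with commas."""
    return df.groupby(key).agg(lambda s: ','.join(str(v) for v in s))


# Hypothetical inputs
wanted = select_files(['2-a.xlsx', '3-b.xlsx', 'reports/2-c.xlsx', '2-d.csv'])
main = pd.DataFrame({'城市': ['北京', '上海'], 'sales': [10, 20]})
extra = pd.DataFrame({'城市': ['1 广州'], 'sales': [30]})  # first cell: digits + space
tables = stitch_sheets([main, extra])
joined = join_by_city(
    pd.DataFrame({'城市': ['北京', '上海', '北京'], 'sales': [10, 20, 30]})
)
```

To use it on real workbooks, each path in `wanted` would be read with `pd.read_excel(path, sheet_name=None, engine='openpyxl')` and the resulting sheets passed to `stitch_sheets`.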

    D:\anaconda\lib\site-packages\pandas\core\series.py:4494: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame

    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      self._update_inplace(result)

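This warning is not caused by `.iloc` as such. It appears because `df.iloc[2]` returns a separate Series holding that row, and `dropna(inplace=True)` then modifies that Series rather than `df`. The call therefore has no effect on the DataFrame, and the file written afterwards still contains the empty cells. Replacing `.iloc` with `.loc` does not change this, because `df.loc[2]` also returns a separate Series. The fix is to stop calling an in-place method on a row selection and to assign the result back to `df` explicitly. Note also that a DataFrame is rectangular, so individual cells cannot be removed from a single row. The example below assumes the intent is to drop the third data row when it contains empty values:

```python
import os
import pandas as pd

# Folder path
folder_path = r'D:\Download\11'

# Collect every Excel file in the folder
file_list = os.listdir(folder_path)
excel_files = [file for file in file_list if file.endswith(('.xlsx', '.xls'))]

# Process each Excel file
for file in excel_files:
    file_path = os.path.join(folder_path, file)
    # Read the Excel file (the first sheet row becomes the header)
    df = pd.read_excel(file_path)
    # Assumption: drop the third data row when it has any empty value
    if len(df) > 2 and df.iloc[2].isna().any():
        df = df.drop(index=df.index[2])
    # Save the modified file; this overwrites the original
    df.to_excel(file_path, index=False)
```

Here the result of `drop` is assigned back to `df`, so no in-place call is made on a selection and the warning does not appear. Files ending in `.xls` are read, but writing them back in that format may fail on recent pandas versions, so the loop is safest with `.xlsx` files.

The warning itself is only a hint that an assignment may have landed on a copy; it does not stop the program. If you have confirmed that your logic is correct and still want to hide it, add the following two lines at the top of the script. Be aware that this hides every warning, not only this one:

```python
import warnings
warnings.filterwarnings('ignore')
```

After that, the warning is no longer shown at run time.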
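The behaviour behind the warning can be reproduced on a small in-memory frame. The sketch below uses made-up data to show that dropping empty values from a selected row leaves the DataFrame itself unchanged, and then gives one possible way to write a result back. The helper `shift_row_left` is hypothetical, and packing the remaining values to the left is only an assumption about what removing the empty values of a row should mean.

```python
import numpy as np
import pandas as pd

# Made-up frame whose third row has an empty first cell
df = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                   'b': [4.0, 5.0, 6.0],
                   'c': [7.0, 8.0, 9.0]})

# Dropping NaN from a copy of the row changes only that Series
row = df.iloc[2].copy()
row.dropna(inplace=True)
still_empty = bool(df.iloc[2].isna().any())  # the frame itself is untouched


def shift_row_left(frame, position):
    """Return a copy in which the row's non-empty values are packed to the left."""
    out = frame.astype(object)  # object dtype so any value fits any column
    kept = out.iloc[position].dropna().tolist()
    padding = [np.nan] * (out.shape[1] - len(kept))
    out.iloc[position] = kept + padding  # explicit assignment back into the frame
    return out


shifted = shift_row_left(df, 2)
print(shifted)
```

Converting to `object` dtype keeps the assignment simple but gives up numeric column types, so this is better suited to a frame that is about to be written back to Excel than to one used for further calculation.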
