    import os
    import pdfplumber

    # Output directory for the text files
    txts_dir = "./Users/hh/Desktop/python/fileshunan"
    for path in os.listdir(files_dir):
        if os.path.isfile(os.path.join(files_dir, path)):
            file = os.path.join(files_dir, path)
            print(file)
            if file.endswith(".pdf"):
                with pdfplumber.open(file) as pdf:
                    txt_file = open(txts_dir + path[0:-4] + ".txt", mode='w', encoding='utf-8')
                    for page in pdf.pages:
                        txt_file.write(page.extract_text())
            elif file.endswith(".html"):
                soup = BeautifulSoup(open(file, encoding='utf-8'))
                with open(txts_dir + path[0:-5] + ".txt", mode='w', encoding='utf-8') as txt_file:
                    txt_file.write(soup.get_text())
    # Save the files as text

Time: 2024-04-13 10:30:00 Views: 307

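Judging from your code, you want to convert the PDF and HTML files in a folder to plain-text files, using the `pdfplumber` library for the PDFs and BeautifulSoup for the HTML. First, make sure every library the script uses is imported; the snippet in the question calls `BeautifulSoup` without importing it:

```python
import os

import pdfplumber
from bs4 import BeautifulSoup
```

You can then convert the files with the following code:

```python
# Source directory (placeholder: the question never defines files_dir)
files_dir = "/path/to/source/files"
# Output directory for the text files
txts_dir = "/Users/hh/Desktop/python/fileshunan/"

for path in os.listdir(files_dir):
    file = os.path.join(files_dir, path)
    if not os.path.isfile(file):
        continue
    print(file)
    # Output name: the original file name with its extension replaced by .txt
    txt_path = os.path.join(txts_dir, os.path.splitext(path)[0] + ".txt")
    if file.endswith(".pdf"):
        with pdfplumber.open(file) as pdf, \
                open(txt_path, mode="w", encoding="utf-8") as txt_file:
            for page in pdf.pages:
                # extract_text() can return None for a page with no text layer
                txt_file.write(page.extract_text() or "")
    elif file.endswith(".html"):
        with open(file, encoding="utf-8") as html_file:
            soup = BeautifulSoup(html_file, "lxml")
        with open(txt_path, mode="w", encoding="utf-8") as txt_file:
            txt_file.write(soup.get_text())
```

The code first sets `txts_dir`, the directory where the text files are saved, then iterates over the entries of the source folder with `os.listdir()`. For each entry it builds the full path with `os.path.join()` and uses `os.path.isfile()` to skip anything that is not a regular file.

If the name ends in `.pdf`, `pdfplumber.open()` opens the document, the loop visits each page, and `page.extract_text()` returns that page's text, which is written to a new text file named after the original. If the name ends in `.html`, BeautifulSoup parses the file and `soup.get_text()` returns its plain-text content, which is written out the same way.

Compared with the code in the question, this version changes a few things:

- `files_dir` was never defined, so the loop would fail with a `NameError`. The value above is only a placeholder; set it to your real source folder.
- `"./Users/hh/Desktop/python/fileshunan"` is a path relative to the current working directory, and it has no trailing separator, so `txts_dir + path[0:-4]` would produce names like `fileshunanreport.txt`. The code uses an absolute path and builds the output path with `os.path.join()`.
- The PDF branch opened `txt_file` without ever closing it; it is now managed by the `with` statement, as is the HTML input file.
- The parser is passed to BeautifulSoup explicitly. `"lxml"` requires the `lxml` package to be installed; the built-in `"html.parser"` works without it.

Make sure the directory in `txts_dir` exists before running the script, and adjust both paths to your own files and requirements.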
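Deriving the output file name is the step most sensitive to how the input files are named: slicing off a fixed number of characters, as `path[0:-4]` and `path[0:-5]` do in the question, assumes the extension always has that exact length, and `endswith(".pdf")` misses names such as `REPORT.PDF`. Below is a minimal sketch of one alternative using `pathlib`. The helper name `txt_path_for` and the constant `SUPPORTED_SUFFIXES` are hypothetical, introduced here only for illustration, and the case-insensitive matching is an assumption about what you want rather than something the question states.

```python
from pathlib import Path

# Extensions the conversion loop handles; anything else is skipped.
# (Hypothetical constant, not part of the original code.)
SUPPORTED_SUFFIXES = {".pdf", ".html"}


def txt_path_for(source, out_dir):
    """Return the .txt output path for `source` inside `out_dir`.

    Returns None when the file extension is not one the loop converts.
    """
    source = Path(source)
    # Compare in lower case so "a.PDF" and "a.pdf" are treated alike.
    if source.suffix.lower() not in SUPPORTED_SUFFIXES:
        return None
    # `stem` drops only the final extension, so "a.b.pdf" becomes "a.b".
    return Path(out_dir) / (source.stem + ".txt")
```

Inside the loop, you could call `txt_path_for(path, txts_dir)` once per file, skip the file when the result is `None`, and pass the returned path straight to `open()`. Joining with `/` on a `Path` also removes the need for a trailing separator on `txts_dir`.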
