
Custom function File2Txt: extract the text of PDF and Word files and save it in txt format.

Time: 2024-05-08 14:19:13

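This can be solved with third-party Python libraries such as PyPDF2 and docx2txt. Check the file type, read the text with the matching library, then save it in txt format. An implementation could look like this:

```python
import PyPDF2
import docx2txt


def File2Txt(file_path):
    if file_path.endswith('.pdf'):
        # Read the PDF text with PyPDF2. This uses the PdfReader API of
        # PyPDF2 3.x; PdfFileReader, numPages, getPage and extractText
        # were removed in that release.
        with open(file_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            text = ''
            for page in pdf_reader.pages:
                # Guard against pages that yield no text (e.g. scanned images)
                text += page.extract_text() or ''
    elif file_path.endswith('.docx'):
        # Read the .docx text with docx2txt
        text = docx2txt.process(file_path)
    else:
        raise Exception('Unsupported file type')
    # Save as a txt file next to the source file
    with open(file_path + '.txt', 'w', encoding='utf-8') as f:
        f.write(text)
```

Calling this function extracts the text of a pdf or docx file and saves it in txt format. Only `.pdf` and `.docx` are handled; any other extension, including legacy `.doc`, raises an exception.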
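Two details of `File2Txt` are worth noting: the `endswith` checks are case-sensitive, so `REPORT.PDF` is rejected, and the output name is built by appending `.txt`, so `report.pdf` becomes `report.pdf.txt`. The following is a minimal sketch of helpers that could sit around the function to handle both points and to process a whole folder. The names `is_supported`, `txt_path_for` and `convert_folder` are hypothetical and not part of the original answer; only the standard library is used.

```python
from pathlib import Path

# Assumption: the same two file types that File2Txt handles
SUPPORTED_SUFFIXES = {".pdf", ".docx"}


def is_supported(file_path):
    """Case-insensitive check of the file extension."""
    return Path(file_path).suffix.lower() in SUPPORTED_SUFFIXES


def txt_path_for(file_path):
    """Return the sibling .txt path, e.g. report.pdf -> report.txt."""
    return Path(file_path).with_suffix(".txt")


def convert_folder(folder, convert):
    """Call convert(path_string) on each supported file directly in folder.

    `convert` is any callable that takes a path string, such as File2Txt.
    Returns the list of paths that were passed to it.
    """
    done = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and is_supported(path):
            convert(str(path))
            done.append(path)
    return done
```

With these in place, `convert_folder("docs", File2Txt)` would convert every pdf and docx file in a `docs` directory (an example path). Note that `File2Txt` as written still does its own case-sensitive check, so it would need to lowercase the name before calling `endswith` for upper-case extensions to pass.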