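Extract the talk_id, time (converted to 'YYYY-MM-DD' format), sender ID, receiver ID, message type (set to private chat if the record says "向" (to), and to group chat if it says "在群" (in group)), message content (an empty default value for voice messages and images; otherwise the message text), and file name (for images and voice messages, the link address that follows `<a href=`).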

To extract `talk_id`, the time, sender ID, receiver ID, message type, message content, and file name, you can combine regular expressions with `BeautifulSoup` to parse the HTML and pull out the required data. Here is the revised code example:

```python
import os
import re

from bs4 import BeautifulSoup

# Folder that contains the HTML files
folder_path = "C:/Users/test/Desktop/DIDItest"

# Regular expression patterns
# (these assume the fields appear as [talkid:...], [sender:...], and so on)
talkid_pattern = r'\[talkid:(\d+)\]'
time_pattern = r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]'
sender_pattern = r'\[sender:(.*?)\]'
receiver_pattern = r'\[receiver:(.*?)\]'
type_pattern = r'\[type:(.*?)\]'

# Walk through every file in the folder
for root, dirs, files in os.walk(folder_path):
    for file in files:
        # Read the HTML file
        file_path = os.path.join(root, file)
        with open(file_path, "r", encoding="utf-8") as f:
            html_code = f.read()

        # Parse the HTML
        soup = BeautifulSoup(html_code, "html.parser")

        # Extract talk_id
        talk_id = re.findall(talkid_pattern, html_code)

        # Extract the timestamps and keep only the date part ('YYYY-MM-DD')
        time = re.findall(time_pattern, html_code)
        time = [t.split()[0] for t in time]

        # Extract sender ID and receiver ID
        sender = re.findall(sender_pattern, html_code)
        receiver = re.findall(receiver_pattern, html_code)

        # Extract the message type: "向" (to) means private chat,
        # anything else (e.g. "在群", in group) is treated as group chat
        info_type = re.findall(type_pattern, html_code)
        info_type = ["private chat" if "向" in t else "group chat" for t in info_type]

        # Extract message content and file name
        content = []
        file_name = []
        messages = soup.find_all("div", class_="message")
        for message in messages:
            link = message.find("a")
            if link:
                # Image or voice message: keep the href link, leave content empty
                file_name.append(link.get("href", ""))
                content.append("")
            else:
                # Text message: keep the text, leave file name empty
                content.append(message.text.strip())
                file_name.append("")

        # Print the extracted data; zip stops at the shortest list,
        # which avoids an IndexError when the lists differ in length
        for tid, t, s, r, kind, c, fn in zip(
            talk_id, time, sender, receiver, info_type, content, file_name
        ):
            print("talk_id:", tid)
            print("time:", t)
            print("sender ID:", s)
            print("receiver ID:", r)
            print("message type:", kind)
            print("content:", c)
            print("file name:", fn)
            print()
```

This code uses several regular expression patterns to match strings of the form `[talkid:digits]`, `[timestamp]`, `[sender:...]`, `[receiver:...]`, and `[type:...]`. It then parses the HTML with `BeautifulSoup` and uses `find_all()` to locate each message's content and link. Finally, it prints the extracted data. The `re` module is part of the Python standard library and needs no installation; `BeautifulSoup` can be installed with the following command:

```
pip install beautifulsoup4
```
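The per-message rules from the question (date-only time, private versus group chat, empty content for files) can also be checked in isolation, without a folder of HTML files. The following is a minimal sketch under stated assumptions: `normalize_record`, `HREF_PATTERN`, `TAG_PATTERN`, and the sample inputs are hypothetical names and data, not part of the original code, and the href is pulled out with a plain regular expression instead of `BeautifulSoup` so that the sketch runs on the standard library alone.

```python
import re
from datetime import datetime

# Hypothetical helper patterns (not from the original answer):
# HREF_PATTERN captures the href value of the first <a> tag,
# TAG_PATTERN matches any HTML tag so it can be stripped from text.
HREF_PATTERN = re.compile(r'<a\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
TAG_PATTERN = re.compile(r'<[^>]+>')


def normalize_record(timestamp, type_marker, body_html):
    """Apply the question's conversion rules to a single message."""
    # 'YYYY-MM-DD HH:MM:SS' -> 'YYYY-MM-DD'; raises ValueError on malformed input
    date = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d")

    # Assumption: the marker "向" (to) means private chat,
    # any other marker (e.g. "在群", in group) means group chat
    chat_type = "private chat" if "向" in type_marker else "group chat"

    match = HREF_PATTERN.search(body_html)
    if match:
        # Image or voice message: content stays empty, file name is the link
        content, file_name = "", match.group(1)
    else:
        # Text message: strip the tags and keep the text, no file name
        content, file_name = TAG_PATTERN.sub("", body_html).strip(), ""

    return {"time": date, "type": chat_type, "content": content, "file_name": file_name}


# Made-up sample messages for illustration only
text_msg = normalize_record("2023-05-01 12:30:45", "向", "<div>hello</div>")
file_msg = normalize_record(
    "2023-05-02 08:00:00", "在群", '<div><a href="files/voice_01.amr">voice</a></div>'
)
print(text_msg)
print(file_msg)
```

Keeping the rules in one small function makes it easier to adjust a single rule, such as the marker used for private chats, once the real file format is confirmed.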
