The page source follows this template:

<html>
<meta http-equiv="Content=-Type" content="text/html; charset=utf-8"/>
<head>
<link href="http://192.167.1.101/hint.css" rel="stylesheet"></link>
</head>
<body>[talkid:137031381]2014年4月20日 03:55:45 , 2323234 在群 20011 中发言我们已经是dffwerwer天吧！
[talkid:137031382]2014年4月22日 04:45:45 , 2323234 在群 20011 中发言音频 :[<a href="files/24434sfsfsjdfrhf_n.m4a"]>音频</a>
[talkid:137031383]2014年4月23日 04:55:45 , 2323234 在群 20011 中发言图片 :[<a href="files/f/f123ftrtrhf_n.jpg"]>图片</a>
[talkid:137031384]2014年4月24日 05:55:45 , 2323234 在群 20011 中发言我们已方式方法方式
[talkid:137031385]2014年4月25日 06:55:45 , 2323234 在群 20011 中发言我而对方是否的天吧！
</body>
</html>

Using a Python crawler, extract from each line of the body the text that follows "中发言" or "发送". If that text starts with "音频 :" or "图片 :", output the content inside the [] instead.

Time: 2023-08-14 16:01:41 Views: 85

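You can use regular expressions to extract the text that follows "中发言" or "发送" on each line of the body, and a capture group to pull out the bracketed part of audio and image records. Processing the body line by line keeps each match confined to a single record. In the sample, the brackets of a media record hold an opening `<a href="...">` tag, so the code below takes the `href` value as the link. Example code:

```python
import re

html = '''
<html>
<meta http-equiv="Content=-Type" content="text/html; charset=utf-8"/>
<head>
<link href="http://192.167.1.101/hint.css" rel="stylesheet"></link>
</head>
<body>[talkid:137031381]2014年4月20日 03:55:45 , 2323234 在群 20011 中发言我们已经是dffwerwer天吧！
[talkid:137031382]2014年4月22日 04:45:45 , 2323234 在群 20011 中发言音频 :[<a href="files/24434sfsfsjdfrhf_n.m4a"]>音频</a>
[talkid:137031383]2014年4月23日 04:55:45 , 2323234 在群 20011 中发言图片 :[<a href="files/f/f123ftrtrhf_n.jpg"]>图片</a>
[talkid:137031384]2014年4月24日 05:55:45 , 2323234 在群 20011 中发言我们已方式方法方式
[talkid:137031385]2014年4月25日 06:55:45 , 2323234 在群 20011 中发言我而对方是否的天吧！
</body>
</html>
'''

# Work only on the text between <body> and </body>.
body = re.search(r'<body>(.*?)</body>', html, re.DOTALL).group(1)

# One record per line: the talk ID, then everything after "中发言" or "发送".
record = re.compile(r'\[talkid:(\d+)\].*?(?:中发言|发送)(.*)')
# Media records: "音频 :" or "图片 :" followed by '[<a href="..."'.
media = re.compile(r'(音频|图片) :\[<a href="([^"]*)"')

labels = {'音频': 'Audio link', '图片': 'Image link'}

for line in body.splitlines():
    m = record.search(line)
    if m is None:
        continue  # blank line or text that is not a record
    talk_id, content = m.group(1), m.group(2).strip()
    link = media.match(content)
    print(f"Talk ID: {talk_id}")
    if link:
        print(f"{labels[link.group(1)]}: {link.group(2)}")
    else:
        print(f"Text: {content}")
```

Output:

```
Talk ID: 137031381
Text: 我们已经是dffwerwer天吧！
Talk ID: 137031382
Audio link: files/24434sfsfsjdfrhf_n.m4a
Talk ID: 137031383
Image link: files/f/f123ftrtrhf_n.jpg
Talk ID: 137031384
Text: 我们已方式方法方式
Talk ID: 137031385
Text: 我而对方是否的天吧！
```

This gives you the talk ID of every record, together with either the message text or the link to the audio or image file.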
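As a sketch of a variant that does not depend on line breaks (useful if the page text arrives flattened onto one line), the records can be delimited by a lookahead for the next `[talkid:` marker. The helper name `extract_records`, the tuple layout, and the shortened sample string are assumptions made for illustration; they are not part of the question. The sketch also assumes every record contains "中发言" or "发送"; a record without either marker would be merged into the following one.

```python
import re

# Hypothetical helper, shown only to illustrate the lookahead technique.
RECORD = re.compile(
    r'\[talkid:(\d+)\]'            # talk ID
    r'.*?(?:中发言|发送)'            # skip timestamp/sender up to the marker
    r'(.*?)'                       # message content (lazy)
    r'(?=\[talkid:|</body>|\Z)',   # stop at the next record or end of body
    re.DOTALL,
)
MEDIA = re.compile(r'(音频|图片) :\[<a href="([^"]*)"')


def extract_records(html):
    """Return (talk_id, kind, value) tuples; kind is 'text', '音频' or '图片'."""
    records = []
    for m in RECORD.finditer(html):
        talk_id, content = m.group(1), m.group(2).strip()
        media = MEDIA.match(content)
        if media:
            # For media records, keep the href value found inside the brackets.
            records.append((talk_id, media.group(1), media.group(2)))
        else:
            records.append((talk_id, 'text', content))
    return records


# Made-up, shortened sample in the same format as the question, all on one line.
sample = ('<body>[talkid:1]2014年4月20日 03:55:45 , 2323234 在群 20011 中发言你好 '
          '[talkid:2]2014年4月22日 04:45:45 , 2323234 在群 20011 中发言音频 :'
          '[<a href="files/a.m4a"]>音频</a> </body>')

for rec in extract_records(sample):
    print(rec)
```

Returning tuples instead of printing directly makes it easier to write the results to a file or a database afterwards.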
