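bs4 find_all: excluding a tag

To exclude certain tags, you can use the `:not()` pseudo-class, which bs4 supports through its CSS selector method `select()`. The example below finds all `p` tags except those with class "exclude":

```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p>This is an ordinary p tag</p>
<p class="exclude">This p tag will be excluded</p>
<p>Another ordinary p tag</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select every <p> that does not have the class "exclude"
p_tags = soup.select('p:not(.exclude)')
for p in p_tags:
    print(p.text)
```

Output:

```
This is an ordinary p tag
Another ordinary p tag
```

In CSS selector syntax, the `:not()` pseudo-class takes a selector as its argument and excludes the elements that match it. In the example above, `.exclude` is passed to `:not()`, so elements with class "exclude" are left out.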
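Since the question asks about `find_all` specifically, the same exclusion can also be sketched without CSS selectors by passing a filter function to `find_all`. This is a minimal sketch: the helper name `is_included_p` and the sample HTML are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<p>ordinary one</p>
<p class="exclude">excluded</p>
<p class="note exclude">also excluded</p>
<p>ordinary two</p>
"""


def is_included_p(tag):
    """Return True for <p> tags that do not carry the "exclude" class."""
    # tag.get("class") is a list of class names, or None when the attribute is absent.
    return tag.name == "p" and "exclude" not in (tag.get("class") or [])


soup = BeautifulSoup(html, "html.parser")

# find_all calls the function on every tag and keeps those for which it returns True
kept = [p.get_text() for p in soup.find_all(is_included_p)]
print(kept)
```

A filter function is handy when the exclusion rule is more than a class check, for example when it depends on several attributes or on the tag's text.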

Traceback (most recent call last):
  File "C:\Users\test\Desktop\DIDI测试.py", line 55, in <module>
    extract_html_info(file_path, csv_writer)
  File "C:\Users\test\Desktop\DIDI测试.py", line 25, in extract_html_info
    timestamp = p_tag.find_previous('body').find_previous('head').find('meta', {'http-equiv': 'Content=-Type'})[
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

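This error occurs because the element was not found while extracting the timestamp: `find()` returned `None`, and indexing `None` raises the `TypeError`. Note that the attribute value in the traceback is written `'Content=-Type'`, which cannot match a tag whose attribute is `Content-Type`. Changes in the HTML structure or inconsistent tag attributes can cause the same failure. You can adjust the code to cope with different HTML structures. Here is the modified code:

```python
import os
import csv
from bs4 import BeautifulSoup

# Folder containing the HTML files, and the CSV file to write
folder_path = r'C:\Users\test\Desktop\DIDItest'
output_file = r'C:\Users\test\Desktop\output.csv'


def extract_html_info(file_path, csv_writer):
    """Extract the required fields from one HTML file and write them as CSV rows."""
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')

    # Timestamp: first Content-Type meta tag whose content is not a charset declaration
    timestamp = ''
    for meta_tag in soup.find_all('meta', {'http-equiv': 'Content-Type'}):
        content = meta_tag.get('content', '')
        if 'charset' not in content:
            timestamp = content
            break

    for p_tag in soup.find_all('p'):
        # ID: nearest preceding text node that contains '[talkid:'
        talk_id = p_tag.find_previous(string=lambda text: isinstance(text, str) and '[talkid:' in text)
        # strip() removes any of the characters "[talkid:]" from both ends
        talk_id = talk_id.strip('[talkid:]') if talk_id else ''

        # Sender and receiver numbers
        numbers = p_tag.find_all('span', {'class': 'hint-success'})
        if len(numbers) < 2:
            continue  # skip <p> tags that do not contain both numbers
        send_number = numbers[0].text
        receive_number = numbers[1].text

        # Message content: the text node directly before the <p> tag
        sibling = p_tag.previous_sibling
        message = sibling.strip() if isinstance(sibling, str) else ''

        # For audio messages ('音频' is the marker in the source data), store the link instead
        link = p_tag.find('a')
        if '音频' in message and link is not None and link.get('href'):
            csv_writer.writerow([talk_id, timestamp, send_number, receive_number, link['href']])
        else:
            csv_writer.writerow([talk_id, timestamp, send_number, receive_number, message])


# Create the CSV file and write the data
with open(output_file, 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(['ID', 'Time', 'Sender', 'Receiver', 'Content'])

    # Walk the folder and its subfolders and process every HTML file
    for root, dirs, files in os.walk(folder_path):
        for name in files:
            if name.endswith('.html'):
                extract_html_info(os.path.join(root, name), csv_writer)

print("Data written to the CSV file.")
```

In this version I changed the part that extracts the timestamp. A loop goes through all the `meta` tags and skips the ones whose `content` attribute contains `charset`. Once a suitable tag is found, its `content` attribute is assigned to `timestamp`. Hopefully this runs and extracts the timestamp correctly. If you have further questions, feel free to ask.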
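The underlying pattern is to check the result of `find()` before indexing it. Below is a minimal sketch of that check; the helper name `meta_content` and the sample HTML are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup


def meta_content(soup, http_equiv, default=""):
    """Return the content attribute of a matching <meta> tag, or a default."""
    tag = soup.find("meta", attrs={"http-equiv": http_equiv})
    if tag is None:
        # find() returns None when nothing matches; indexing it would raise TypeError
        return default
    return tag.get("content", default)


html = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    '</head><body><p>hi</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

found = meta_content(soup, "Content-Type")
missing = meta_content(soup, "Content=-Type")  # misspelled value, as in the traceback
print(repr(found), repr(missing))
```

With the misspelled attribute value the helper returns the default instead of raising, which makes the mismatch easy to spot in the output.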

import requests
from bs4 import BeautifulSoup
import csv

# Send an HTTP request to fetch the page
url = "https://item.jd.com/100016034394.html"  # replace with the URL of the target product review page
response = requests.get(url)
html = response.text

# Parse the page with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Get all reviews
comments = soup.find_all("div", class_="comment-item")

# Create the CSV file and write the data
filename = "product_reviews.csv"
with open(filename, "w", encoding="utf-8-sig", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["User", "Rating", "Review"])  # header row

    # Write one row per review
    for comment in comments:
        user = comment.find("div", class_="user-info").find("span").text.strip()
        rating = comment.find("div", class_="comment-star").find("i")["class"][1].split("-")[1]
        content = comment.find("div", class_="comment-con").find("p").text.strip()
        writer.writerow([user, rating, content])

print(f"Scraped and saved as {filename}")

HTTPSConnectionPool(host='www.jd.com', port=443): Max retries exceeded with url: /?from=pc_item_sd (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000015FCC87E7C0>, 'Connection to www.jd.com timed out. (connect timeout=10)'))

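This is a connection timeout error. It can happen when the target site takes too long to respond or when the network connection is unstable. You can try the following:

1. Increase the request timeout. Pass a larger `timeout` value to `requests.get()`, for example `timeout=30`, which allows up to 30 seconds for establishing the connection and, separately, for waiting on the server's response.

```python
response = requests.get(url, timeout=30)
```

2. Check your network connection and whether the target site is reachable. Make sure your network works and that you can open the target URL in a browser.

3. Use a proxy server. If your network is restricted or the target site limits crawlers, you can send the request through a proxy by passing the proxy address and port in the `proxies` parameter of `requests.get()`.

```python
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}
response = requests.get(url, proxies=proxies)
```

4. Try a different network environment. If none of the above helps, run the code on another network connection to rule out a network problem.

Choose whichever option fits your situation so that the connection can be established and the page data fetched.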
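The timeout advice can be combined with automatic retries. The following is a sketch only, assuming a retry count of 3, a backoff factor of 1.0 and a `(connect, read)` timeout of `(10, 30)` seconds; the function names `build_session` and `fetch` are hypothetical, and none of these values come from the original answer.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=1.0):
    """Create a Session that retries failed connections with increasing delays."""
    retry = Retry(
        total=total_retries,
        connect=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=(500, 502, 503, 504),  # also retry on these server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch(url, session=None, timeout=(10, 30)):
    """Return the page text, or None if the request still fails after retries."""
    session = session or build_session()
    try:
        # timeout is a (connect, read) tuple in seconds
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
```

A call such as `html = fetch("https://item.jd.com/100016034394.html")` would then return `None` instead of raising when the site stays unreachable, so the caller can decide whether to skip the page or stop.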
