
Python crawler: extracting untagged text that follows a <br> tag

Time: 2023-11-11 17:03:21 Views: 137

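In Python, the third-party library `beautifulsoup4` can parse an HTML page and expose its text content. Text that follows a `<br>` tag without being wrapped in any tag is stored in the parse tree as a `NavigableString`, which is the `next_sibling` of that `<br>`. The example below assumes you want to scrape every such piece of untagged text from a page:

```python
import requests
from bs4 import BeautifulSoup, Comment, NavigableString

# Send a GET request to fetch the page content
response = requests.get('https://example.com/page')

# Parse the page and visit every <br> tag
soup = BeautifulSoup(response.text, 'html.parser')
for br_tag in soup.find_all('br'):
    next_node = br_tag.next_sibling
    # Untagged text is a NavigableString; skip tags, comments, and None
    if isinstance(next_node, NavigableString) and not isinstance(next_node, Comment):
        text = next_node.strip()
        if text:  # ignore whitespace-only nodes between tags
            print(text)
```

The code first calls `requests.get()` to send a GET request and fetch the page. It then parses the response with `BeautifulSoup` and uses `find_all()` to collect every `<br>` tag. For each `<br>`, `next_sibling` returns the node that immediately follows it. If that node is a `NavigableString` (and not a comment), it is the untagged text we are after, so the code strips the surrounding whitespace and prints it when something remains. If the next node is another tag, such as a second `<br>` or a `<span>`, it is skipped: a following `<br>` is handled by its own loop iteration, so each piece of text is printed only once, and text inside another tag is by definition not untagged.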
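To try the idea without a network request or a third-party install, here is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`. It is not the BeautifulSoup approach described above, only the same rule expressed as parser events: keep a text chunk when the most recent tag seen was a `<br>`. The names `TextAfterBrParser`, `text_after_br`, and `sample_html` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TextAfterBrParser(HTMLParser):
    """Collect text chunks that directly follow a <br> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.after_br = False  # True while the most recent tag seen was <br>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Only <br> arms the flag; any other opening tag wraps the text that follows.
        self.after_br = (tag == "br")

    def handle_endtag(self, tag):
        # "<br/>" fires handle_starttag and then handle_endtag,
        # so a closing "br" must not disarm the flag.
        if tag != "br":
            self.after_br = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags: leave the flag unchanged
        if self.after_br:
            self.texts.append(text)
        self.after_br = False


def text_after_br(html):
    """Return the stripped text chunks that immediately follow a <br>."""
    parser = TextAfterBrParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Made-up sample markup, assumed only for this demonstration
sample_html = """
<div>
  <b>Name:</b> ignored<br>
  first line<br/>
  <br>
  second line
  <br><span>wrapped</span>
</div>
"""

print(text_after_br(sample_html))  # ['first line', 'second line']
```

In the sample, `ignored` follows a closing `</b>` rather than a `<br>`, and `wrapped` sits inside a `<span>`, so neither is collected. Consecutive `<br>` tags do not produce duplicates because the flag is consumed by the first non-empty text chunk.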
