Extracting text content with XPath after fetching a web page with requests
Date: 2024-10-10 22:14:48
To fetch a web page with `requests` and extract its text content using the XPath support in `lxml`, follow these steps:
1. Import the required libraries[^1]:
```python
import requests
from lxml import etree
```
2. Define the page URL:
```python
url = "https://example.com/news"  # replace with the actual page you want to scrape
```
3. Send a GET request to retrieve the page content:
```python
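response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
html_content = response.text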
```
4. Parse the HTML content:
```python
html_tree = etree.HTML(html_content)
```
5. Use XPath expressions to locate the text content of specific elements:
```python
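# Assumption: each headline is an <a> inside an <h2 class="news-title">,
# and the link is that <a>'s href attribute (an <h2> itself has no href).
titles_xpath = "//h2[@class='news-title']/a/text()"
links_xpath = "//h2[@class='news-title']/a/@href"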
titles = html_tree.xpath(titles_xpath)
links = html_tree.xpath(links_xpath)
```
6. Extract and print the information:
```python
for title, link in zip(titles, links):
    print(f"Title: {title.strip()}\nLink: {link}\n")  # strip() removes any surrounding whitespace
```
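The code above fetches the page at the given URL and extracts every headline that matches the XPath expression, together with its link. Because `zip` pairs the two lists by position, the pairs are only correct when both XPath queries return their results in the same order with no gaps.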
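If some headlines have no link, or contain nested tags, two separate page-wide queries can drift out of step. One possible alternative, sketched below, is to select each `<h2>` first and then run relative XPath expressions inside it, so every title stays attached to its own link. The sample markup, the `SAMPLE_HTML` constant, and the `extract_headlines` helper are assumptions made for illustration. The sketch parses an inline string rather than a live response so it runs without network access.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2 class="news-title"><a href="/news/1"> First <b>headline</b> </a></h2>
  <h2 class="news-title">Headline without a link</h2>
  <h2 class="news-title"><a href="https://example.org/2">Second headline</a></h2>
</body></html>
"""


def extract_headlines(html_text, base_url):
    """Return (title, absolute_link_or_None) pairs, one per matching <h2>."""
    tree = etree.HTML(html_text)
    results = []
    for h2 in tree.xpath("//h2[@class='news-title']"):
        # normalize-space(.) joins all descendant text and collapses whitespace.
        title = h2.xpath("normalize-space(.)")
        # A relative XPath (starting with '.') searches only inside this <h2>.
        hrefs = h2.xpath(".//a/@href")
        # urljoin turns relative hrefs into absolute URLs against the page URL.
        link = urljoin(base_url, hrefs[0]) if hrefs else None
        results.append((title, link))
    return results


if __name__ == "__main__":
    for title, link in extract_headlines(SAMPLE_HTML, "https://example.com/news"):
        print(f"Title: {title}\nLink: {link}\n")
```

With a real page, `response.text` from the request step would be passed in place of `SAMPLE_HTML`, and `response.url` would be a reasonable choice for `base_url`.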