
Scraping a web page with requests and extracting its text content with XPath

Time: 2024-10-10 22:14:48 Views: 38

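To scrape a web page with `requests` and extract its text content with `lxml` XPath expressions, you can follow these steps:

1. Import the required libraries[^1]:

```python
import requests
from lxml import etree
```

2. Define the page URL:

```python
url = "https://example.com/news"  # replace with the actual page you want to scrape
```

3. Send a GET request to fetch the page content:

```python
response = requests.get(url, timeout=10)  # the timeout keeps the call from hanging indefinitely
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html_content = response.text
```

4. Parse the HTML content:

```python
html_tree = etree.HTML(html_content)
```

5. Use XPath expressions to locate the text of specific elements:

```python
# Assume each news title is in an h2 tag with class 'news-title',
# and the link is in the href attribute of that same tag.
# On many real pages the link sits on a nested <a> instead; in that case
# use "//h2[@class='news-title']/a/text()" and "//h2[@class='news-title']/a/@href".
titles_xpath = "//h2[@class='news-title']/text()"
links_xpath = "//h2[@class='news-title']/@href"

titles = html_tree.xpath(titles_xpath)
links = html_tree.xpath(links_xpath)
```

6. Extract and print the information:

```python
for title, link in zip(titles, links):
    # strip() removes any leading or trailing whitespace
    print(f"Title: {title.strip()}\nLink: {link}\n")
```

The code above fetches the page at the given URL and extracts every news title and link that matches the XPath expressions.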
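One thing to watch in the steps above: `zip` pairs titles and links purely by position, so a headline that has no link (or that yields several text nodes) shifts every later pair. A minimal sketch of one way around this is to select each `<h2>` element first and then run relative XPath queries inside it. The sample markup, the nested `<a>` structure, and the helper name `extract_headlines` are assumptions for illustration, not part of the original answer.

```python
from lxml import etree

# Hypothetical sample markup, standing in for response.text from step 3.
SAMPLE_HTML = """
<html><body>
  <h2 class="news-title"><a href="/news/1"> First headline </a></h2>
  <h2 class="news-title">Headline without a link</h2>
  <h2 class="news-title"><a href="/news/3">Third <em>headline</em></a></h2>
</body></html>
"""


def extract_headlines(html_text):
    """Return a list of (title, link) tuples; link is None when absent."""
    tree = etree.HTML(html_text)
    results = []
    for h2 in tree.xpath("//h2[@class='news-title']"):
        # string(.) concatenates all descendant text nodes of this element.
        title = h2.xpath("string(.)").strip()
        # A relative XPath (leading ".") only searches inside the current <h2>.
        hrefs = h2.xpath(".//a/@href")
        results.append((title, hrefs[0] if hrefs else None))
    return results


if __name__ == "__main__":
    for title, link in extract_headlines(SAMPLE_HTML):
        print(f"Title: {title}\nLink: {link}\n")
```

To use this on a live page, pass `response.text` from step 3 to `extract_headlines` and adjust the class name and the inner `.//a/@href` path to match the real markup.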
