
Scraping Baidu News with XPath in Python

Date: 2024-09-30 11:00:56 Views: 102

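With `lxml`, Python can query a web page using XPath expressions, which makes it straightforward to scrape content such as Baidu News. `BeautifulSoup` is a common alternative, but it does not support XPath; it selects nodes with CSS selectors through `.select()` instead. The basic steps are:

1. Install the required libraries: make sure `requests` is installed for sending HTTP requests, along with `lxml` for parsing the HTML and evaluating XPath. Install `beautifulsoup4` as well if you want the CSS-selector alternative.

```bash
pip install requests lxml
# optional, only for the BeautifulSoup alternative
pip install beautifulsoup4
```

2. Send a request to get the HTML: use `requests.get()` to fetch the source of the page that contains Baidu News.

```python
import requests

url = 'https://news.baidu.com/'  # Baidu News home page URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html_content = response.text
```

3. Parse the HTML: build a tree with `lxml`'s `etree` module and call `.xpath()` with an XPath expression, or use `BeautifulSoup`'s `.select()` with a CSS selector, to pick out the nodes you need. The class names in these examples (`news-item`, `title`, `time`) are illustrative; inspect the actual page markup and adjust them.

```python
from lxml import etree
from bs4 import BeautifulSoup

# With lxml: etree.HTML() tolerates HTML that is not well-formed XML
tree = etree.HTML(html_content)
items = tree.xpath('//div[@class="news-item"]')

# Or with BeautifulSoup, which takes CSS selectors rather than XPath
soup = BeautifulSoup(html_content, 'lxml')
items_css = soup.select('.news-item')  # or 'ul.news-list li'
```

4. Extract the data: pull the fields you need, such as title, link, and publication time, out of each selected node.

```python
for item in soup.select('.news-item'):
    title_link = item.find('a', class_='title')
    title = title_link.text
    link = title_link['href']
    pub_date = item.find('span', class_='time').text
    print(f"Title: {title}, Link: {link}, Published: {pub_date}")
```

Note that websites may have anti-scraping measures, and frequent scraping may violate their terms of service, so in practice you should respect and comply with the relevant rules.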
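
The extraction step can also be done with XPath alone, without BeautifulSoup. The following is a minimal sketch, not a definitive implementation: the `SAMPLE_HTML` markup, the `extract_news` helper, and the `news-item`, `title`, and `time` class names are assumptions made for illustration, and the real Baidu News markup will differ. It runs on an inline HTML string so it can be tried without any network access.

```python
from lxml import etree

# Hypothetical sample markup for illustration only; the real page differs.
SAMPLE_HTML = """
<html><body>
  <div class="news-item">
    <a class="title" href="https://example.com/a">First headline</a>
    <span class="time">2024-09-30</span>
  </div>
  <div class="news-item">
    <a class="title" href="https://example.com/b">Second headline</a>
    <span class="time">2024-09-29</span>
  </div>
</body></html>
"""


def extract_news(html_text):
    """Return one dict (title, link, pub_date) per news item found."""
    tree = etree.HTML(html_text)  # lenient HTML parser
    results = []
    for item in tree.xpath('//div[@class="news-item"]'):
        # A leading "." makes the XPath relative to the current item.
        titles = item.xpath('.//a[@class="title"]/text()')
        links = item.xpath('.//a[@class="title"]/@href')
        dates = item.xpath('.//span[@class="time"]/text()')
        results.append({
            "title": titles[0].strip() if titles else None,
            "link": links[0] if links else None,
            "pub_date": dates[0].strip() if dates else None,
        })
    return results


news = extract_news(SAMPLE_HTML)
for entry in news:
    print(entry)
```

Each field falls back to `None` when its node is missing, so one incomplete item does not abort the whole loop. To use this on a live page, pass `response.text` to `extract_news` after adjusting the XPath expressions to the markup you actually see.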
