Task: Use Selenium and lxml to scrape the title and answer count of a trending question on Zhihu, and save the result to a zhihu.txt file. https://www.zhihu.com/knowledge-plan/hot-question/hot/0/hour
Date: 2024-11-09 15:30:12 Views: 35
Scraping information from Zhihu with Selenium and the lxml library takes a few steps. First, make sure the `selenium`, `requests`, and `lxml` packages are installed. Below is a basic Python script that visits the given URL (either through a simulated browser or a plain HTTP request), locates the question title and answer count, and saves them to a text file:
```python
from selenium import webdriver
import requests
from lxml import etree
import time

# If you use Selenium, install the matching WebDriver first, e.g. ChromeDriver
# driver = webdriver.Chrome('path/to/chromedriver')

url = 'https://www.zhihu.com/knowledge-plan/hot-question/hot/0/hour'

def get_html(url):
    # Plain requests version; if you drive a real browser with Selenium instead,
    # skip this function and use driver.page_source (see the commented lines below)
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def parse_html(html):
    parser = etree.HTMLParser()
    tree = etree.fromstring(html, parser=parser)
    # The XPath expressions depend on Zhihu's current markup; here we assume the
    # title sits in an <h1> and the answer count in a span with class "Count"
    title_xpath = '//*[@id="QuestionTitle"]/h1/text()'
    answer_count_xpath = '//*[contains(@class, "AnswerCount")]/span[@class="Count"]/text()'
    title = tree.xpath(title_xpath)[0]
    answer_count = tree.xpath(answer_count_xpath)[0].strip()
    return title, int(answer_count)

def save_to_file(title, answer_count):
    with open('zhihu.txt', 'a', encoding='utf-8') as f:
        f.write(f"Title: {title}\nAnswers: {answer_count}\n\n")

if __name__ == "__main__":
    html = get_html(url)
    title, answer_count = parse_html(html)
    save_to_file(title, answer_count)
    print(f"Saved title '{title}' and answer count '{answer_count}' to zhihu.txt.")

# If you use Selenium instead of requests, point webdriver at your driver path,
# let the page finish rendering, then grab the rendered source:
# driver.get(url)
# time.sleep(5)  # wait for the page to load
# html = driver.page_source
```
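Because the assignment asks for Selenium specifically, here is a minimal sketch of a browser-driven replacement for `get_html`, using the Selenium 4 `Service` API. The ChromeDriver path, the headless flag, and the wait condition are assumptions and may need adjusting for the live page:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_html_with_selenium(url, driver_path='path/to/chromedriver'):
    # Selenium 4 takes the driver path through a Service object
    options = webdriver.ChromeOptions()
    options.add_argument('--headless=new')  # run without opening a visible window
    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    try:
        driver.get(url)
        # Wait up to 10 seconds for an <h1> to appear instead of a fixed sleep
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, 'h1'))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The rendered source returned here can be fed straight into the same `parse_html` and `save_to_file` functions above.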
Note: Zhihu's page structure may change, so the XPath selectors above may need adjusting to match the current layout. Zhihu also employs anti-scraping measures: frequent requests may require setting a User-Agent, throttling requests, and similar strategies, and can even trigger a CAPTCHA.
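For the anti-scraping point, a common first step is to send a browser-like User-Agent header and pace the requests. A minimal sketch, where the UA string and the delay range are just placeholders:

```python
import random
import time

import requests

HEADERS = {
    # Any recent desktop-browser UA string can go here; this one is a placeholder
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/120.0 Safari/537.36')
}

def polite_get(url):
    # Random short delay between requests to avoid hammering the site
    time.sleep(random.uniform(1, 3))
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text
```

If Zhihu still serves a login wall or a CAPTCHA, you would additionally need logged-in cookies (or the Selenium route above), which is beyond this basic example.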