Scraping Zhihu User Information with a Python Crawler
Time: 2023-11-08 12:48:20
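You can scrape Zhihu user information with Python's Requests and BeautifulSoup libraries. First sign in to Zhihu to obtain a cookie, then use the signed-in session to fetch the user's profile page, and finally parse that page with BeautifulSoup to extract the user information.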
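The cookie step can also be sketched in isolation. As an alternative to posting the sign-in form, one option is to copy the `Cookie` header from a browser that is already signed in and load it into a `requests.Session`. The helper name `session_from_cookie_string`, the `.zhihu.com` domain default, and the cookie names and values below are illustrative assumptions, not Zhihu's actual cookie set.

```python
import requests


def session_from_cookie_string(cookie_string, domain=".zhihu.com"):
    """Build a requests.Session preloaded with cookies from a 'k1=v1; k2=v2' string."""
    session = requests.Session()
    for pair in cookie_string.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, since cookie values may contain '='
        name, value = pair.strip().split("=", 1)
        session.cookies.set(name.strip(), value.strip(), domain=domain)
    return session


# Placeholder cookie names and values, for illustration only
session = session_from_cookie_string("token_a=abc123; token_b=xyz=789")
print(session.cookies.get_dict())
```

A session built this way sends the stored cookies with every later `session.get(...)` call to a matching domain, so no password has to appear in the script.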
Here is the sample code:
```python
import requests
from bs4 import BeautifulSoup

# Sign in to Zhihu and obtain cookies
session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
login_url = 'https://www.zhihu.com/signin'
response = session.get(login_url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# find() returns None when an element is absent, so check before indexing
xsrf_input = soup.find('input', attrs={'name': '_xsrf'})
captcha_img = soup.find('img', attrs={'class': 'Captcha-englishImg'})
if xsrf_input is None or captcha_img is None:
    raise RuntimeError('Sign-in form fields not found; the page layout may differ')
_xsrf = xsrf_input['value']
captcha_url = captcha_img['src']

# Simulate the sign-in, then fetch the user's profile page
login_data = {
    '_xsrf': _xsrf,
    'email': 'your_account',
    'password': 'your_password',
    'captcha': input('Enter the captcha shown at ' + captcha_url + ': '),
    'remember_me': 'true'
}
session.post(login_url, headers=headers, data=login_data, timeout=10)
user_url = 'https://www.zhihu.com/people/username'
response = session.get(user_url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')


def text_or_none(node):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return node.text.strip() if node is not None else None


def detail(item, value_class='ProfileHeader-detailValue'):
    """Return the value text of one ProfileHeader info item, or None if absent."""
    block = soup.find('div', attrs={'class': 'ProfileHeader-infoItem ProfileHeader-' + item})
    value = block.find('span', attrs={'class': value_class}) if block is not None else None
    return text_or_none(value)


# Parse the page to extract the user information
name = text_or_none(soup.find('span', attrs={'class': 'ProfileHeader-name'}))
headline = text_or_none(soup.find('span', attrs={'class': 'RichText ztext ProfileHeader-headline'}))
description = detail('description', value_class='RichText ztext')
location = detail('location')
business = detail('business')
employment = detail('employment')
position = detail('position')
education = detail('education')
major = detail('major')
```
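In the code above, replace `your_account` and `your_password` with your Zhihu login account and password, and replace `username` with the username of the user you want to scrape. To reduce the chance of being flagged by Zhihu's anti-scraping mechanisms, it also helps to add random wait times between requests and to vary request headers such as the User-Agent.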
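A minimal sketch of the random wait and User-Agent suggestion follows. The helper name `polite_get`, the delay bounds, and the `USER_AGENTS` pool are assumptions chosen for illustration; nothing here is known to be sufficient against Zhihu's actual detection.

```python
import random
import time

# Hypothetical pool of User-Agent strings; replace with values from browsers you actually use
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15",
]


def polite_get(session, url, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Wait a random interval, then GET url with a randomly chosen User-Agent.

    `session` is any object with a requests-style get(url, headers=..., timeout=...).
    `sleep` is injectable so the delay can be skipped or recorded in tests.
    """
    sleep(random.uniform(min_delay, max_delay))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return session.get(url, headers=headers, timeout=10)
```

With a `requests.Session` in hand, `polite_get(session, user_url)` could stand in for `session.get(user_url, headers=headers)`. Keeping the delay inside the helper means every request is spaced out, rather than relying on each call site to remember to sleep.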