Scraping Zhihu Data with Python: Integrated Code That Reproduces the 八爪鱼 (Octoparse) Zhihu-Crawling Workflow, Plus Data Preprocessing...
Published: 2023-09-20 17:04:45
Below is the Python code for scraping Zhihu data. Concurrent fetching is handled by the octopus library (standing in for the 八爪鱼/Octoparse crawler workflow), and the pandas library handles the data preprocessing:
```python
import requests
import json
import pandas as pd
from octopus import Octopus

# Request headers so Zhihu serves the API responses to a browser-like client
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299'
}

# Build the search-API URL for a given keyword and pagination offset
def get_url(keyword, offset):
    url = 'https://www.zhihu.com/api/v4/search_v3?t=general&q={}&correction=1&offset={}&limit=20&lc_idx=0&show_all_topics=0&search_hash_id='.format(keyword, offset)
    return url

# Crawl Zhihu search results concurrently and save them as a JSON file
def crawl_data(keyword):
    otto = Octopus(
        concurrency=8, auto_start=True, expiration_in_seconds=120,
        raise_all_exceptions=True, worker_lifespan_in_seconds=120
    )
    result = []
    for i in range(0, 100, 20):  # first five pages, 20 results per page
        url = get_url(keyword, i)
        otto.enqueue(get_data, url)  # queue one fetch task per page
    otto.wait()  # block until all queued tasks have finished
    for res in otto.results():
        result += res
    with open('zhihu.json', 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False)

# Parse the JSON file and preprocess the records with pandas
def process_data():
    with open('zhihu.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    results = []
    for item in data:
        result = {}
        # Column names stay in Chinese: 问题 = question, 链接 = link, 答案 = answer
        result['问题'] = item['highlight']['title']
        result['链接'] = 'https://www.zhihu.com/question/{}'.format(item['object']['question']['id'])
        result['答案'] = item['highlight']['description']
        results.append(result)
    df = pd.DataFrame(results)
    df.to_excel('zhihu.xlsx', index=False)

# Fetch one result page and keep only the answer-type items
def get_data(url):
    response = requests.get(url, headers=headers)
    data = response.json()['data']
    result = []
    for item in data:
        if item['object']['type'] == 'answer':
            result.append(item)
    return result

if __name__ == '__main__':
    crawl_data('Python')
    process_data()
```
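Note that the crawl step above depends on the third-party octopus package for its concurrent requests. If that package is unavailable, the same parallel page fetching can be sketched with the standard library's concurrent.futures instead; the crawl_data_threaded function below is only an illustrative stand-in (it reuses get_url and get_data from the code above), not the original library-based implementation:
```python
import json
from concurrent.futures import ThreadPoolExecutor

def crawl_data_threaded(keyword, pages=5, page_size=20):
    """Fetch several search-result pages in parallel and save them to zhihu.json."""
    urls = [get_url(keyword, i * page_size) for i in range(pages)]
    result = []
    # Up to 8 worker threads, mirroring concurrency=8 in the octopus version
    with ThreadPoolExecutor(max_workers=8) as pool:
        for page_items in pool.map(get_data, urls):
            result += page_items
    with open('zhihu.json', 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False)
```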
Putting it together, the script scrapes Zhihu search results for the keyword “Python”, saves the raw data to a JSON file, preprocesses it with pandas, and finally writes the result to an Excel file. You can change the search keyword and the output file format to suit your own needs, as in the sketch below.
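For instance, assuming the functions and imports defined above are available, the entry point could be adapted as follows; the keyword 机器学习 and the zhihu.csv filename are purely illustrative, and the CSV export is just one possible alternative output format:
```python
if __name__ == '__main__':
    crawl_data('机器学习')  # any search keyword can be substituted here
    process_data()
    # Optionally convert the Excel output to CSV as an alternative format;
    # utf-8-sig keeps the Chinese column names readable when opened in Excel
    df = pd.read_excel('zhihu.xlsx')
    df.to_csv('zhihu.csv', index=False, encoding='utf-8-sig')
```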