Scraping comments on Jay Chou's songs from QQ Music with Python and drawing a word cloud
Time: 2024-05-14 22:12:03 · Views: 176
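Because QQ Music's anti-scraping measures are fairly strict, the script first obtains cookies from QQ Music (it requests the homepage rather than performing a real login), then scrapes the comments and generates the word cloud.
The complete code follows: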
```python
import time

import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image
from wordcloud import WordCloud

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')


# Request the QQ Music homepage and return the cookies it sets.
# This is an anonymous session, not a real login.
def get_cookies():
    headers = {
        'Referer': 'https://y.qq.com/',
        'User-Agent': USER_AGENT
    }
    url = 'https://y.qq.com/'
    response = requests.get(url, headers=headers, timeout=10)
    cookies = response.cookies.get_dict()
    return cookies


# Fetch one page of comments for a song
def get_comments(song_id, page, cookies=None):
    headers = {
        'Referer': 'https://y.qq.com/n/yqq/song/{}.html'.format(song_id),
        'User-Agent': USER_AGENT
    }
    url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg'
    params = {
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0',
        'cid': '205360772',
        'reqtype': '2',
        'biztype': '1',
        'topid': song_id,
        'cmd': '8',
        'pagenum': page,
        'pagesize': '25',
        'lasthotcommentid': '',
        'domain': 'qq.com',
        'ct': '24',
        'cv': '10101010'
    }
    # Send the cookies passed in by the caller instead of a hard-coded Cookie header
    response = requests.get(url, headers=headers, params=params,
                            cookies=cookies, timeout=10)
    response.raise_for_status()
    json_data = response.json()
    comment_list = json_data['comment']['commentlist'] or []
    comments = []
    for comment in comment_list:
        # Skip entries without text instead of raising KeyError
        content = comment.get('rootcommentcontent')
        if content:
            comments.append(content)
    return comments


# Generate and display the word cloud
def generate_wordcloud(text, mask_path, font_path=None):
    # Load the mask image
    mask = np.array(Image.open(mask_path))
    # Configure the word cloud; Chinese text needs a font with CJK glyphs
    wc = WordCloud(font_path=font_path, background_color="white", max_words=2000,
                   mask=mask, contour_width=1, contour_color='steelblue')
    # Build the word cloud from the text
    wc.generate(text)
    # Show the word cloud
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()


if __name__ == '__main__':
    # Song id
    song_id = '108119'
    # File used to store the comments
    filename = 'comments.txt'
    # Path to the mask image
    mask_path = 'jay.jpg'
    # Set this to a font file with CJK glyphs so Chinese text renders;
    # None falls back to WordCloud's default font, which lacks them
    font_path = None
    # Get cookies
    cookies = get_cookies()
    # Scrape ten pages of comments, pausing between requests
    comments = []
    for i in range(1, 11):
        comments += get_comments(song_id, i, cookies)
        time.sleep(1)
    # Save the comments to the file
    with open(filename, 'w', encoding='utf-8') as f:
        for comment in comments:
            f.write(comment + '\n')
    # Read the comments back
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    # Generate the word cloud
    generate_wordcloud(text, mask_path, font_path)
```
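The response-parsing step in `get_comments()` can also be isolated into a small helper that tolerates missing fields. The following is only a minimal sketch: the field names `comment`, `commentlist`, and `rootcommentcontent` are taken from the script above, while the helper name `extract_comments` and the idea that a failed or empty response simply omits those fields are assumptions.

```python
def extract_comments(payload):
    """Return the non-empty comment texts from a decoded response dict.

    `payload` is assumed to have the shape used in the script above:
    {'comment': {'commentlist': [{'rootcommentcontent': '...'}, ...]}}
    Missing or empty fields yield an empty list instead of an exception.
    """
    comment_block = payload.get('comment') or {}
    comment_list = comment_block.get('commentlist') or []
    texts = []
    for item in comment_list:
        # An entry may lack the text field or carry None; treat both as empty
        content = (item.get('rootcommentcontent') or '').strip()
        if content:
            texts.append(content)
    return texts
```

Because the helper takes a plain dict, it can be checked against a hand-written sample without sending any request to QQ Music.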
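Points to note:
1. The cookies sent by `get_comments()` must be valid for your own session and can be copied from the browser's developer tools; `get_cookies()` only requests the homepage and does not perform a login;
2. The `cid` parameter in `get_comments()` is described as QQ Music's comment category id; check its value (together with `topid`, the song id) in the browser's developer tools for the song you are scraping;
3. The `mask_path` argument of `generate_wordcloud()` is the path to the mask image, which must be prepared in advance;
4. Because QQ Music's anti-scraping measures are fairly strict, add a suitable delay between requests to avoid having your IP blocked (the script sleeps 1 second per page).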
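Notes 1 and 4 can be sketched with two small standard-library helpers. Both names, `parse_cookie_header` and `jittered_delay`, are hypothetical and do not appear in the script; the default of up to 0.5 seconds of random jitter is likewise an assumption, not a value known to satisfy QQ Music's limits.

```python
import random


def parse_cookie_header(raw):
    """Turn a 'name=value; name2=value2' string copied from the browser into a dict."""
    cookies = {}
    for part in raw.split(';'):
        # Split on the first '=' only, since cookie values may contain '='
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def jittered_delay(base=1.0, jitter=0.5):
    """Return a pause in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)
```

The resulting dict can be passed to `requests.get(..., cookies=...)` in place of a hard-coded `Cookie` header, and `time.sleep(jittered_delay())` can replace the fixed `time.sleep(1)` so that requests are not sent at perfectly regular intervals.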