Python implementation: crawl the profiles of 1,000 Weibo users (including their posts, followings, and followers), clean the crawled data and extract user features, cluster the 1,000 users on those features, build a user profile for each cluster (visualized, for example, as a word cloud), and recommend topics based on recent trending Weibo topics.
Posted: 2024-04-20 09:26:23
Collecting and Analyzing Sina Weibo User Data with Python
Below is an example Python script implementing this workflow:
```python
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Crawl one user's profile: posts, followings, and followers.
# The CSS selectors below are illustrative placeholders; the real Weibo
# pages are rendered client-side by JavaScript, so in practice you would
# use the mobile API or a headless browser rather than plain requests.
def crawl_user_info(user_id):
    url = f'https://weibo.com/u/{user_id}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    weibo_content = [weibo.text for weibo in soup.select('.WB_text')]
    followings = [following.text for following in soup.select('.follow_item_name')]
    # Followers actually live on a separate "fans" page; this selector is
    # a placeholder and would need to target that page in a real crawler.
    followers = [follower.text for follower in soup.select('.follow_item_name')]
    return {
        'user_id': user_id,
        'weibo_content': weibo_content,
        'followings': followings,
        'followers': followers
    }

# Crawl 1000 users (sequential IDs are placeholders; real Weibo user IDs
# would come from a seed list)
user_info_list = []
for user_id in range(1, 1001):
    user_info = crawl_user_info(user_id)
    user_info_list.append(user_info)

# Clean the data: build one document per user from their posts
corpus = []
for user_info in user_info_list:
    weibo_content = ' '.join(user_info['weibo_content'])
    corpus.append(weibo_content)

# Vectorize the text with TF-IDF. Note: TfidfVectorizer tokenizes on
# whitespace and punctuation, so Chinese text should be word-segmented
# (e.g. with jieba) before this step.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Cluster the users
n_clusters = 5  # number of clusters
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
kmeans.fit(X)

# Group users by cluster label to build per-cluster profiles
user_profiles = [[] for _ in range(n_clusters)]
for i, user_info in enumerate(user_info_list):
    cluster_label = kmeans.labels_[i]
    user_profiles[cluster_label].append(user_info)

# Render one word cloud per cluster (pass font_path=... to WordCloud for
# Chinese text, since the default font cannot render CJK glyphs)
for i, user_profile in enumerate(user_profiles):
    text = ' '.join(weibo_content
                    for user_info in user_profile
                    for weibo_content in user_info['weibo_content'])
    wordcloud = WordCloud(background_color='white').generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Cluster {i+1} User Profile')
    plt.show()

# Topic recommendation: score each user by how often the trending topics
# appear in their posts, then surface the top users per cluster
hot_topics = ['topic1', 'topic2', 'topic3']  # stand-ins for recent trending Weibo topics
for i, user_profile in enumerate(user_profiles):
    topic_scores = []
    for user_info in user_profile:
        weibo_content = ' '.join(user_info['weibo_content'])
        score = sum(weibo_content.count(topic) for topic in hot_topics)
        topic_scores.append(score)
    top_user_indices = sorted(range(len(topic_scores)),
                              key=lambda k: topic_scores[k], reverse=True)[:5]
    top_users = [user_profile[index] for index in top_user_indices]
    print(f'Top users in Cluster {i+1}:')
    for user_info in top_users:
        print(f"User ID: {user_info['user_id']}, Weibo Content: {user_info['weibo_content'][:10]}")
```
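One caveat worth spelling out: `TfidfVectorizer` splits tokens on whitespace and punctuation, which does not segment Chinese text into words. A word segmenter such as jieba is the usual fix; as a dependency-free illustration, the sketch below plugs a crude character-bigram tokenizer into the vectorizer (the `char_bigrams` helper is my stand-in for a real segmenter, not part of the original pipeline):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def char_bigrams(text):
    """Crude fallback tokenizer: overlapping character bigrams.

    A real pipeline would pass a Chinese word segmenter instead,
    e.g. tokenizer=jieba.lcut.
    """
    text = ''.join(text.split())  # drop whitespace before windowing
    return [text[i:i + 2] for i in range(len(text) - 1)]

# token_pattern=None silences the "tokenizer overrides token_pattern" warning
vectorizer = TfidfVectorizer(tokenizer=char_bigrams, token_pattern=None)

docs = ['今天天气很好', '明天天气不好', '股市大涨很好']
X = vectorizer.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The same `tokenizer=` hook slots straight into the `TfidfVectorizer()` call in the script above without changing anything downstream.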
Note that this code is for demonstration only. A real crawler would need exception handling, a logged-in Weibo session, and possibly proxies and rate limiting to run reliably and efficiently. The user-profiling and topic-recommendation steps can likewise be adapted and tuned to your specific needs.
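As a concrete example of the exception handling mentioned above, a small retry wrapper might look like the sketch below (the function name, back-off policy, and `session` parameter are my own choices, not part of the original code; the `session` hook is also where a logged-in `requests.Session` would be injected):

```python
import time
import requests

def fetch_with_retry(url, retries=3, delay=1.0, session=None):
    """GET a URL with a timeout, retries, and linear back-off.

    `session` lets the caller pass a logged-in requests.Session
    (or a stub when testing); defaults to the bare requests module.
    """
    http = session if session is not None else requests
    last_error = None
    for attempt in range(retries):
        try:
            response = http.get(url, timeout=10)
            response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
            return response
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # wait longer after each failure
    raise last_error
```

Wrapping the `requests.get` call inside `crawl_user_info` with such a helper keeps one transient network error from aborting a 1000-user crawl.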