Task: download all post data from the "长久物流吧" (Changjiu Logistics) board on Eastmoney Guba (https://guba.eastmoney.com/list,603569.html), extract the post author, posting time, read count, comment count, post title, and post link, and write the results to the text file "data_guba_cjwl.txt", covering 2011-01-01 to the present. Then, using Python, construct a Guba information-volume indicator (designed from the posting time, read count, comment count, and post title), use it as a predictive factor, and test whether it has predictive power for Changjiu Logistics' excess returns.
Below is code that implements this task:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
# Fetch and parse one listing page of the Guba board
def get_guba_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    data_list = []
    # The CSS selectors below follow the historical div.articleh list markup;
    # adjust them if Eastmoney changes the page layout.
    for item in soup.select('div.articleh'):
        title = item.find('a').text.strip()
        link = 'https://guba.eastmoney.com' + item.find('a')['href']
        author = item.select('.l3 a')[0].text.strip()
        time = item.select('.l6')[0].text.strip()
        read = item.select('.l1')[0].text.strip()
        comment = item.select('.l2')[0].text.strip()
        data_list.append([author, time, read, comment, title, link])
    return data_list

# Crawl listing pages until an empty page is returned
url = 'https://guba.eastmoney.com/list,603569,f.html'
data_list = []
for i in range(1, 10000):
    # The pagination URL pattern may need adjusting if the site changes its scheme
    page_url = url.replace('.html', ',default,' + str(i) + '.html')
    print('Crawling page %d ...' % i)
    page_data = get_guba_data(page_url)
    if len(page_data) == 0:
        break
    data_list.extend(page_data)
# Convert the records to a DataFrame
df = pd.DataFrame(data_list, columns=['author', 'time', 'read', 'comment', 'title', 'link'])

# Parse the posting time; the listing page may omit the year, so let pandas
# infer the format and keep unparseable values as NaT instead of raising
df['time'] = pd.to_datetime(df['time'], errors='coerce')

# 'read' and 'comment' are scraped as strings, so coerce them to numbers
df['read'] = pd.to_numeric(df['read'], errors='coerce')
df['comment'] = pd.to_numeric(df['comment'], errors='coerce')

# Keep posts from 2011-01-01 onward, sort chronologically, and save to a
# tab-separated text file
df = df[df['time'] >= '2011-01-01'].sort_values('time')
df.to_csv('data_guba_cjwl.txt', index=False, sep='\t')

# Build the information-volume indicator: posting intensity (inverse of the
# time gap between consecutive posts) weighted by reads, comments and title length
df['time_diff'] = df['time'].diff().dt.total_seconds()
df['time_diff'] = df['time_diff'].replace(0, float('nan'))  # avoid division by zero
df['read_index'] = df['read'] / df['time_diff']
df['comment_index'] = df['comment'] / df['time_diff']
df['title_length'] = df['title'].str.len()
df['info_index'] = df['read_index'] * df['comment_index'] * df['title_length']

# Aggregate the indicator to daily frequency so it can be matched to daily prices
df['date'] = df['time'].dt.normalize()
daily_info = df.groupby('date', as_index=False)['info_index'].sum()

# Load daily close prices of Changjiu Logistics (603569)
df_price = pd.read_csv('cjwl_price.csv')
df_price['date'] = pd.to_datetime(df_price['date'], format='%Y-%m-%d')

# Merge the indicator and the prices on the trading date
df_merge = pd.merge(daily_info, df_price, on='date')

# Compute daily returns and a simple mean-adjusted excess return
df_merge['daily_return'] = df_merge['close'].pct_change()
df_merge['excess_return'] = df_merge['daily_return'] - df_merge['daily_return'].mean()

# Correlation between the indicator and the excess return
corr = df_merge['excess_return'].corr(df_merge['info_index'])
print('Correlation between the information indicator and excess returns: %.4f' % corr)
```
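One caveat: if the scraped read and comment counts contain abbreviated values such as "1.2万", the `pd.to_numeric` calls above will turn them into NaN. The following hypothetical helper, `parse_count`, sketches how such values could be converted instead; the "万" suffix handling is an assumption about how the site displays large counts, not something confirmed by the original code.
```python
# Hypothetical helper: convert count strings such as '1.2万' into plain numbers.
# The '万' (10,000) abbreviation is an assumption about the site's display format.
def parse_count(value):
    s = str(value).strip()
    if s.endswith('万'):
        return float(s[:-1]) * 10000  # e.g. '1.2万' -> 12000.0
    try:
        return float(s)
    except ValueError:
        return float('nan')

# This would replace the pd.to_numeric calls above, applied to the raw string columns:
# df['read'] = df['read'].apply(parse_count)
# df['comment'] = df['comment'].apply(parse_count)
```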
In the code above, we first define a function `get_guba_data()` that scrapes one listing page of the board, then loop over all pages, convert the collected records into a DataFrame, and save them to a file. Next, we build the information-volume indicator to use as a predictive factor, compute Changjiu Logistics' excess returns, and report the correlation between the two.
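A contemporaneous correlation is only weak evidence of predictability. A more direct check is to regress the next day's excess return on the current day's indicator. The sketch below assumes the `df_merge` DataFrame produced by the code above and uses `statsmodels`, which the original code does not import; treat it as one possible way to formalize the test rather than a definitive specification.
```python
import statsmodels.api as sm

# Predictive regression: next-day excess return on today's information indicator
# (assumes df_merge from the code above, with columns 'info_index' and 'excess_return')
reg_df = df_merge[['info_index', 'excess_return']].copy()
reg_df['next_excess_return'] = reg_df['excess_return'].shift(-1)  # lead the return by one day
reg_df = reg_df.dropna()

X = sm.add_constant(reg_df['info_index'])  # add an intercept term
y = reg_df['next_excess_return']
result = sm.OLS(y, X).fit()

print(result.summary())
# A statistically significant coefficient on info_index would suggest the
# indicator carries information about next-day excess returns.
```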