Building a DataFrame Yourself
Posted: 2023-07-16 19:16:27
Sure, here is how to build a DataFrame with Python's Pandas library.
First, import Pandas:
```python
import pandas as pd
```
A DataFrame can then be constructed in several ways, for example:
1. From a list of lists:
```python
data = [['Alice', 25], ['Bob', 30], ['Cathy', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
```
2. From a dictionary of columns:
```python
data = {'Name': ['Alice', 'Bob', 'Cathy'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
```
3. From a CSV file:
```python
df = pd.read_csv('data.csv')
```
These are the most common ways to construct a DataFrame; choose whichever fits your data. Pandas also provides a rich set of tools for processing and analyzing the data once it is loaded.
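Two further constructors that come up often are worth a small sketch: building from a NumPy array with explicit row labels, and from a list of per-row dictionaries (the names and numbers below are made up for illustration).

```python
import numpy as np
import pandas as pd

# From a NumPy array, with explicit row labels
arr = np.array([[25, 160.0], [30, 175.5]])
df1 = pd.DataFrame(arr, index=['Alice', 'Bob'], columns=['Age', 'Height'])

# From a list of per-row dictionaries (missing keys become NaN)
rows = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob'}]
df2 = pd.DataFrame(rows)

print(df1.loc['Alice', 'Age'])   # 25.0
print(df2['Age'].isna().sum())   # 1
```

The list-of-dicts form is handy when rows are collected incrementally (e.g. from an API), since Pandas aligns columns and fills the gaps automatically.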
Related questions
Download all posts from the "长久物流" board on Eastmoney Guba (https://guba.eastmoney.com/list,603569.html), extract each post's author, post time, read count, comment count, title, and link, and write the results to the text file "data_guba_cjwl.txt", covering 2011-01-01 to the present. Then use Python to build a forum information indicator (designed from post time, read count, comment count, and title), use it as a predictive factor, and test whether it predicts 长久物流's excess returns.
Here is code that implements the task:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
# Scrape one page of forum posts from Eastmoney Guba
def get_guba_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    data_list = []
    for item in soup.select('div.articleh'):
        title = item.find('a').text.strip()
        link = 'https://guba.eastmoney.com' + item.find('a')['href']
        author = item.select('.l3 a')[0].text.strip()
        time = item.select('.l6')[0].text.strip()
        read = item.select('.l1')[0].text.strip()
        comment = item.select('.l2')[0].text.strip()
        data_list.append([author, time, read, comment, title, link])
    return data_list
# Fetch data page by page until an empty page is returned
url = 'https://guba.eastmoney.com/list,603569,f.html'
data_list = []
for i in range(1, 10000):
    page_url = url.replace('.html', ',default,' + str(i) + '.html')
    print('Scraping page %d...' % i)
    page_data = get_guba_data(page_url)
    if len(page_data) == 0:
        break
    data_list.extend(page_data)
# Convert to a DataFrame
df = pd.DataFrame(data_list, columns=['author', 'time', 'read', 'comment', 'title', 'link'])
# Parse the post time and coerce the count columns (scraped as strings) to numbers
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
df['read'] = pd.to_numeric(df['read'], errors='coerce')
df['comment'] = pd.to_numeric(df['comment'], errors='coerce')
# Save the raw data to a tab-separated text file
df.to_csv('data_guba_cjwl.txt', index=False, sep='\t')
# Build the information indicator: reads and comments per second between posts,
# scaled by title length (total_seconds() avoids wrap-around for gaps over a day;
# a zero gap is replaced by NaN to avoid division by zero)
df['time_diff'] = df['time'].diff().fillna(pd.Timedelta(0))
df['time_diff'] = df['time_diff'].apply(lambda x: x.total_seconds()).replace(0, float('nan'))
df['read_index'] = df['read'] / df['time_diff']
df['comment_index'] = df['comment'] / df['time_diff']
df['title_length'] = df['title'].apply(len)
df['info_index'] = df['read_index'] * df['comment_index'] * df['title_length']
# Load 长久物流's daily closing prices
df_price = pd.read_csv('cjwl_price.csv')
df_price['date'] = pd.to_datetime(df_price['date'], format='%Y-%m-%d')
# Merge on the calendar date (post timestamps carry a time of day, so normalize first)
df['date'] = df['time'].dt.normalize()
df_merge = pd.merge(df, df_price, on='date')
# Daily return and excess return over the sample mean
df_merge['daily_return'] = df_merge['close'].pct_change()
df_merge['excess_return'] = df_merge['daily_return'] - df_merge['daily_return'].mean()
# Correlation between the indicator and the excess return
corr = df_merge['excess_return'].corr(df_merge['info_index'])
print('Correlation between info_index and excess return: %.4f' % corr)
```
In the code above, the function `get_guba_data()` scrapes one page of forum data; a loop then collects every page, converts the result to a DataFrame, and saves it to a file. We then compute the information indicator as the predictive factor, compute 长久物流's excess return, and report the correlation between the two.
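A single correlation says little about *predictability*; the usual check is to regress the next day's excess return on today's indicator and examine the slope's t-statistic. The sketch below uses synthetic data (the variable names mirror `info_index` and the excess return from the code above, but every number is made up) and plain NumPy least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
info_index = rng.normal(size=n)
# Synthetic next-day excess returns with a built-in slope of 0.5 on the indicator
excess_return_next = 0.5 * info_index[:-1] + 0.1 * rng.normal(size=n - 1)

# OLS of y on [1, x]: beta[0] is the intercept, beta[1] the predictive slope
X = np.column_stack([np.ones(n - 1), info_index[:-1]])
beta, *_ = np.linalg.lstsq(X, excess_return_next, rcond=None)

# t-statistic of the slope from the usual OLS variance formula
resid = excess_return_next - X @ beta
s2 = resid @ resid / (len(excess_return_next) - 2)
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se
print(beta[1], t_stat)  # slope near 0.5 with a large t-statistic
```

On the real merged data one would use `df_merge['info_index']` shifted by one day as `x` and `df_merge['excess_return']` as `y`; a t-statistic above roughly 2 in absolute value suggests the indicator has predictive content.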
Using Python, operate on the resultym.csv table (two columns, Date (YYYY/MM) and TotalPrice; the first row is the header and the first 27 rows contain data). Repeatedly call a PyTorch LSTM model built on resultym.csv (the model pipeline: read the csv, normalize TotalPrice, split train/test at 0.8, convert the splits to PyTorch tensors, define the hyperparameters, model, and optimizer, then train). The model predicts TotalPrice from Date. Assign the first predicted y_test_pred to B26, the second to B27, the third to B28, and so on in a loop until B50 is computed. After each prediction, append one row to the table: Date continues the preceding year/month sequence (2023/03, 2023/04, 2023/05 … 2025/01) and TotalPrice is set to 2222222.5. The script stops once the value for row 50 has been computed.
Here is code that implements this:
```python
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
# Read the csv table
df = pd.read_csv('resultym.csv')
# Min-max normalize the TotalPrice column
df['TotalPrice'] = (df['TotalPrice'] - df['TotalPrice'].min()) / (df['TotalPrice'].max() - df['TotalPrice'].min())
# Split into training and test sets (80/20)
train_size = int(len(df) * 0.8)
train_df = df[:train_size]
test_df = df[train_size:]
# Convert to PyTorch tensors; inputs are (batch, seq_len, features),
# targets are (batch, 1) to match the model's output shape
train_x = torch.tensor(train_df['TotalPrice'].values).view(-1, 1, 1).float()
train_y = torch.tensor(train_df['TotalPrice'].values).view(-1, 1).float()
test_x = torch.tensor(test_df['TotalPrice'].values).view(-1, 1, 1).float()
test_y = torch.tensor(test_df['TotalPrice'].values).view(-1, 1).float()
# Hyperparameters, model, loss, and optimizer
input_size = 1
output_size = 1
hidden_size = 32
num_layers = 2
learning_rate = 0.01
num_epochs = 100

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Fresh zero-initialized hidden and cell states for each forward pass
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # Map the last time step's hidden state to the output
        out = self.fc(out[:, -1, :])
        return out

model = LSTM(input_size, hidden_size, num_layers, output_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
for epoch in range(num_epochs):
    outputs = model(train_x)
    optimizer.zero_grad()
    loss = criterion(outputs, train_y)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print("Epoch: %d, loss: %1.5f" % (epoch, loss.item()))
# Predict in a loop, appending a new row after each prediction
for i in range(24):
    # Predict from the last observed TotalPrice value
    with torch.no_grad():
        x_test = torch.tensor(test_df['TotalPrice'].values[-1]).view(1, 1, 1).float()
        y_test_pred = model(x_test).item()
    # Extend the Date column by one month and append a placeholder row
    date = pd.to_datetime(test_df['Date'].iloc[-1]) + pd.DateOffset(months=1)
    date_str = date.strftime('%Y/%m')
    new_row = pd.DataFrame({'Date': date_str, 'TotalPrice': 2222222.5}, index=[len(df)])
    df = pd.concat([df, new_row], axis=0)
    test_df = df[train_size:]
    # Write the i-th prediction into df row 24 + i (spreadsheet cell B26, B27, ...);
    # positional iloc assignment updates the frame in place, unlike chained indexing
    df.iloc[24 + i, df.columns.get_loc('TotalPrice')] = y_test_pred
    # Refresh the test tensors and retrain the model
    test_x = torch.tensor(test_df['TotalPrice'].values).view(-1, 1, 1).float()
    test_y = torch.tensor(test_df['TotalPrice'].values).view(-1, 1).float()
    for epoch in range(num_epochs):
        outputs = model(train_x)
        optimizer.zero_grad()
        loss = criterion(outputs, train_y)
        loss.backward()
        optimizer.step()
    print("Prediction %d: %1.5f" % (i + 1, y_test_pred))
```
In the code, we first read the resultym.csv table and min-max normalize its TotalPrice column. The data is then split 80/20 into training and test sets and converted to PyTorch tensors. Finally, the LSTM model, loss function, and optimizer are defined and the model is trained.
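One caveat worth making explicit: the model is trained on min-max normalized values, so its predictions are in [0, 1] units; to report real prices the scaling has to be inverted. A small sketch of the round trip, with made-up prices:

```python
import pandas as pd

prices = pd.Series([100.0, 250.0, 400.0])
lo, hi = prices.min(), prices.max()

# Forward: min-max normalization to [0, 1]
norm = (prices - lo) / (hi - lo)
print(norm.tolist())  # [0.0, 0.5, 1.0]

# Inverse: map a normalized prediction back to the price scale
pred_norm = 0.75
pred_price = pred_norm * (hi - lo) + lo
print(pred_price)  # 325.0
```

Storing `lo` and `hi` from the original column before normalizing lets every `y_test_pred` be mapped back with the same formula.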
The code then enters the predict-and-insert loop. The trained model predicts from the last value in the test set, giving y_test_pred. Before inserting a new row, the previous date string is parsed with pd.to_datetime and advanced one month with pd.DateOffset to produce the next 'YYYY/MM' string. The inserted row is {'Date': date_str, 'TotalPrice': 2222222.5}, with TotalPrice fixed at 2222222.5. The prediction is then written into the corresponding TotalPrice cell, the test set and its tensors are refreshed, the model is retrained, and the current prediction is printed.
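The month-extension step can be checked in isolation: parse the 'YYYY/MM' string, step one month forward with pd.DateOffset, and format it back. The same target sequence can also be generated in one call with pd.period_range:

```python
import pandas as pd

# Parse a 'YYYY/MM' string, step one month forward, and format it back
last = pd.to_datetime('2023/02', format='%Y/%m')
next_month = (last + pd.DateOffset(months=1)).strftime('%Y/%m')
print(next_month)  # 2023/03

# The whole target sequence 2023/03 .. 2025/01 generated at once
dates = pd.period_range('2023-03', '2025-01', freq='M').strftime('%Y/%m')
print(dates[0], dates[-1], len(dates))  # 2023/03 2025/01 23
```

pd.DateOffset(months=1) handles year rollover automatically (2023/12 advances to 2024/01), which is why it is preferable to string manipulation here.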
Note that the TotalPrice cell must be updated by row position: a positional assignment such as `df.iloc[row, df.columns.get_loc('TotalPrice')] = value` does this reliably, whereas chained indexing like `df.iloc[row]['TotalPrice'] = value` can silently write into a temporary copy.
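The chained-indexing pitfall is easy to demonstrate on a small frame with made-up numbers: `df.iloc[k]['TotalPrice'] = v` writes into a temporary Series, while `df.iloc[row, col] = v` updates the frame itself.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2023/01', '2023/02'], 'TotalPrice': [1.0, 2.0]})

# Chained indexing: df.iloc[0] returns a temporary Series, so the write is lost
df.iloc[0]['TotalPrice'] = 99.0
print(df['TotalPrice'].iloc[0])  # still 1.0

# Positional assignment by (row, column) index updates the frame in place
df.iloc[0, df.columns.get_loc('TotalPrice')] = 99.0
print(df['TotalPrice'].iloc[0])  # 99.0
```

Recent Pandas versions warn about the first form (SettingWithCopyWarning, or a chained-assignment warning under copy-on-write), which is a good prompt to switch to the positional form.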