A PyTorch-based speech emotion recognition system: code and results
Posted: 2023-08-08 15:14:03
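Below is example code for a PyTorch-based speech emotion recognition system, covering data preprocessing, model construction, and training. The dataset is RAVDESS, which contains speech recordings from 24 actors, each performing 8 different emotional states. The model combines a convolutional neural network (CNN) with a long short-term memory network (LSTM) for feature extraction and classification.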
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import librosa
import numpy as np
import os
# Hyperparameters and device selection
batch_size = 32
num_epochs = 50
learning_rate = 0.001
num_classes = 8
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
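# Dataset: loads one audio file per item and returns (MFCC matrix, emotion label)
class AudioDataset(Dataset):
    def __init__(self, data_path, max_frames=40):
        self.data_path = data_path
        # Keep only .wav files so stray files do not break label parsing
        self.file_list = sorted(f for f in os.listdir(data_path) if f.endswith(".wav"))
        # Fixed number of time frames per item; increase for more temporal context
        self.max_frames = max_frames

    def __getitem__(self, index):
        file_path = os.path.join(self.data_path, self.file_list[index])
        y, sr = librosa.load(file_path, sr=None, mono=True)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, n_frames)
        # Truncate long clips, then zero-pad short ones, so every item is (40, max_frames).
        # Without the truncation, clips longer than max_frames give a negative pad width.
        mfccs = mfccs[:, :self.max_frames]
        pad_width = self.max_frames - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        # RAVDESS filenames carry the emotion code (01-08) in the third dash-separated field
        label = int(self.file_list[index].split("-")[2])
        return torch.from_numpy(mfccs).float(), torch.LongTensor([label - 1])

    def __len__(self):
        return len(self.file_list)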
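# Model: two Conv-BN-ReLU-Pool blocks, then a 2-layer LSTM and a linear classifier
class AudioNet(nn.Module):
    def __init__(self):
        super(AudioNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.bn1 = nn.BatchNorm2d(32)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.bn2 = nn.BatchNorm2d(64)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        # Two 2x2 poolings shrink the 40 MFCC bins to 10, so each time step
        # carries 64 channels * 10 bins = 640 features
        self.lstm = nn.LSTM(input_size=64 * 10, hidden_size=128, num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (batch, 1, 40, frames)
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.pool2(x)  # (batch, 64, 10, frames // 4)
        # Move time to dimension 1; permute returns a non-contiguous tensor,
        # so make it contiguous before calling view
        x = x.permute(0, 3, 1, 2).contiguous()  # (batch, frames // 4, 64, 10)
        x = x.view(x.size(0), x.size(1), -1)  # (batch, frames // 4, 640)
        out, _ = self.lstm(x)
        out = out[:, -1, :]  # LSTM output at the last time step
        out = self.fc1(out)
        return out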
# Load the datasets
train_dataset = AudioDataset("path/to/training/data")
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_dataset = AudioDataset("path/to/testing/data")
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
# Initialize the model, loss function, and optimizer
model = AudioNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Train the model, evaluating on the test set at the end of every epoch
for epoch in range(num_epochs):
    model.train()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.unsqueeze(1).to(device)  # add channel dim: (batch, 1, 40, 40)
        labels = labels.view(-1).to(device)  # (batch,), also safe for a batch of one
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if (i + 1) % 10 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch + 1, num_epochs, i + 1, len(train_loader), loss.item()))
    # Evaluate the model on the test set
    model.eval()
    with torch.no_grad():
        total_correct = 0
        total_samples = 0
        for inputs, labels in test_loader:
            inputs = inputs.unsqueeze(1).to(device)
            labels = labels.view(-1).to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total_samples += labels.size(0)
            total_correct += (predicted == labels).sum().item()
    print('Test Accuracy of the model on the {} test samples: {:.2f}%'
          .format(total_samples, 100 * total_correct / total_samples))
```
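In the code above, we first define an `AudioDataset` class for loading the data. In its `__getitem__` method, we read each audio file with librosa and extract MFCC (Mel-frequency cepstral coefficient) features. We then bring the time axis of the MFCC matrix to a fixed length of 40 frames and wrap the result in a PyTorch `Tensor`; the emotion label, parsed from the filename, is wrapped in a second `Tensor`. The `AudioNet` class defines the CNN and LSTM layers used for feature extraction and classification. Finally, we train the model with the Adam optimizer and a cross-entropy loss.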
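The label parsing in `__getitem__` depends on the RAVDESS filename convention, in which the third dash-separated field is the emotion code (01 to 08). Here is a minimal sketch of that step in isolation, assuming the standard RAVDESS naming scheme. The helper `parse_ravdess_emotion` and the `EMOTIONS` table are illustrative names, not part of the code above.

```python
# Emotion names in RAVDESS code order (code 01 -> index 0, ..., code 08 -> index 7).
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def parse_ravdess_emotion(filename):
    """Return (zero_based_label, emotion_name) for a RAVDESS-style filename.

    Hypothetical helper: it mirrors `int(name.split("-")[2]) - 1` from the
    dataset class and adds basic format and range checks.
    """
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError("unexpected filename format: {!r}".format(filename))
    code = int(fields[2])  # third field is the emotion code
    if not 1 <= code <= len(EMOTIONS):
        raise ValueError("emotion code out of range: {}".format(code))
    return code - 1, EMOTIONS[code - 1]


label, name = parse_ravdess_emotion("03-01-06-01-02-01-12.wav")
print(label, name)  # 5 fearful
```

Subtracting 1 is needed because `nn.CrossEntropyLoss` expects class indices in the range 0 to `num_classes - 1`.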
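During training, we use PyTorch's `DataLoader` class to split the dataset into mini-batches, which speeds up training. For each mini-batch, we add a channel dimension to the MFCC tensor to make it four-dimensional and move it to the GPU when one is available. We then compute the outputs and the loss, and update the model parameters through backpropagation. At the end of each epoch, we run the model on the test set and compute its accuracy.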
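The shapes of that four-dimensional input can be checked by hand with the standard output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The sketch below traces one axis of a 40 x 40 MFCC input through the two conv/pool stages, assuming the kernel, stride, and padding values used in `AudioNet`. The helpers `out_size` and `trace_axis` are illustrative names.

```python
def out_size(n, kernel, stride, padding=0):
    """Output length of one axis after a conv or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_axis(n):
    """Follow one spatial axis through conv1 -> pool1 -> conv2 -> pool2."""
    n = out_size(n, kernel=3, stride=1, padding=1)  # conv1 keeps the size
    n = out_size(n, kernel=2, stride=2)             # pool1 halves it
    n = out_size(n, kernel=3, stride=1, padding=1)  # conv2 keeps the size
    n = out_size(n, kernel=2, stride=2)             # pool2 halves it again
    return n


freq_bins = trace_axis(40)     # MFCC axis: 40 -> 10
time_steps = trace_axis(40)    # frame axis: 40 -> 10
lstm_features = 64 * freq_bins # 64 channels flattened with the frequency bins

print(freq_bins, time_steps, lstm_features)  # 10 10 640
```

With 64 output channels, flattening channels and frequency bins gives 640 features per time step, which is the value the LSTM's `input_size` has to match; the sequence length seen by the LSTM is 10.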
Example output:
```
Epoch [1/50], Step [10/158], Loss: 2.0748
Epoch [1/50], Step [20/158], Loss: 1.7235
Epoch [1/50], Step [30/158], Loss: 1.4923
...
Epoch [50/50], Step [130/158], Loss: 0.0102
Epoch [50/50], Step [140/158], Loss: 0.0296
Epoch [50/50], Step [150/158], Loss: 0.0214
Test Accuracy of the model on the 192 test samples: 80.21%
```
In this example, we trained for 50 epochs and reached 80.21% accuracy on the test set.
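The percentage comes from the two counters in the evaluation loop: `100 * total_correct / total_samples`. As a worked check, 80.21% of 192 samples corresponds to 154 correct predictions. That count is inferred from the printed figure and does not appear in the original output. The function name `accuracy_percent` is illustrative.

```python
def accuracy_percent(total_correct, total_samples):
    """Accuracy as a percentage, as computed in the evaluation loop."""
    if total_samples == 0:
        raise ValueError("total_samples must be positive")
    return 100 * total_correct / total_samples


# 154 correct out of 192 reproduces the reported figure.
print("{:.2f}%".format(accuracy_percent(154, 192)))  # 80.21%

# Baseline for comparison: uniform guessing over 8 classes.
print("{:.2f}%".format(accuracy_percent(1, 8)))  # 12.50%
```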