An audio based piano performance evaluation method using deep neural network based acoustic modeling

Jing Pan¹, Ming Li¹, Zhanmei Song², Xin Li², Xiaolin Liu², Hua Yi², Manman Zhu²

¹ SYSU-CMU Joint Institute of Engineering, School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
² School of Preschool Education, Shandong Yingcai University, Jinan, China

liming46@mail.sysu.edu.cn, songzhanmei@126.com
Abstract
In this paper, we propose an annotated piano performance evaluation dataset with 185 audio pieces and a method to evaluate the performance of piano beginners based on their audio recordings. The proposed framework consists of three parts: piano key posterior probability extraction, Dynamic Time Warping (DTW) based matching, and performance score regression. First, a deep neural network model is trained to extract 88-dimensional piano key features from the Constant-Q Transform (CQT) spectrum. The proposed acoustic model shows high robustness to the recording environment. Second, we employ the DTW algorithm on the high-level piano key feature sequences to align the input with the template. Based on the alignment, we extract multiple global matching features that reflect the similarity between the input and the template. Finally, we apply linear regression to these matching features, using the expert-annotated scores of the training data, to estimate performance scores for test audio. Experimental results show that our automatic evaluation method achieves an average absolute score error of 2.64 on a 0 to 100 score scale and an average correlation coefficient of 0.73 on our in-house collected YCU-MPPE-II dataset.
Index Terms: Piano Performance Evaluation, Music Analysis, Convolutional Neural Network, Dynamic Time Warping, Computer Assisted Piano Learning
1. Introduction
Nowadays, more and more beginners are trying to learn musical instruments by themselves with on-line resources, and the piano is a popular choice. However, beginners need a lot of practice, and effective practice requires immediate feedback, such as advice from a piano instructor. Since manual performance evaluation is both time consuming and labor intensive, we propose an audio based piano performance evaluation system to offer feedback to beginners. Intuitively, a good performance should be highly similar to the template, and vice versa. The system takes a piano audio recording as input and outputs multiple objective performance evaluation metrics, such as the overall performance score and the mistakes that the performer has made. In this paper, we mainly focus on predicting the expert generated performance score, which serves as overall feedback on the performance.
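To make this template-matching view concrete, the following is only a minimal sketch, not the exact pipeline detailed later in this paper: a standard DTW recursion aligns a performance feature sequence with a template sequence, a couple of toy global matching features are derived, and a linear regression maps them to a score. The feature choices, shapes, and helper names are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def dtw_cost(X, Y):
    """Dynamic Time Warping: align two feature sequences X (Tx x D) and
    Y (Ty x D), returning the normalized cumulative alignment cost."""
    D = cdist(X, Y, metric="euclidean")                # frame-wise distances
    acc = np.full((len(X) + 1, len(Y) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(X) + 1):
        for j in range(1, len(Y) + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                              acc[i, j - 1],      # deletion
                                              acc[i - 1, j - 1])  # match
    return acc[len(X), len(Y)] / (len(X) + len(Y))

def matching_features(performance, template):
    """Toy global matching features: DTW cost and a length ratio."""
    return np.array([dtw_cost(performance, template),
                     len(performance) / len(template)])

def fit_linear_regression(F, scores):
    """Fit regression weights on expert-scored training recordings."""
    F1 = np.hstack([F, np.ones((len(F), 1))])          # add bias term
    w, *_ = np.linalg.lstsq(F1, scores, rcond=None)
    return w

def predict_score(w, feats):
    """Predict a 0-100 style score for one new recording's features."""
    return float(np.append(feats, 1.0) @ w)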
This research was funded in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the Natural Science Foundation of Guangzhou City (201707010363), the Fundamental Research Funds for the Central Universities (15lgjc12), the National Key Research and Development Program (2016YFC0103905), and an IBM Faculty Award.

There have been some efforts on automatic piano performance evaluation. Morita et al. [1] take as input the MIDI sequence generated by an electric piano while the player is playing on the keyboard. The MIDI sequence records the onset, velocity and duration of each music note, which are used to predict the performance score by spline regression; the average correlation coefficient between the system estimated scores and the expert scores is 0.6. Akinaga et al. [2] also take MIDI sequences as input and apply Karhunen-Loève (KL) expansion and the K-nearest neighbors (KNN) algorithm to the interval, velocity and duration of each note to predict the performance score. Existing methods mainly target electric pianos with a MIDI output function. In real life, many pianos cannot generate MIDI files, and the cost of extra MIDI collection equipment also prevents large scale usage. In our case, the system input is the audio signal captured by any microphone, which makes the application useful for all types of pianos and feasible on all mobile devices.
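For illustration only, the note-level attributes these MIDI-based methods operate on (onset, duration, velocity) can be read from a MIDI file roughly as follows; this sketch assumes the pretty_midi library and is not code from [1] or [2].

import pretty_midi

def note_features(midi_path):
    """Collect per-note pitch, onset, duration and velocity from a MIDI file,
    i.e. the raw attributes used by MIDI-based evaluation methods."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    feats = []
    for inst in pm.instruments:
        for note in inst.notes:
            feats.append({
                "pitch": note.pitch,
                "onset": note.start,
                "duration": note.end - note.start,
                "velocity": note.velocity,
            })
    return sorted(feats, key=lambda f: f["onset"])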
Generally, when we try to evaluate the performance in an input audio recording, the music score is a natural ground truth. Therefore, we need to transcribe both the music score and the input audio into MIDI sequences to measure their similarity. Transcribing music audio into a MIDI sequence is itself a well defined task, called Automatic Music Transcription (AMT) [3]. Currently, most proposed AMT methods are based on describing the input spectrogram as a weighted combination of basis spectra corresponding to the music pitches, which can be estimated by Non-negative Matrix Factorization (NMF) and sparse decomposition [4][5]. The unsupervised factorization often yields bases with weak correspondence to the music pitches, which makes the results difficult to interpret. This problem is often addressed by applying harmonic constraints in the training stage [6][7]. The support vector machine has also been applied to AMT by classifying normalized magnitude spectra [8]. Recently, deep learning [9] has been applied to AMT. Dixon and Benetos proposed an end-to-end deep neural network approach with a combined Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) framework, transcribing the spectrogram into the estimated music score [10].
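As a rough illustration of this factorization idea (not the method adopted in this paper), the sketch below approximates a magnitude spectrogram as the product of pitch-like basis spectra and their activations using standard multiplicative NMF updates; the rank, iteration count, and initialization are illustrative assumptions.

import numpy as np

def nmf_transcription_sketch(V, n_pitches=88, n_iter=200, eps=1e-9):
    """Approximate a magnitude spectrogram V (freq_bins x frames) as W @ H,
    where each column of W is a basis spectrum (ideally one per pitch) and
    each row of H is that pitch's activation over time.
    Plain multiplicative updates for the KL-divergence objective."""
    n_bins, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_bins, n_pitches)) + eps
    H = rng.random((n_pitches, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# Usage: V could be the magnitude of a CQT or STFT of a piano recording;
# thresholding the rows of H gives a crude piano-roll estimate.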
Since our target is to estimate the performance score rather than to transcribe the melody, we adopt the piano key posterior probabilities (PKPP) generated by the acoustic model as features for the subsequent matching and regression. Low-level features such as the spectrum and MFCCs may be affected by environmental noise, reverberation, and channel mismatch, whereas the output of the AMT system's acoustic model (such as the PKPP in [10]) can be seen as a better approximation of what the performer actually plays, which potentially benefits performance score estimation. We train a convolutional neural network as an acoustic model of the piano sound. The input of the network is the Constant-Q Transform (CQT) [11] spectrum and the