An audio based piano performance evaluation method using deep neural network based acoustic modeling

Jing Pan¹, Ming Li¹, Zhanmei Song², Xin Li², Xiaolin Liu², Hua Yi², Manman Zhu²

¹ SYSU-CMU Joint Institute of Engineering, School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
² School of Preschool Education, Shandong Yingcai University, Jinan, China

liming46@mail.sysu.edu.cn, songzhanmei@126.com
Abstract
In this paper, we propose an annotated piano performance evaluation dataset with 185 audio pieces and a method to evaluate the performance of piano beginners based on their audio recordings. The proposed framework consists of three parts: piano key posterior probability extraction, Dynamic Time Warping (DTW) based matching, and performance score regression. First, a deep neural network model is trained to extract 88-dimensional piano key features from the Constant-Q Transform (CQT) spectrum. The proposed acoustic model shows high robustness to the recording environment. Second, we employ the DTW algorithm on the high-level piano key feature sequences to align the input with the template. Based on the alignment, we extract multiple global matching features that reflect the similarity between the input and the template. Finally, we apply linear regression to these matching features, using the expert-annotated scores of the training data, to estimate performance scores for test audio. Experimental results show that our automatic evaluation method achieves an average absolute score error of 2.64 on a 0 to 100 score scale and an average correlation coefficient of 0.73 on our in-house collected YCU-MPPE-II dataset.
Index Terms: Piano Performance Evaluation, Music Analysis, Convolutional Neural Network, Dynamic Time Warping, Computer Assisted Piano Learning
1. Introduction
Nowadays, more and more beginners are trying to learn musical instruments by themselves with on-line resources, and the piano is a popular choice. However, beginners need a lot of practice, and effective practice requires immediate feedback, such as advice from a piano instructor. Since manual performance evaluation is both time consuming and labor intensive, we propose an audio based piano performance evaluation system to offer feedback to beginners. Intuitively, a good performance should be highly similar to the template, and vice versa. The system takes a piano audio recording as input and outputs multiple objective performance evaluation metrics, such as the overall performance score and the mistakes that the performer has made. In this paper, we mainly focus on predicting the expert generated performance score, which serves as overall feedback on the performance.
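To make this template-matching view concrete, the following is only a minimal sketch, not the exact pipeline detailed later in this paper: a standard DTW recursion aligns a performance feature sequence with a template sequence, a couple of toy global matching features are derived, and a linear regression maps them to a score. The feature choices, shapes, and helper names are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def dtw_cost(X, Y):
    """Dynamic Time Warping: align two feature sequences X (Tx x D) and
    Y (Ty x D), returning the normalized cumulative alignment cost."""
    D = cdist(X, Y, metric="euclidean")                # frame-wise distances
    acc = np.full((len(X) + 1, len(Y) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(X) + 1):
        for j in range(1, len(Y) + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                              acc[i, j - 1],      # deletion
                                              acc[i - 1, j - 1])  # match
    return acc[len(X), len(Y)] / (len(X) + len(Y))

def matching_features(performance, template):
    """Toy global matching features: DTW cost and a length ratio."""
    return np.array([dtw_cost(performance, template),
                     len(performance) / len(template)])

def fit_linear_regression(F, scores):
    """Fit regression weights on expert-scored training recordings."""
    F1 = np.hstack([F, np.ones((len(F), 1))])          # add bias term
    w, *_ = np.linalg.lstsq(F1, scores, rcond=None)
    return w

def predict_score(w, feats):
    """Predict a 0-100 style score for one new recording's features."""
    return float(np.append(feats, 1.0) @ w)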
This research was funded in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the Natural Science Foundation of Guangzhou City (201707010363), the Fundamental Research Funds for the Central Universities (15lgjc12), the National Key Research and Development Program (2016YFC0103905), and an IBM Faculty Award.

There have been some efforts on automatic piano performance evaluation. Morita et al. [1] take as input the MIDI sequence generated by an electric piano while the player is playing on the keyboard. The MIDI sequence records the onset, velocity and duration of each music note, which are used to predict the performance score by spline regression; the average correlation coefficient between the system estimated scores and the expert scores is 0.6. Akinaga et al. [2] also take MIDI sequences as input and apply Karhunen-Loève (KL) expansion and the K-nearest neighbors (KNN) algorithm to the interval, velocity and duration of each note to predict the performance score. Existing methods mainly target electric pianos with a MIDI output function. In real life, many pianos cannot generate MIDI files, and the cost of extra MIDI collection equipment also prevents large scale usage. In our case, the system input is the audio signal captured by any microphone, which makes the application useful for all types of pianos and feasible on all mobile devices.
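For illustration only, the note-level attributes these MIDI-based methods operate on (onset, duration, velocity) can be read from a MIDI file roughly as follows; this sketch assumes the pretty_midi library and is not code from [1] or [2].

import pretty_midi

def note_features(midi_path):
    """Collect per-note pitch, onset, duration and velocity from a MIDI file,
    i.e. the raw attributes used by MIDI-based evaluation methods."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    feats = []
    for inst in pm.instruments:
        for note in inst.notes:
            feats.append({
                "pitch": note.pitch,
                "onset": note.start,
                "duration": note.end - note.start,
                "velocity": note.velocity,
            })
    return sorted(feats, key=lambda f: f["onset"])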
Generally, when we try to evaluate the performance in an input audio recording, the music score is a natural ground truth. Therefore, we need to transcribe both the music score and the input audio into MIDI sequences to measure their similarity. Transcribing music audio into a MIDI sequence is itself a well defined task, called Automatic Music Transcription (AMT) [3]. Currently, most proposed AMT methods are based on describing the input spectrogram as a weighted combination of basis spectra corresponding to the music pitches, which can be estimated by Non-negative Matrix Factorization (NMF) and sparse decomposition [4][5]. The unsupervised factorization often yields bases with weak correspondence to the music pitches, which makes the results difficult to interpret. This problem is often addressed by applying harmonic constraints in the training stage [6][7]. The support vector machine has also been applied to AMT by classifying normalized magnitude spectra [8]. Recently, deep learning [9] has been applied to AMT. Dixon and Benetos proposed an end-to-end deep neural network approach with a combined Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) framework, transcribing the spectrogram into the estimated music score [10].
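As a rough illustration of this factorization idea (not the method adopted in this paper), the sketch below approximates a magnitude spectrogram as the product of pitch-like basis spectra and their activations using standard multiplicative NMF updates; the rank, iteration count, and initialization are illustrative assumptions.

import numpy as np

def nmf_transcription_sketch(V, n_pitches=88, n_iter=200, eps=1e-9):
    """Approximate a magnitude spectrogram V (freq_bins x frames) as W @ H,
    where each column of W is a basis spectrum (ideally one per pitch) and
    each row of H is that pitch's activation over time.
    Plain multiplicative updates for the KL-divergence objective."""
    n_bins, n_frames = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_bins, n_pitches)) + eps
    H = rng.random((n_pitches, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# Usage: V could be the magnitude of a CQT or STFT of a piano recording;
# thresholding the rows of H gives a crude piano-roll estimate.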
Since our target is to estimate the performance score rather than to transcribe the melody, we adopt the piano key posterior probabilities (PKPP) generated by the acoustic model as features for the subsequent matching and regression. Low-level features such as the spectrum and MFCCs may be affected by environmental noise, reverberation, and channel mismatch, whereas the output of the AMT system's acoustic model (such as the PKPP in [10]) can be seen as a better approximation of what the performer actually plays, which potentially benefits performance score estimation. We train a convolutional neural network as an acoustic model of the piano sound. The input of the network is the Constant-Q Transform (CQT) [11] spectrum and the