Harvest: A high-performance fundamental frequency estimator from
speech signals
Masanori Morise
Faculty of Engineering, University of Yamanashi, Japan
mmorise@yamanashi.ac.jp
Abstract
A fundamental frequency (F0) estimator named Harvest is
described. Harvest has two distinguishing features: it obtains a
reliable F0 contour, and it reduces the error in which a voiced
section is wrongly identified as unvoiced. It consists
of two steps: estimation of F0 candidates and generation of
a reliable F0 contour on the basis of these candidates. In the
first step, the algorithm uses fundamental component extrac-
tion by many band-pass filters with different center frequencies
and obtains the basic F0 candidates from filtered signals. Af-
ter that, basic F0 candidates are refined and scored by using
the instantaneous frequency, and then several F0 candidates in
each frame are estimated. Since the frame-by-frame process-
ing based on the fundamental component extraction is not ro-
bust against temporally local noise, a connection algorithm us-
ing neighboring F0s is used in the second step. The connection
takes advantage of the fact that the F0 contour does not precip-
itously change in a short interval. We carried out an evalua-
tion using two speech databases with electroglottograph (EGG)
signals to compare Harvest with several state-of-the-art algo-
rithms. Results showed that Harvest achieved the best perfor-
mance of all algorithms.
Index Terms: speech analysis, fundamental frequency, funda-
mental component, instantaneous frequency
1. Introduction
Research on speech synthesis such as statistical parametric
speech synthesis (SPSS) [1] has recently been advancing, and
such synthesis requires a high-performance speech analyzer to
improve the sound quality. Speech parameters (fundamental
frequency (F0), spectral envelope, and aperiodicity) are widely
used for SPSS. Since SPSS requires a huge amount of speech
data for training, a high-performance speech analyzer would be
useful not only to improve the sound quality but also to avoid
having to perform post-processing by hand. Many speech
analyzers are now available, and the appropriate
one depends on the purpose of the research. For example,
real-time voice conversion [2] requires a real-time F0 estima-
tor, whereas SPSS generally prioritizes the estimation accuracy
rather than the computational cost. In this study, we focus on
a high-performance F0 estimator named Harvest for a speech
analysis/synthesis system and SPSS.
In recent SPSS, deep neural networks (DNNs) [3] utilizing
continuous F0 modeling [4] have been used. This F0 modeling
gives a certain F0 to the unvoiced section by an interpolation
such as spline interpolation [5]. The F0 estimator preferred for
this modeling should have a function that gives a smooth F0
contour to all frames. Harvest is therefore designed to reduce
the error in which a voiced section is wrongly identified as
unvoiced.
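To illustrate what continuous F0 modeling asks of a contour, the following is a minimal sketch that fills unvoiced frames by interpolation. It makes several assumptions that are not taken from this paper: unvoiced frames are coded as 0.0, linear interpolation is used in place of the spline interpolation of [5], and the nearest voiced value is held at the edges. The function name `fill_unvoiced` is hypothetical.

```python
def fill_unvoiced(f0):
    """Return a continuous contour by interpolating across unvoiced frames.

    Assumption: an unvoiced frame is coded as 0.0. Linear interpolation is
    used here for brevity; it stands in for the spline interpolation of [5].
    """
    voiced = [i for i, value in enumerate(f0) if value > 0.0]
    if not voiced:
        return list(f0)  # no voiced frame to interpolate from

    out = list(f0)
    # Hold the nearest voiced value before the first and after the last
    # voiced frame.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate linearly inside every gap between two voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out


contour = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]
continuous = fill_unvoiced(contour)
```

The longer the unvoiced gaps are, the more of the resulting contour is interpolated rather than measured, which is why an estimator that wrongly labels voiced frames as unvoiced is undesirable for this kind of modeling.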
In Section 2 of this paper, we discuss work related to
F0 estimation and give an overview of the proposed algorithm
(Harvest). In Section 3, we explain the details of Harvest, and
in Section 4, we perform an evaluation comparing Harvest with
several state-of-the-art F0 estimators and discuss the results. We
conclude in Section 5 with a brief summary and a mention of
future work.
2. Related works on F0 estimation
F0 is defined as the inverse of the shortest period of glottal vibration. Many
methods for estimating F0 have been proposed for the var-
ious purposes required. Conventional F0 estimators have
used waveform features and the power spectrum [6]. Among
the waveform-based algorithms, average magnitude difference
function [7] and weighted auto-correlation [8] have been pro-
posed. YIN [9] is a major estimator, and an improved version
was developed [10] in 2014. As for the power-spectrum-based
algorithms, methods based on cepstrum [11, 12] have been
popular, and SWIPE [13] was recently proposed as a
high-performance F0 estimator.
Which F0 estimator to use depends on the purpose of study.
For real-time speech analysis/synthesis applications [14, 2],
DIO [15] and its improved version [16] have been proposed. For
high-quality speech analysis/synthesis systems, NDF [17] used
in STRAIGHT [18] and XSX used in TANDEM-STRAIGHT
[19, 20] are preferred. In particular, pitch synchronous analy-
sis [21] can improve the estimation performance in the spectral
envelope and aperiodicity estimation. The estimation accuracy
is important in cases where the F0 is used as the input for es-
timating other speech parameters. CheapTrick [22, 23] used
in WORLD [24], F0-adaptive multi-frame integration analysis
[25], and D4C [26] require a high-performance F0 estimator.
For automatic speech recognition, since the system is often used
in noisy environments, a robust F0 estimator [27] would be use-
ful.
Harvest is proposed for high-quality speech analysis/synthesis
systems and for SPSS. In particular, since continuous F0
modeling [4] gives a certain F0 to the unvoiced section,
Harvest attempts to reduce the number of frames labeled
unvoiced and give them a reliable F0. The basic idea of Harvest
is based on the event-based F0 estimator [28] and utilizes
fundamental component extraction by filtering [15]. It consists
of two steps: estimation of F0 candidates and generation of a
reliable F0 contour on the basis of these candidates.
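The idea of fundamental component extraction by filtering can be sketched for a single channel as follows. This is an illustration under stated assumptions, not the Harvest implementation: the filter shape (a cosine-modulated Hann window about four periods long), the use of rising zero crossings to measure the period, and all function names are hypothetical choices made for this example. An actual estimator would run many such channels with different center frequencies and then refine and score the resulting candidates.

```python
import math


def bandpass_kernel(center_hz, fs):
    """Cosine-modulated Hann window (assumed shape, about four periods long)."""
    half = int(round(2.0 * fs / center_hz))
    return [
        (0.5 + 0.5 * math.cos(math.pi * n / (half + 1)))
        * math.cos(2.0 * math.pi * center_hz * n / fs)
        for n in range(-half, half + 1)
    ]


def filter_valid(signal, kernel):
    """Convolve, keeping only outputs where the kernel fully overlaps."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def rising_zero_crossings(x):
    """Fractional sample positions where x crosses zero going upward."""
    positions = []
    for i in range(len(x) - 1):
        if x[i] < 0.0 <= x[i + 1]:
            # Linear interpolation between the two neighboring samples.
            positions.append(i + x[i] / (x[i] - x[i + 1]))
    return positions


def basic_candidate(signal, fs, center_hz):
    """F0 candidate (Hz) from one band-pass channel, or None if unavailable."""
    filtered = filter_valid(signal, bandpass_kernel(center_hz, fs))
    crossings = rising_zero_crossings(filtered)
    if len(crossings) < 2:
        return None
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return fs / mean_period


# Synthetic test signal: a 125-Hz fundamental plus two harmonics.
fs = 8000
f0_true = 125.0
signal = [
    sum(a * math.sin(2.0 * math.pi * f0_true * h * n / fs)
        for h, a in ((1, 1.0), (2, 0.8), (3, 0.6)))
    for n in range(800)
]
# The channel is centered at 120 Hz, slightly off the true F0.
candidate = basic_candidate(signal, fs, 120.0)
```

In this sketch the channel is deliberately centered slightly away from the true F0. The filter suppresses the harmonics, so the period of the filtered signal, rather than the channel's center frequency, determines the candidate.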
3. Algorithm details
We explain the details of Harvest with specific parameter
values. These values were determined after tuning to minimize
the error rate in a speech database. Harvest requires a 1-ms
frame shift for estimation, but users can obtain the F0 with an
arbitrary frame shift by interpolation.
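Obtaining an arbitrary frame shift from the 1-ms contour can be sketched as below. This is only an assumption about how such interpolation might be done: linear interpolation is used between voiced frames, a 0.0 value is assumed to mark an unvoiced frame, and the nearer frame is copied when either neighbor is unvoiced so that 0.0 is never blended into a voiced value. The function name `resample_f0` is hypothetical.

```python
def resample_f0(f0_1ms, frame_shift_ms):
    """Resample an F0 contour estimated every 1 ms to a new frame shift.

    Assumptions: frame i of f0_1ms sits at time i ms, and 0.0 marks an
    unvoiced frame.
    """
    if frame_shift_ms <= 0.0:
        raise ValueError("frame_shift_ms must be positive")
    if not f0_1ms:
        return []

    last = len(f0_1ms) - 1
    out = []
    k = 0
    while k * frame_shift_ms <= last:
        t = k * frame_shift_ms  # time in ms, also a fractional frame index
        i = int(t)
        j = min(i + 1, last)
        w = t - i
        a, b = f0_1ms[i], f0_1ms[j]
        if a > 0.0 and b > 0.0:
            out.append((1.0 - w) * a + w * b)
        else:
            # Do not interpolate across a voiced/unvoiced boundary.
            out.append(a if w < 0.5 else b)
        k += 1
    return out


f0_1ms = [100.0 + i for i in range(11)]  # frames at 0, 1, ..., 10 ms
f0_2p5ms = resample_f0(f0_1ms, 2.5)      # frames at 0, 2.5, 5, 7.5, 10 ms
```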
Copyright © 2017 ISCA
INTERSPEECH 2017
August 20–24, 2017, Stockholm, Sweden
http://dx.doi.org/10.21437/Interspeech.2017-68