Harvest: A High-Performance Fundamental Frequency Estimator

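This paper introduces Harvest, a high-performance fundamental frequency (F0) estimator for extracting F0 from speech signals. The method was proposed by Masanori Morise of the Faculty of Engineering, University of Yamanashi, Japan. Harvest is distinctive in that it obtains a reliable F0 contour and reduces errors in which voiced sections are wrongly identified as unvoiced. It works in two steps: estimating F0 candidates, then generating a reliable F0 contour from those candidates. First, many band-pass filters extract the fundamental component, and basic F0 candidates are obtained from the filtered signals. The basic candidates are then refined and scored using the instantaneous frequency, giving several F0 candidates per frame. Because frame-by-frame processing based on fundamental component extraction is not robust against temporally local noise, the second step applies a connection algorithm that uses neighboring F0 values. The connection exploits the continuity of F0 over time and makes the estimate more robust.

Key points:

1. **Fundamental frequency (F0)**: the lowest frequency component of a periodic sound. For speech it generally corresponds to the vibration rate of the vocal folds and is related to perceived pitch.
2. **The Harvest algorithm**: a high-performance F0 estimator designed for accuracy and robustness, including robustness against temporally local noise.
3. **Two-step estimation**:
   - **Step 1: F0 candidate estimation.** Band-pass filters with different center frequencies extract the fundamental component and yield basic F0 candidates. The candidates are refined and scored using the instantaneous frequency, and several candidates are kept in each frame.
   - **Step 2: generation of a reliable F0 contour.** The F0 values of neighboring frames are connected, using temporal continuity to stabilize the estimate.
4. **Instantaneous frequency**: the frequency of a signal at each point in time, used here to refine the F0 candidates.
5. **Noise robustness**: the connection algorithm over neighboring frames reduces the influence of local noise on the F0 estimate.
6. **Voicing misclassification**: a key advantage of Harvest is that it reduces errors in which voiced sections are judged to be unvoiced.
7. **Applications**: the paper targets speech analysis/synthesis systems and statistical parametric speech synthesis. The algorithm may also be useful in other areas that need accurate F0 estimates.
8. **Author**: Masanori Morise is the sole author.
9. **Affiliation**: Faculty of Engineering, University of Yamanashi, Japan.
10. **Paper structure**: the abstract outlines the principle and advantages of Harvest. The algorithm details and the experimental evaluation appear in the body of the paper.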

Harvest: A high-performance fundamental frequency estimator from speech signals

Masanori Morise

Faculty of Engineering, University of Yamanashi, Japan

mmorise@yamanashi.ac.jp

Abstract

A fundamental frequency (F0) estimator named Harvest is described. The unique points of Harvest are that it can obtain a reliable F0 contour and that it reduces errors in which a voiced section is wrongly identified as unvoiced. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates. In the first step, the algorithm uses fundamental component extraction by many band-pass filters with different center frequencies and obtains the basic F0 candidates from the filtered signals. After that, the basic F0 candidates are refined and scored by using the instantaneous frequency, and then several F0 candidates in each frame are estimated. Since the frame-by-frame processing based on the fundamental component extraction is not robust against temporally local noise, a connection algorithm using neighboring F0s is used in the second step. The connection takes advantage of the fact that the F0 contour does not change precipitously in a short interval. We carried out an evaluation using two speech databases with electroglottograph (EGG) signals to compare Harvest with several state-of-the-art algorithms. Results showed that Harvest achieved the best performance of all algorithms.

Index Terms: speech analysis, fundamental frequency, fundamental component, instantaneous frequency
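To illustrate the idea behind the connection step described in the abstract, the following minimal sketch picks one F0 per frame from several candidates while preferring continuity with the previous frame. It is a greedy illustration only, not the connection procedure of Harvest itself. The function name `connect_candidates`, the assumption that candidates arrive sorted best-scored first, and the 10% jump limit `max_jump` are all hypothetical choices made for this example.

```python
def connect_candidates(frames, max_jump=0.1):
    """Pick one F0 per frame, preferring continuity with the previous frame.

    frames   : list of per-frame candidate lists in Hz, best-scored first
               (an empty list means no candidate was found in that frame).
    max_jump : largest relative frame-to-frame F0 change treated as
               continuous (hypothetical value, not taken from the paper).
    Returns a list with one F0 per frame; 0.0 marks an unvoiced frame.
    """
    contour = []
    prev = None  # F0 chosen in the previous frame, None after a gap
    for cands in frames:
        if not cands:
            contour.append(0.0)
            prev = None
            continue
        if prev is None:
            # No history: start a new segment from the top-scored candidate.
            choice = cands[0]
        else:
            near = [c for c in cands if abs(c - prev) / prev <= max_jump]
            # Prefer the in-range candidate closest to the previous F0;
            # if none is in range, restart from the top-scored candidate.
            choice = min(near, key=lambda c: abs(c - prev)) if near else cands[0]
        contour.append(choice)
        prev = choice
    return contour


# Frame 2 ranks an octave error (242 Hz) first; continuity selects 121 Hz.
frames = [[120.0, 240.0], [242.0, 121.0], [122.0], [], [200.0, 100.0]]
contour = connect_candidates(frames)
print(contour)
```

In this toy input, the second frame ranks an octave error first, and the continuity constraint recovers the candidate near the previous F0. This is the kind of temporally local error that a frame-by-frame decision cannot correct.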

1. Introduction

Research on speech synthesis such as statistical parametric speech synthesis (SPSS) [1] has recently been advancing, and such synthesis requires a high-performance speech analyzer to improve the sound quality. Speech parameters (fundamental frequency (F0), spectral envelope, and aperiodicity) are widely used for SPSS. Since SPSS requires a huge amount of speech data for training, a high-performance speech analyzer would be useful not only to improve the sound quality but also to avoid having to perform post-processing by hand. Many speech analyzers are now available, and the appropriate one depends on the purpose of the research. For example, real-time voice conversion [2] requires a real-time F0 estimator, whereas SPSS generally prioritizes estimation accuracy over computational cost. In this study, we focus on a high-performance F0 estimator named Harvest for a speech analysis/synthesis system and SPSS.

In recent SPSS, deep neural networks (DNNs) [3] utilizing continuous F0 modeling [4] have been used. This F0 modeling assigns an F0 value to unvoiced sections by interpolation, such as spline interpolation [5]. The F0 estimator preferred for this modeling should be able to give a smooth F0 contour to all frames. Harvest is therefore designed to reduce errors in which a voiced section is wrongly identified as unvoiced.
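As an illustration of continuous F0 modeling, the following minimal sketch fills unvoiced frames by interpolating between the surrounding voiced frames. Linear interpolation is used here in place of the spline interpolation mentioned above, purely to keep the example short. The function name `fill_unvoiced`, the convention that unvoiced frames are stored as 0, and the edge handling are assumptions made for this example.

```python
def fill_unvoiced(f0):
    """Return a copy of f0 (Hz per frame) with unvoiced frames filled in.

    Unvoiced frames are assumed to be stored as 0. Interior gaps are filled
    by linear interpolation between the nearest voiced frames; leading and
    trailing gaps hold the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)  # nothing to anchor the interpolation on
    out = list(f0)
    # Edges: hold the first and last voiced values.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interior gaps: straight line between the surrounding voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            w = (i - left) / gap
            out[i] = (1 - w) * f0[left] + w * f0[right]
    return out


contour = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]
filled = fill_unvoiced(contour)
print(filled)
```

The quality of such a filled contour depends on the voiced frames that anchor it: a voiced frame misjudged as unvoiced is replaced by an interpolated value instead of its true F0, which is why the voiced-as-unvoiced error matters for this kind of modeling.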

In Section 2 of this paper, we discuss work related to F0 estimation and give an overview of the proposed algorithm (Harvest). In Section 3, we explain the details of Harvest, and in Section 4, we perform an evaluation comparing Harvest with several state-of-the-art F0 estimators and discuss the results. We conclude in Section 5 with a brief summary and a mention of future work.

2. Related works on F0 estimation

F0 is defined as the reciprocal of the shortest period of glottal vibration. Many methods for estimating F0 have been proposed for various purposes. Conventional F0 estimators have used waveform features and the power spectrum [6]. Among the waveform-based algorithms, the average magnitude difference function [7] and weighted autocorrelation [8] have been proposed. YIN [9] is a major estimator, and an improved version was developed [10] in 2014. As for the power-spectrum-based algorithms, methods based on the cepstrum [11, 12] have been popular, and SWIPE [13] was recently proposed as a high-performance F0 estimator.

Which F0 estimator to use depends on the purpose of the study. For real-time speech analysis/synthesis applications [14, 2], DIO [15] and its improved version [16] have been proposed. For high-quality speech analysis/synthesis systems, NDF [17] used in STRAIGHT [18] and XSX used in TANDEM-STRAIGHT [19, 20] are preferred. In particular, pitch synchronous analysis [21] can improve the estimation performance in the spectral envelope and aperiodicity estimation. The estimation accuracy is important in cases where the F0 is used as the input for estimating other speech parameters. CheapTrick [22, 23] used in WORLD [24], F0-adaptive multi-frame integration analysis [25], and D4C [26] require a high-performance F0 estimator. For automatic speech recognition, since the system is often used in noisy environments, a robust F0 estimator [27] would be useful.

Harvest is proposed for high-quality speech analysis/synthesis systems and for SPSS. In particular, since continuous F0 modeling [4] assigns an F0 value to unvoiced sections, Harvest attempts to reduce the number of frames judged to be unvoiced and to give them a reliable F0. The basic idea of Harvest is based on the event-based F0 estimator [28] and utilizes fundamental component extraction by filtering [15]. It consists of two steps: estimation of F0 candidates and generation of a reliable F0 contour on the basis of these candidates.
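The first step can be illustrated with a minimal sketch of fundamental component extraction by a single band-pass filter, followed by an event-based F0 estimate. The filter design (a Hann-windowed cosine spanning four cycles of the center frequency) and the choice of rising zero crossings as the events are assumptions made for this example, as are all function names. The sketch does not reproduce the filter, the scoring, or the refinement by instantaneous frequency used in Harvest.

```python
import math


def bandpass_kernel(fc, fs, periods=4):
    """Hann-windowed cosine centred on fc (Hz), spanning `periods` cycles."""
    n = int(round(periods * fs / fc)) | 1  # force an odd length
    mid = n // 2
    return [
        (0.5 + 0.5 * math.cos(2 * math.pi * (k - mid) / n))
        * math.cos(2 * math.pi * fc * (k - mid) / fs)
        for k in range(n)
    ]


def filter_valid(x, h):
    """Filter x with the symmetric kernel h, keeping fully overlapped samples."""
    n = len(h)
    return [sum(h[j] * x[i + j] for j in range(n)) for i in range(len(x) - n + 1)]


def rising_zero_crossings(y):
    """Fractional sample positions where y goes from negative to non-negative."""
    pos = []
    for i in range(len(y) - 1):
        if y[i] < 0.0 <= y[i + 1]:
            # Linear interpolation between the two samples around the crossing.
            pos.append(i + y[i] / (y[i] - y[i + 1]))
    return pos


def f0_candidate(x, fs, fc):
    """One F0 candidate (Hz) from the band centred on fc, or None."""
    zc = rising_zero_crossings(filter_valid(x, bandpass_kernel(fc, fs)))
    if len(zc) < 2:
        return None
    mean_interval = (zc[-1] - zc[0]) / (len(zc) - 1)  # in samples
    return fs / mean_interval


# Synthetic voiced signal: 120 Hz fundamental plus two harmonics.
fs = 8000
x = [
    math.sin(2 * math.pi * 120 * n / fs)
    + 0.8 * math.sin(2 * math.pi * 240 * n / fs)
    + 0.5 * math.sin(2 * math.pi * 360 * n / fs)
    for n in range(1600)
]
candidate = f0_candidate(x, fs, fc=110.0)
print(round(candidate, 1))
```

A filter centered near the true F0 suppresses the harmonics, so the output is close to a sinusoid whose zero-crossing interval gives the period. Because the true F0 is unknown in advance, this procedure would be repeated over many center frequencies to obtain several candidates per frame.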

3. Algorithm details

We explain the details of Harvest with specific values in parameters. These values were determined after tuning to minimize the error rate in a speech database. Harvest requires a 1-ms frame shift for estimation, but users can obtain the F0 with an arbitrary frame shift by interpolation.
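Obtaining an arbitrary frame shift from the 1-ms contour can be sketched as follows. The text above states only that interpolation is used, so the linear scheme, the handling of voiced/unvoiced boundaries, and the name `resample_f0` are assumptions made for this example.

```python
def resample_f0(f0_1ms, frame_shift_ms):
    """Sample an F0 contour estimated every 1 ms at a new frame shift.

    f0_1ms         : F0 in Hz at t = 0, 1, 2, ... ms (0 marks unvoiced).
    frame_shift_ms : requested frame shift in ms (must be positive).
    Between two voiced frames the value is linearly interpolated. If either
    neighbour is unvoiced, the nearer frame is copied so that a 0 is never
    blended with a real F0 value.
    """
    out = []
    last = len(f0_1ms) - 1
    k = 0
    while k * frame_shift_ms <= last:
        t = k * frame_shift_ms       # output time in ms
        i = int(t)                   # 1-ms frame at or before t
        j = min(i + 1, last)         # next 1-ms frame
        w = t - i                    # fractional position between i and j
        a, b = f0_1ms[i], f0_1ms[j]
        if a > 0 and b > 0:
            out.append((1 - w) * a + w * b)
        else:
            out.append(a if w < 0.5 else b)
        k += 1
    return out


# 11 frames covering 0..10 ms, with an unvoiced gap from 5 ms to 9 ms.
f0 = [100.0, 101.0, 102.0, 103.0, 104.0, 0.0, 0.0, 0.0, 0.0, 0.0, 110.0]
coarse = resample_f0(f0, 2.5)
print(coarse)
```

When the requested shift is an integer number of milliseconds, this reduces to picking every n-th frame of the 1-ms contour.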

INTERSPEECH 2017

August 20–24, 2017, Stockholm, Sweden

http://dx.doi.org/10.21437/Interspeech.2017-68
