Application of Pitch-State-Based Dictionary Design to Single-Channel Speech Separation

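The research paper "A Pitch State Dependent Dictionary Design Method for Single-Channel Speech Separation" presents a new pitch-state-based dictionary design method for single-channel speech separation. The method targets speech separation in multi-speaker environments, in particular when only a single microphone input is available. The authors are Haiyan Guo, Zhen Yang, Linghua Zhang, and Lei Ye, affiliated with the Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education) at Nanjing University of Posts and Telecommunications and with the School of Information Science and Engineering at Southeast University.

In the paper, dictionary design is divided into two stages: sub-dictionary learning and sub-dictionary concatenation. In the sub-dictionary learning stage, each speaker's pitch state information is used to learn a set of discriminative sub-dictionaries for that speaker in the time domain. Specifically, each sub-dictionary is a matrix whose columns are the speaker's training frames with similar pitch states. This makes the dictionary sensitive to pitch variation, which helps distinguish the voice characteristics of different speakers. To further reduce the sub-dictionary size, the authors apply a frequent pattern mining technique, which identifies and extracts the most representative speech patterns and thereby shrinks the dictionary while improving the efficiency and accuracy of separation.

In the sub-dictionary concatenation stage, the authors propose selecting an appropriate weight pair to match the learned sub-dictionaries. This step presumably balances the contributions of the different sub-dictionaries so that each speaker's speech is identified and separated accurately.

The study introduces pitch state information into dictionary design and thereby offers a more effective solution for single-channel speech separation. The method accounts for the dynamic characteristics of speech and relies on data-driven optimization, so it may improve the performance of speech processing systems, particularly for speech recognition and separation in complex environments. It has theoretical and practical value for developing better speech processing algorithms, especially for the Internet of Things and smart devices.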

A Pitch State Dependent Dictionary Design Method for Single-Channel Speech Separation

Haiyan Guo, Zhen Yang, Linghua Zhang, Lei Ye

1. Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing University of Posts and Telecommunications, Nanjing, China

2. School of Information Science and Engineering, Southeast University, Nanjing, China

Abstract—In this paper, we propose to design a new pitch state dependent dictionary to perform single-channel speech separation. The dictionary is designed in two stages, which are sub-dictionary learning and sub-dictionary concatenation. In sub-dictionary learning, pitch state information is taken into account to learn a set of discriminative sub-dictionaries for each speaker in the time domain. To be specific, each sub-dictionary is generated as a matrix composed of the speaker's training frames of similar pitch states as columns. Moreover, we utilize a frequent pattern mining method to further reduce the sub-dictionary size. In sub-dictionary concatenation, we propose to select an appropriate weight pair to match the learned sub-dictionaries to generate a dictionary for separation. Experimental results show that the proposed method achieves better overall performance than two dictionary-based methods and a source-filter-based method that also uses pitch information.

Keywords—speech separation; sparse decomposition; dictionary; data mining

I. INTRODUCTION

Single-channel speech separation (SCSS) is considered the most difficult speech separation problem because no information about the mixing matrix is available. Existing approaches generally fall into two groups: computational auditory scene analysis (CASA) [1-3] and model-based methods [4-14]. CASA tries to achieve human performance based on the perceptual organization of speech. Model-based methods use pre-trained models or dictionaries to incorporate prior information that helps separation. Dictionary-based methods, which perform separation by mapping a mixture onto a dictionary, can be considered a type of model-based method. Recent work shows that exploiting sparsity can improve the separation performance of dictionary-based methods [15-18].

Dictionary-based methods must solve two main problems: sub-dictionary learning and sparse decomposition. Sub-dictionary learning generally learns a uniform dictionary to represent all the source frames of a speaker, e.g. by non-negative matrix factorization (NMF) [15-16, 19-20] or K-SVD [21]. In this way, mixed frames are separated mainly on the basis of the speakers' characteristics. However, other important features can also help separation, e.g. pitch information. Therefore, in this paper, we propose a sub-dictionary learning method that takes pitch state information into account as well. To be specific, we propose to learn a set of discriminative sub-dictionaries for each speaker. Each sub-dictionary is generated as a matrix composed of the speaker's training frames of similar pitch states as columns. A frequent pattern mining method is then utilized to reduce the sub-dictionary size for computational convenience.
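The grouping step described above can be illustrated with a minimal sketch. It assumes that every training frame already carries a quantized pitch state label; how those labels are estimated, and the frequent pattern mining step that later shrinks each sub-dictionary, are not shown. The function name `build_sub_dictionaries` and the toy data are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def build_sub_dictionaries(frames, pitch_states):
    """Group one speaker's training frames by pitch state.

    frames: list of equal-length lists of floats (time-domain frames).
    pitch_states: one hashable label per frame (assumed to be given).
    Returns {state: matrix}, where each matrix is a list of rows and the
    training frames sharing that state form its columns.
    """
    if len(frames) != len(pitch_states):
        raise ValueError("need exactly one pitch state per frame")
    grouped = defaultdict(list)
    for frame, state in zip(frames, pitch_states):
        grouped[state].append(list(frame))
    # Transpose each group so that the frames become matrix columns.
    return {state: [list(row) for row in zip(*cols)]
            for state, cols in grouped.items()}

# Toy data: four frames of length 3 with two pitch states.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
states = [0, 1, 0, 1]
sub_dicts = build_sub_dictionaries(frames, states)
```

Here `sub_dicts[0]` is a 3 x 2 matrix whose columns are the first and third frames, so each pitch state of the speaker gets its own sub-dictionary rather than one uniform dictionary covering all frames.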

Sparse decomposition aims to find the sparsest representation of the mixture over the dictionary that is the union of the learned sub-dictionaries. Each source is estimated by computing the part that falls into the corresponding sub-dictionary. Therefore, it is unavoidable that some energy of a source $s_i$ is represented over a cross sub-dictionary $D_j$, $j \neq i$. To suppress this cross representation, a discriminative dictionary learning (DDL) based method is proposed in [18]. It takes the relationship between the sub-dictionaries into account and optimizes the sub-dictionaries jointly to learn a structured dictionary. In [20], a discriminative NMF learning method is proposed that optimizes all basis vectors jointly to reconstruct both clean and mixed signals well. In this paper, we propose to further suppress cross representation by concatenating the sub-dictionaries appropriately. Specifically, we propose to select an appropriate weight pair to match the learned sub-dictionaries to generate a dictionary for separation.
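The decomposition idea above can be sketched as follows. Plain matching pursuit is used here only as a stand-in sparse solver, since this excerpt does not say which solver the authors use, and the weight pair applied during concatenation is not modeled. All function names and the toy atoms are hypothetical.

```python
import math

def normalize(atom):
    """Scale an atom to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in atom))
    return [v / norm for v in atom]

def matching_pursuit(y, atoms, n_iter):
    """Greedy sparse coding of y over unit-norm atoms; returns coefficients."""
    residual = list(y)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        # Pick the atom most correlated with the current residual.
        corrs = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda j: abs(corrs[j]))
        coeffs[k] += corrs[k]
        residual = [r - corrs[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs

def separate(y, atoms_1, atoms_2, n_iter=10):
    """Estimate each source as the part of y coded by its own sub-dictionary."""
    atoms = atoms_1 + atoms_2  # concatenated dictionary [D1, D2]
    coeffs = matching_pursuit(y, atoms, n_iter)
    n1 = len(atoms_1)

    def synth(sub_atoms, sub_coeffs):
        return [sum(c * a[t] for c, a in zip(sub_coeffs, sub_atoms))
                for t in range(len(y))]

    return synth(atoms_1, coeffs[:n1]), synth(atoms_2, coeffs[n1:])

# Toy case with mutually orthogonal sub-dictionaries.
D1 = [normalize([1.0, 1.0, 0.0, 0.0])]
D2 = [normalize([0.0, 0.0, 1.0, -1.0])]
s1 = [2.0, 2.0, 0.0, 0.0]
s2 = [0.0, 0.0, 3.0, -3.0]
y = [a + b for a, b in zip(s1, s2)]
est_1, est_2 = separate(y, D1, D2)
```

In this toy case the two sub-dictionaries are orthogonal, so each estimate recovers its source up to rounding. When atoms of one speaker correlate with the other speaker's frames, part of a source is coded by the wrong sub-dictionary, which is the cross representation the text sets out to suppress.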

The rest of this paper is organized as follows. The dictionary-based SCSS model is introduced in Section II. A pitch state dependent dictionary design method is proposed in Section III, which includes two stages: pitch state dependent sub-dictionary learning and sub-dictionary concatenation. Experimental results are given in Section IV. Finally, conclusions are summarized in Section V.

II. DICTIONARY-BASED SCSS MODEL

We consider the linear SCSS problem in which two underlying speech sources $s_i$, $i = 1, 2$, need to be recovered from a single speech mixture

$$y(t) = \sum_{i=1}^{2} s_i(t) \qquad (1)$$

Generally, separation is performed frame by frame with a window length of $l$ and overlap between adjacent frames.
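The mixing and framing steps can be sketched as below, assuming the mixture is the plain sum of the two sources and that frames are cut with a rectangular window. The overlap is exposed as a hop size because its value is not recoverable from this excerpt; the frame length and hop used here are illustrative only, and the function names are hypothetical.

```python
def mix(s1, s2):
    """Linear single-channel mixture y(t) = s1(t) + s2(t)."""
    if len(s1) != len(s2):
        raise ValueError("sources must have the same length")
    return [a + b for a, b in zip(s1, s2)]

def split_frames(signal, frame_len, hop):
    """Cut a signal into frames of frame_len samples, advancing by hop samples.

    Consecutive frames overlap by frame_len - hop samples. Trailing samples
    that do not fill a whole frame are dropped.
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must satisfy 0 < hop <= frame_len")
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

s1 = [1, 2, 3, 4, 5, 6]
s2 = [10, 20, 30, 40, 50, 60]
y = mix(s1, s2)
frames = split_frames(y, frame_len=4, hop=2)
```

With a frame length of 4 and a hop of 2, the six-sample mixture yields two frames that share two samples. Each frame is then separated independently.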

Traditionally, the dictionary-based method first learns the sub-dictionaries $D_i$, $i = 1, 2$, and then generates the dictionary
