MATLAB Examples: A Practical Guide to Speech and Audio Processing

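Updated 2024-07-21. PDF, 2.12 MB.

"Applied Speech and Audio Processing: With MATLAB Examples is a practical textbook focused on speech and hearing research. Built around MATLAB, it offers a one-stop resource. The book introduces the core techniques of speech and audio processing and uses numerous MATLAB examples to help readers understand and practise them. It first covers basic audio processing and the characteristics of speech, laying a solid foundation for the more advanced speech signal processing that follows. The author explains audio processing, coding, compression and analysis in detail so that readers can apply these key techniques in practice. In the final chapter, the author explores a range of advanced topics such as psychoacoustic modelling, a key area that underpinned the development of MP3 and other audio formats; psychoacoustic models are essential for understanding human auditory perception and audio data compression. The author, Ian McLoughlin, is an associate professor in the School of Computer Engineering at Nanyang Technological University, Singapore, and his professional background gives the book both academic depth and practical value. Graduate students and professionals working on speech or audio systems can all benefit from it, learning hands-on MATLAB techniques for speech and audio processing through practice. The book is therefore suitable for theoretical study and is also a valuable reference for technical development and project work."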

Introduction

a periodic clock signal fed to the ADC and DAC, although there is no reason why both need the same sample rate – digital processing can be used to change sample rate. Using the well-known Nyquist criterion, the highest frequency that can be unambiguously represented by such a stream of samples is half of the sampling rate.
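As an illustrative sketch of the Nyquist limit (the 8 kHz sample rate and the two tone frequencies are assumptions chosen for this example, not values from the text), a 5 kHz cosine sampled at 8 kHz yields the same samples as a 3 kHz cosine, so the two cannot be told apart after sampling:

```matlab
% Aliasing demonstration: a tone above fs/2 is indistinguishable from
% its alias below fs/2 once sampled. All values are illustrative.
fs      = 8000;              % sample rate in Hz (assumed)
n       = 0:fs-1;            % one second of sample indices
f_hi    = 5000;              % tone above the Nyquist limit of fs/2 = 4000 Hz
f_alias = fs - f_hi;         % expected alias frequency: 3000 Hz

x_hi    = cos(2*pi*f_hi*n/fs);      % samples of the 5 kHz tone
x_alias = cos(2*pi*f_alias*n/fs);   % samples of the 3 kHz tone

% Largest sample-by-sample difference between the two sequences
max_diff = max(abs(x_hi - x_alias));
fprintf('Largest difference between the two sample streams: %g\n', max_diff);
```

The difference is at the level of floating-point rounding, which is why content above half the sample rate is normally filtered out before the ADC.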

Samples themselves, as delivered by an ADC, are generally fixed point with a resolution of 16 bits, although 20 bits and even up to 24 bits are found in high-end audio systems. Handling these on a computer could utilise either fixed or floating point representation (fixed point meaning each sample is a scaled integer, while floating point allows fractional representation), with a general rule of thumb for reasonable quality being that 20 bits fixed point resolution is desirable for performing processing operations in a system with 16-bit input and output.

In the absence of other factors, the general rule is that an n-bit uniformly sampled digital audio signal will have a dynamic range (the ratio of the biggest amplitude that can be represented in the system to the smallest one) of, at best:

DR(dB) = 6.02 × n    (1.1)
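Equation (1.1) can be checked numerically. The sketch below assumes that the ratio between the largest and smallest representable amplitudes of an n-bit signal is approximately 2^n, and compares 20 log10(2^n) with the 6.02 × n rule. The list of bit depths is illustrative.

```matlab
% Dynamic range of an n-bit uniformly quantised signal.
% Assumption: amplitude ratio of roughly 2^n between largest and smallest value.
bits     = [8 12 14 16 20 24];       % illustrative resolutions
dr_rule  = 6.02 * bits;              % rule of thumb from equation (1.1)
dr_exact = 20 * log10(2 .^ bits);    % amplitude ratio expressed in dB

for k = 1:numel(bits)
    fprintf('%2d bits: rule %.2f dB, 20*log10(2^n) %.2f dB\n', ...
            bits(k), dr_rule(k), dr_exact(k));
end
```

The two columns agree to within a few hundredths of a decibel because 20 log10(2) is approximately 6.02.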

For telephone-quality speech, resolutions as low as 8–12 bits are possible depending on the application. For GSM-type mobile phones, 14 bits is common. Telephone-quality, often referred to as toll-quality, is perfectly reasonable for vocal communications, but is not perceived as being of particularly high quality. For this reason, more modern vocal communication systems have tended to move beyond 8 bits sample resolution in practice.

Sample rates vary widely from 7.2 kHz or 8 kHz for telephone-quality audio to 44.1 kHz for CD-quality audio. Long-play style digital audio systems occasionally opt for 32 kHz, and high-quality systems use 48 kHz. A recent trend is to double this to 96 kHz. It is debatable whether a sampling rate of 96 kHz is at all useful to the human ear, which typically cannot resolve signals beyond about 18 kHz, apart from the rare listeners having golden ears. However, such systems may be more pet-friendly: dogs are reportedly able to hear up to 44 kHz and cats up to almost 80 kHz.

The die-hard audio enthusiasts who prefer valve amplifiers, pay several years’ salary for a pair of loudspeakers, and often claim they can hear above 20 kHz, are usually known as having golden ears.

Infobox 1.1 Audio fidelity

Something to note is the inexactness of the entire conversion process: what you hear is a wave impinging on the eardrum, but what you obtain on the computer has travelled some way through air, possibly bounced past several obstructions, hit a microphone, vibrated a membrane, been converted to an electrical signal, amplified, and then sampled. Amplifiers add noise, distortion, and are not entirely linear. Microphones are usually far worse on all counts. Analogue-to-digital converters also suffer linearity errors, add noise, distortion, and introduce quantisation error due to the precision of their voltage sampling process. The result of all this is a computerised sequence of samples that may not be as closely related to the real-world sound as you might expect. Do not be surprised when high-precision analysis or measurements are unrepeatable due to noise, or if delicate changes made to a sampled audio signal are undetectable to the naked ear upon replay.


Table 1.1. Sampling characteristics of common applications.

Application           Sample rate, resolution   Used how
telephony             8 kHz, 8–12 bits          64 kbps A-law or µ-law
voice conferencing    16 kHz, 14–16 bits        64 kbps SB-ADPCM
mobile phone          8 kHz, 14–16 bits         13 kbps GSM
private mobile radio  8 kHz, 12–16 bits         <5 kbps, e.g. TETRA
long-play audio       32 kHz, 14–16 bits        minidisc, DAT, MP3
CD audio              44.1 kHz, 16–24 bits      stored on CDs
studio audio          48 kHz, 16–24 bits        CD mastering
very high end         96 kHz, 20–24 bits        for golden ears listening

Sample rates and sampling precisions for several common applications, for humans at least, are summarised in Table 1.1.
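The 64 kbps figure listed for telephony in Table 1.1 follows directly from the sampling parameters: 8000 samples per second at 8 bits each. A minimal sketch of this arithmetic is given here; the helper name raw_bitrate and the two-channel, 16-bit CD case are assumptions added for illustration and are not taken from the table.

```matlab
% Raw (uncompressed) bit rate = sample rate x bits per sample x channels.
% raw_bitrate is a hypothetical helper defined only for this example.
raw_bitrate = @(fs, nbits, nchannels) fs * nbits * nchannels;

telephony_bps = raw_bitrate(8000, 8, 1);     % 8-bit companded telephony, mono
cd_bps        = raw_bitrate(44100, 16, 2);   % assumed 16-bit stereo CD audio

fprintf('telephony: %g kbps\n', telephony_bps / 1000);
fprintf('CD audio:  %g kbps\n', cd_bps / 1000);
```

Rates such as 13 kbps for GSM are far below the raw product of sample rate and resolution, which reflects the compression applied by the speech coder.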

1.4 Summary

Most of the technological detail related to the conversion and transmission process is outside the scope of this book, although some excellent resources covering this can be found in the bibliography. Generally, the audio processing specialist is fortunate enough to be able to work with digital audio without being too concerned with how it was captured, or how it will be replayed. Thus, we will confine our discussions throughout the remainder of this text primarily to the processing/storage/transmission, recognition/analysis and synthesis/generation blocks in Figure 1.1, ignoring the messy analogue detail.

Sound, as known to humans, has several attributes. These include time-domain attributes of duration, rhythm, attack and decay, but also frequency-domain attributes of tone and pitch. Other, less well-defined attributes include quality, timbre and tonality. Often, a sound wave conveys meaning: for example a fire alarm, the roar of a lion, the cry of a baby, a peal of thunder or a national anthem.

However, as we have seen, sound sampled by an ADC (at least the more common pulse code modulation-based ADCs) is simply represented as a vector of samples, with each element in the vector representing the amplitude at that particular instant of time. The remainder of this book attempts to bridge the gap between such a vector of numbers representing audio, and an understanding or interpretation of the meaning of that audio.

Basic audio processing

Audio is normally, and best, handled by MATLAB when stored as a vector of samples, with each individual value being a double-precision floating point number. A sampled sound can be completely specified by the sequence of these numbers plus one other item of information: the sample rate. In general, the majority of digital audio systems differ from this in only one major respect, and that is they tend to store the sequence of samples as fixed-point numbers instead. This can be a complicating factor for those other systems, but an advantage to MATLAB users, who have two fewer considerations to be concerned with when processing audio: namely overflow and underflow.
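To make the overflow and underflow point concrete, the following sketch contrasts the int16 type with double precision. The sample values are arbitrary. Note that MATLAB integer types saturate on overflow, whereas many fixed-point systems wrap around instead, so this only illustrates the general problem.

```matlab
% Overflow: the sum of two large 16-bit samples exceeds the int16 range.
a = int16(30000);
b = int16(10000);
fixed_sum = a + b;                    % saturates at intmax('int16') = 32767
float_sum = double(a) + double(b);    % 40000, represented without overflow

% Underflow: scaling a small 16-bit sample down loses it entirely.
small = int16(1);
fixed_quarter = small / int16(4);     % 0.25 rounds to 0 in integer arithmetic
float_quarter = double(small) / 4;    % 0.25 is retained in double precision

fprintf('int16 sum: %d, double sum: %d\n', fixed_sum, float_sum);
fprintf('int16 quarter: %d, double quarter: %g\n', fixed_quarter, float_quarter);
```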

Any operation that MATLAB can perform on a vector can, in theory, be performed on stored audio. The audio vector can be loaded and saved in the same way as any other MATLAB variable, processed, added, plotted, and so on. However, there are of course some special considerations when dealing with audio that need to be discussed within this chapter, as a foundation for the processing and analysis discussed in the later chapters.
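As a small sketch of treating audio as an ordinary vector, the following lines build a test tone and apply a few plain vector operations to it. The sample rate, frequencies and variable names are assumptions chosen for illustration.

```matlab
fs   = 8000;                          % sample rate in Hz (assumed)
t    = (0:fs-1)' / fs;                % one second of time instants, column vector
tone = 0.5 * sin(2*pi*440*t);         % 440 Hz tone at half of full scale

quieter  = 0.5 * tone;                        % scaling: halve the amplitude
mixed    = tone + 0.25 * sin(2*pi*880*t);     % addition: mix in a second tone
reversed = flipud(tone);                      % reordering: play backwards
twice    = [tone; tone];                      % concatenation: repeat the sound

fprintf('%d samples, i.e. %.1f s at %d Hz\n', ...
        numel(twice), numel(twice) / fs, fs);
```

Each result is still just a vector of doubles, so it can be plotted, saved or processed further like any other variable.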

This chapter begins with an overview of audio input and output in MATLAB, including recording and playback, before considering scaling issues, basic processing methods, then aspects of continuous analysis and processing. A section on visualisation covers the main time- and frequency-domain plotting techniques. Finally, methods of generating sounds and noise are given.

2.1 Handling audio in MATLAB

Given a high enough sample rate, the double precision vector has sufficient resolution for almost any type of processing that may need to be performed – meaning that one can usually safely ignore quantisation issues when in the MATLAB environment. However, there are potential resolution and quantisation concerns when dealing with input to and output from MATLAB, since these will normally be in a fixed-point format. We shall thus discuss input and output: first, audio recording and playback, and then audio file handling in MATLAB.
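The boundary between fixed-point input/output and double-precision processing can be sketched as follows. The scaling by 32768 is a common convention for 16-bit audio and is an assumption here, as are the sample values and the halving step used as example processing.

```matlab
% 16-bit fixed-point samples, as they might arrive from a file or sound card.
pcm_in = int16([-32768 -16384 0 16384 32767]);

% Scale to double precision in the range [-1, 1) for processing.
x = double(pcm_in) / 32768;

% Example processing step: halve the amplitude (about -6 dB).
y = 0.5 * x;

% Convert back to 16-bit fixed point for output. Rounding at this point is
% where quantisation error re-enters (at most half a quantisation step).
pcm_out   = int16(round(y * 32768));
quant_err = max(abs(double(pcm_out) / 32768 - y));

fprintf('Largest quantisation error on output: %g\n', quant_err);
```

Inside the double-precision stage the halving is exact; only the final conversion back to 16 bits forces the last sample to be rounded.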


2.1.1 Recording sound

Recording sound directly in MATLAB requires the user to specify the number of samples to record, the sample rate, number of channels and sample format. For example, to record a vector of double precision floating point samples on a computer with attached or integrated microphone, the following MATLAB command may be issued:

speech = wavrecord(16000, 8000, 1, 'double');

This records 16 000 samples with a sample rate of 8 kHz, and places them into a 16 000 element vector named speech. The ‘1’ argument specifies that the recording is mono rather than stereo. This command only works under Windows, so under Linux or MacOS it is best to use either the MATLAB audiorecorder() function, or use a separate audio application to record audio (such as the excellent open source Audacity tool), saving the recorded sound as an audio file, to be loaded into MATLAB as we shall see shortly.

Infobox 2.1 Audio file formats

Wave: The wave file format is usually identified by the file extension .wav, and actually can hold many different types of audio data identified by a header field at the beginning of the file. Most importantly, the sampling rate, number of channels and number of bits in each sample are also specified. This makes the format very easy to use compared to other formats that do not specify such information, and thankfully this format is recognised by MATLAB. Normally for audio work, the wave file would contain PCM data, with a single channel (mono), and 16 bits per sample. Sample rate could vary from 8000 Hz up to 48 000 Hz. Some older PC sound cards are limited in the sample rates they support, but 8000 Hz and 44 100 Hz are always supported. 16 000 Hz, 24 000 Hz, 32 000 Hz and 48 000 Hz are also reasonably common.

PCM and RAW hold streams of pulse code modulation data with no headers or gaps. They are assumed to be single channel (mono) but the sample rate and number of bits per sample are not specified in the file – the audio researcher must remember what these are for each .pcm or .raw file that he or she keeps. These can be read from and written to by MATLAB, but are not supported as a distinct audio file format. However, these have historically been the formats of choice for audio researchers, probably because research software written in C, C++ and other languages can most easily handle this format.

A-law and µ-law are logarithmically compressed audio samples in byte format. Each byte represents something like 12 bits in equivalent linear PCM format. This is commonly used in telecommunications where the sample rate is 8 kHz. Again, however, the .au file extension (which is common on UNIX machines, and supported under Linux) does not contain any information on sample rate, so the audio researcher must remember this. MATLAB does support this format natively.

Other formats include those for compressed music such as MP3 (see Infobox: Music file formats on page 11), MP4, specialised musical instrument formats such as MIDI (musical instrument digital interface) and several hundred different proprietary audio formats.
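Because headerless PCM and RAW files carry no format information, reading one means supplying the sample format and sample rate by hand. The following sketch writes a short 16-bit mono test signal to a temporary .raw file and reads it back with the general-purpose file functions. The file name, the little-endian byte order, the 8 kHz rate and the test tone are all assumptions made for this example.

```matlab
fs  = 8000;                                   % not stored in the file (assumed)
pcm = int16(round(32767 * 0.5 * sin(2*pi*440*(0:fs-1)'/fs)));   % test tone

% Write headerless 16-bit little-endian samples to a temporary file.
rawfile = [tempname() '.raw'];
fid = fopen(rawfile, 'w', 'ieee-le');
fwrite(fid, pcm, 'int16');
fclose(fid);

% Read them back; the reader must already know the format is int16, mono.
fid = fopen(rawfile, 'r', 'ieee-le');
samples = fread(fid, inf, 'int16=>int16');
fclose(fid);
delete(rawfile);

% Scale to double precision for processing.
audio = double(samples) / 32768;
fprintf('Read %d samples, %.1f s at the assumed rate of %d Hz\n', ...
        numel(samples), numel(samples) / fs, fs);
```

If the wrong sample width or byte order is assumed on reading, the file still loads without error but the result is noise, which is the practical cost of a format with no header.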

If using the audiorecorder() function, the procedure is first to create an audio recorder object, specifying sample rate, sample precision in bits, and number of channels, then to begin recording:
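The excerpt ends before the author's own listing, so what follows is only a minimal sketch of the procedure just described, not the book's code. It assumes a working audio input device, uses the 8 kHz mono settings of the earlier wavrecord() example with 16-bit precision, and falls back to a silent placeholder vector if recording is not possible. It is also the more portable route, since wavrecord() has been removed from recent MATLAB releases.

```matlab
fs        = 8000;    % sample rate in Hz
nbits     = 16;      % sample precision in bits
nchannels = 1;       % mono
duration  = 2;       % seconds to record (assumed for this example)
expected_samples = duration * fs;

try
    % Create the recorder object, then record for a fixed time.
    recorder = audiorecorder(fs, nbits, nchannels);
    recordblocking(recorder, duration);
    % Retrieve the recording as a double precision column vector.
    speech = getaudiodata(recorder, 'double');
catch err
    % No usable input device: keep the example runnable with silence.
    fprintf('Recording unavailable (%s); using silence.\n', err.message);
    speech = zeros(expected_samples, 1);
end

fprintf('speech holds %d samples\n', numel(speech));
```

Here recordblocking() returns only once the requested duration has elapsed, which keeps the example sequential.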
