# 音频特征提取

> 原文：[`pytorch.org/audio/stable/tutorials/audio_feature_extractions_tutorial.html`](https://pytorch.org/audio/stable/tutorials/audio_feature_extractions_tutorial.html)
>
> 译者：[飞龙](https://github.com/wizardforcel)
>
> 协议：[CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)


注意

点击这里下载完整示例代码

**作者**：Moto Hira

`torchaudio` 实现了在音频领域常用的特征提取。它们可以在 `torchaudio.functional` 和 `torchaudio.transforms` 中找到。

`functional` 将功能实现为独立的函数。它们是无状态的。

`transforms` 将功能实现为对象，使用来自 `functional` 和 `torch.nn.Module` 的实现。它们可以使用 TorchScript 进行序列化。

```py
import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

print(torch.__version__)
print([torchaudio.__version__](https://docs.python.org/3/library/stdtypes.html#str "builtins.str"))

import librosa
import matplotlib.pyplot as plt 
```

```py
2.2.0
2.2.0 
```

## 音频特征概述

以下图表显示了常见音频特征与 torchaudio API 之间的关系，以生成它们。

![`download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png`](img/82ba49f78e3cd14b6e337acaf57b11e2.png)

有关可用功能的完整列表，请参阅文档。

## 准备工作

注意

在 Google Colab 中运行此教程时，请安装所需的软件包

```py
!pip install librosa 
```

```py
from IPython.display import Audio
from matplotlib.patches import Rectangle
from torchaudio.utils import download_asset

[torch.random.manual_seed](https://pytorch.org/docs/stable/generated/torch.manual_seed.html#torch.manual_seed "torch.manual_seed")(0)

[SAMPLE_SPEECH](https://docs.python.org/3/library/stdtypes.html#str "builtins.str") = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")

def plot_waveform(waveform, sr, title="Waveform", ax=None):
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape
    time_axis = [torch.arange](https://pytorch.org/docs/stable/generated/torch.arange.html#torch.arange "torch.arange")(0, num_frames) / sr

    if ax is None:
        _, ax = plt.subplots(num_channels, 1)
    ax.plot(time_axis, waveform[0], linewidth=1)
    ax.grid(True)
    ax.set_xlim([0, time_axis[-1]])
    ax.set_title(title)

def plot_spectrogram(specgram, title=None, ylabel="freq_bin", ax=None):
    if ax is None:
        _, ax = plt.subplots(1, 1)
    if title is not None:
        ax.set_title(title)
    ax.set_ylabel(ylabel)
    ax.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto", interpolation="nearest")

def plot_fbank(fbank, title=None):
    fig, [axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = plt.subplots(1, 1)
    [axs.set_title](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")(title or "Filter bank")
    [axs.imshow](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")(fbank, aspect="auto")
    [axs.set_ylabel](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")("frequency bin")
    [axs.set_xlabel](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")("mel bin") 
```

## 频谱图

要获取随时间变化的音频信号的频率构成，可以使用 `torchaudio.transforms.Spectrogram()`。

```py
# Load audio
[SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int") = torchaudio.load([SAMPLE_SPEECH](https://docs.python.org/3/library/stdtypes.html#str "builtins.str"))

# Define transform
spectrogram = [T.Spectrogram](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=512)

# Perform transform
[spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
fig, [axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = plt.subplots(2, 1)
plot_waveform([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int"), title="Original waveform", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[0])
plot_spectrogram([spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], title="spectrogram", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[1])
fig.tight_layout() 
```

![原始波形，频谱图](img/787b5dbf919118f579d77973c5a30652.png)

```py
Audio([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor").numpy(), rate=[SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int")) 
```

您的浏览器不支持音频元素。

### `n_fft` 参数的影响

频谱图计算的核心是（短时）傅立叶变换，`n_fft` 参数对应于以下离散傅立叶变换定义中的 $N$。

$$ X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk} $$

（有关傅立叶变换的详细信息，请参阅[Wikipedia](https://en.wikipedia.org/wiki/Fast_Fourier_transform)。

`n_fft` 的值决定了频率轴的分辨率。然而，使用更高的 `n_fft` 值时，能量将分布在更多的箱中，因此在可视化时，它可能看起来更模糊，即使它们具有更高的分辨率。

以下是说明;

注意

`hop_length` 决定了时间轴的分辨率。默认情况下（即 `hop_length=None` 和 `win_length=None`），使用 `n_fft // 4` 的值。在这里，我们在不同的 `n_fft` 上使用相同的 `hop_length` 值，以便它们在时间轴上具有相同数量的元素。

```py
[n_ffts](https://docs.python.org/3/library/stdtypes.html#list "builtins.list") = [32, 128, 512, 2048]
[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int") = 64

[specs](https://docs.python.org/3/library/stdtypes.html#list "builtins.list") = []
for [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") in [n_ffts](https://docs.python.org/3/library/stdtypes.html#list "builtins.list"):
    spectrogram = [T.Spectrogram](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"), [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int")=[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"))
    [spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"))
    [specs](https://docs.python.org/3/library/stdtypes.html#list "builtins.list").append([spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
fig, [axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = plt.subplots(len([specs](https://docs.python.org/3/library/stdtypes.html#list "builtins.list")), 1, sharex=True)
for [i](https://docs.python.org/3/library/functions.html#int "builtins.int"), ([spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")) in enumerate(zip([specs](https://docs.python.org/3/library/stdtypes.html#list "builtins.list"), [n_ffts](https://docs.python.org/3/library/stdtypes.html#list "builtins.list"))):
    plot_spectrogram([spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], ylabel=f"n_fft={[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")}", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[[i](https://docs.python.org/3/library/functions.html#int "builtins.int")])
    [axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[[i](https://docs.python.org/3/library/functions.html#int "builtins.int")].set_xlabel(None)
fig.tight_layout() 
```

![音频特征提取教程](img/ac72de68cdabfdc2ad8f166dcb01c27c.png)

在比较信号时，最好使用相同的采样率，但是如果必须使用不同的采样率，则必须小心解释 `n_fft` 的含义。回想一下，`n_fft` 决定了给定采样率的频率轴的分辨率。换句话说，频率轴上的每个箱代表的内容取决于采样率。

正如我们上面所看到的，改变 `n_fft` 的值并不会改变相同输入信号的频率范围的覆盖。

让我们对音频进行下采样，并使用相同的 `n_fft` 值应用频谱图。

```py
# Downsample to half of the original sample rate
[speech2](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = torchaudio.functional.resample([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int") // 2)
# Upsample to the original sample rate
[speech3](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = torchaudio.functional.resample([speech2](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int") // 2, [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int")) 
```

```py
# Apply the same spectrogram
spectrogram = [T.Spectrogram](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=512)

[spec0](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"))
[spec2](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([speech2](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"))
[spec3](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([speech3](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
# Visualize it
fig, [axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = plt.subplots(3, 1)
plot_spectrogram([spec0](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], ylabel="Original", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[0])
[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[0].add_patch(Rectangle((0, 3), 212, 128, edgecolor="r", facecolor="none"))
plot_spectrogram([spec2](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], ylabel="Downsampled", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[1])
plot_spectrogram([spec3](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], ylabel="Upsampled", ax=[axs](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[2])
fig.tight_layout() 
```

![音频特征提取教程](img/e78e0e1cf66866b1f0dd8dadbb0e8612.png)

在上述可视化中，第二个图（“下采样”）可能会给人一种频谱图被拉伸的印象。这是因为频率箱的含义与原始的不同。即使它们具有相同数量的箱，在第二个图中，频率仅覆盖到原始采样率的一半。如果我们再次对下采样信号进行重采样，使其具有与原始信号相同的采样率，这一点将变得更加清晰。

## GriffinLim

要从频谱图中恢复波形，可以使用`torchaudio.transforms.GriffinLim`。

必须使用与频谱图相同的参数集。

```py
# Define transforms
[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") = 1024
spectrogram = [T.Spectrogram](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"))
griffin_lim = [T.GriffinLim](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"))

# Apply the transforms
[spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = spectrogram([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"))
[reconstructed_waveform](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = griffin_lim([spec](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
_, [axes](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = plt.subplots(2, 1, sharex=True, sharey=True)
plot_waveform([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int"), title="Original", ax=[axes](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[0])
plot_waveform([reconstructed_waveform](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int"), title="Reconstructed", ax=[axes](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[1])
Audio([reconstructed_waveform](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), rate=[SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int")) 
```

![原始，重建](img/32c66b5b578711def753dbb923cb7f66.png)

您的浏览器不支持音频元素。

## 梅尔滤波器组

`torchaudio.functional.melscale_fbanks()` 生成用于将频率箱转换为梅尔标度箱的滤波器组。

由于此函数不需要输入音频/特征，因此在`torchaudio.transforms()`中没有等效的转换。

```py
[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") = 256
[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int") = 64
[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int") = 6000

[mel_filters](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = F.melscale_fbanks(
    int([n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") // 2 + 1),
    [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    f_min=0.0,
    f_max=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int") / 2.0,
    [sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int")=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    norm="slaney",
) 
```

```py
plot_fbank([mel_filters](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), "Mel Filter Bank - torchaudio") 
```

![梅尔滤波器组 - torchaudio](img/bd8afdf50a081e6142ab13cdd7cdbd51.png)

### 与 librosa 的比较

作为参考，这里是使用`librosa`获取梅尔滤波器组的等效方法。

```py
[mel_filters_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = librosa.filters.mel(
    sr=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    fmin=0.0,
    fmax=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int") / 2.0,
    norm="slaney",
    htk=True,
).T 
```

```py
plot_fbank([mel_filters_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray"), "Mel Filter Bank - librosa")

[mse](https://docs.python.org/3/library/functions.html#float "builtins.float") = [torch.square](https://pytorch.org/docs/stable/generated/torch.square.html#torch.square "torch.square")([mel_filters](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") - [mel_filters_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")).mean().item()
print("Mean Square Difference: ", [mse](https://docs.python.org/3/library/functions.html#float "builtins.float")) 
```

![梅尔滤波器组 - librosa](img/0cf17a8f91bb1c63d22591f5bf2b7ccb.png)

```py
Mean Square Difference:  3.934872696751886e-17 
```

## 梅尔频谱图

生成梅尔标度频谱图涉及生成频谱图并执行梅尔标度转换。在`torchaudio`中，`torchaudio.transforms.MelSpectrogram()` 提供了这种功能。

```py
[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") = 1024
win_length = None
[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int") = 512
[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int") = 128

mel_spectrogram = [T.MelSpectrogram](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")(
    [sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int")=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    win_length=win_length,
    [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int")=[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    center=True,
    pad_mode="reflect",
    power=2.0,
    norm="slaney",
    [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    mel_scale="htk",
)

[melspec](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = mel_spectrogram([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
plot_spectrogram([melspec](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")[0], title="MelSpectrogram - torchaudio", ylabel="mel freq") 
```

![梅尔频谱图 - torchaudio](img/3292985cec53c36aa443a745edd38599.png)

### 与 librosa 的比较

作为参考，这里是使用`librosa`生成梅尔标度频谱图的等效方法。

```py
[melspec_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = librosa.feature.melspectrogram(
    y=[SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor").numpy()[0],
    sr=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int")=[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    win_length=win_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
    [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    norm="slaney",
    htk=True,
) 
```

```py
plot_spectrogram([melspec_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray"), title="MelSpectrogram - librosa", ylabel="mel freq")

[mse](https://docs.python.org/3/library/functions.html#float "builtins.float") = [torch.square](https://pytorch.org/docs/stable/generated/torch.square.html#torch.square "torch.square")([melspec](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") - [melspec_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")).mean().item()
print("Mean Square Difference: ", [mse](https://docs.python.org/3/library/functions.html#float "builtins.float")) 
```

![梅尔频谱图 - librosa](img/a38262177175977c17a412eacff0306e.png)

```py
Mean Square Difference:  1.2895221557229775e-09 
```

## MFCC

```py
[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") = 2048
win_length = None
[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int") = 512
[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int") = 256
[n_mfcc](https://docs.python.org/3/library/functions.html#int "builtins.int") = 256

mfcc_transform = [T.MFCC](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")(
    [sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int")=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_mfcc](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mfcc](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    melkwargs={
        "n_fft": [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
        "n_mels": [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
        "hop_length": [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"),
        "mel_scale": "htk",
    },
)

[mfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = mfcc_transform([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

```py
plot_spectrogram([mfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], title="MFCC") 
```

![MFCC](img/8453f28c3b04b95f3edf395b34622a94.png)

### 与 librosa 的比较

```py
[melspec](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = librosa.feature.melspectrogram(
    y=[SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor").numpy()[0],
    sr=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    win_length=win_length,
    [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int")=[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mels](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    htk=True,
    norm=None,
)

[mfcc_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray") = librosa.feature.[mfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")(
    S=librosa.core.spectrum.power_to_db([melspec](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")),
    [n_mfcc](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_mfcc](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    dct_type=2,
    norm="ortho",
) 
```

```py
plot_spectrogram([mfcc_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray"), title="MFCC (librosa)")

[mse](https://docs.python.org/3/library/functions.html#float "builtins.float") = [torch.square](https://pytorch.org/docs/stable/generated/torch.square.html#torch.square "torch.square")([mfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") - [mfcc_librosa](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray "numpy.ndarray")).mean().item()
print("Mean Square Difference: ", [mse](https://docs.python.org/3/library/functions.html#float "builtins.float")) 
```

![MFCC (librosa)](img/94712cdb94274c1cacc312e88a22632c.png)

```py
Mean Square Difference:  0.8104011416435242 
```

## LFCC

```py
[n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int") = 2048
win_length = None
[hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int") = 512
[n_lfcc](https://docs.python.org/3/library/functions.html#int "builtins.int") = 256

lfcc_transform = [T.LFCC](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module "torch.nn.Module")(
    [sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int")=[sample_rate](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    [n_lfcc](https://docs.python.org/3/library/functions.html#int "builtins.int")=[n_lfcc](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    speckwargs={
        "n_fft": [n_fft](https://docs.python.org/3/library/functions.html#int "builtins.int"),
        "win_length": win_length,
        "hop_length": [hop_length](https://docs.python.org/3/library/functions.html#int "builtins.int"),
    },
)

[lfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = lfcc_transform([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"))
plot_spectrogram([lfcc](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], title="LFCC") 
```

![LFCC](img/099b6c76722336c25d2347c22bb1022a.png)

## 音高

```py
[pitch](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor") = F.detect_pitch_frequency([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int")) 
```

```py
def plot_pitch(waveform, sr, [pitch](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")):
    figure, axis = plt.subplots(1, 1)
    axis.set_title("Pitch Feature")
    axis.grid(True)

    end_time = waveform.shape[1] / sr
    time_axis = [torch.linspace](https://pytorch.org/docs/stable/generated/torch.linspace.html#torch.linspace "torch.linspace")(0, end_time, waveform.shape[1])
    axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

    axis2 = axis.twinx()
    time_axis = [torch.linspace](https://pytorch.org/docs/stable/generated/torch.linspace.html#torch.linspace "torch.linspace")(0, end_time, [pitch](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor").shape[1])
    axis2.plot(time_axis, [pitch](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")[0], linewidth=2, label="Pitch", color="green")

    axis2.legend(loc=0)

plot_pitch([SPEECH_WAVEFORM](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor"), [SAMPLE_RATE](https://docs.python.org/3/library/functions.html#int "builtins.int"), [pitch](https://pytorch.org/docs/stable/tensors.html#torch.Tensor "torch.Tensor")) 
```

![音高特征](img/1f2e9b4055fe894039bd0dc8f5a98bc2.png)

**脚本的总运行时间：**（0 分钟 9.372 秒）

`下载 Python 源代码：audio_feature_extractions_tutorial.py`

`下载 Jupyter 笔记本：audio_feature_extractions_tutorial.ipynb`

[Sphinx-Gallery 生成的画廊](https://sphinx-gallery.github.io)