# Audio Feature Extractions

> Original: [`pytorch.org/audio/stable/tutorials/audio_feature_extractions_tutorial.html`](https://pytorch.org/audio/stable/tutorials/audio_feature_extractions_tutorial.html)
>
> Translator: [飞龙](https://github.com/wizardforcel)
>
> License: [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)

**Author**: Moto Hira

`torchaudio` implements feature extractions commonly used in the audio domain. They are available in `torchaudio.functional` and `torchaudio.transforms`.

`functional` implements features as standalone functions. They are stateless.

`transforms` implements features as objects, using implementations from `functional` and `torch.nn.Module`. They can be serialized using TorchScript.

```py
import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

print(torch.__version__)
print(torchaudio.__version__)

import librosa
import matplotlib.pyplot as plt
```

```py
2.2.0
2.2.0
```

## Overview of audio features

The following diagram shows the relationship between common audio features and the torchaudio APIs that generate them.

![`download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png`](img/82ba49f78e3cd14b6e337acaf57b11e2.png)

For the complete list of available features, please refer to the documentation.

## Preparation

Note: When running this tutorial in Google Colab, install the required packages.

```py
!pip install librosa
```

```py
from IPython.display import Audio
from matplotlib.patches import Rectangle
from torchaudio.utils import download_asset

torch.random.manual_seed(0)

SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")


def plot_waveform(waveform, sr, title="Waveform", ax=None):
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(0, num_frames) / sr

    if ax is None:
        _, ax = plt.subplots(num_channels, 1)
    # Only the first channel is plotted
    ax.plot(time_axis, waveform[0], linewidth=1)
    ax.grid(True)
    ax.set_xlim([0, time_axis[-1]])
    ax.set_title(title)


def plot_spectrogram(specgram, title=None, ylabel="freq_bin", ax=None):
    if ax is None:
        _, ax = plt.subplots(1, 1)
    if title is not None:
        ax.set_title(title)
    ax.set_ylabel(ylabel)
    # Convert power to decibels for display
    ax.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto", interpolation="nearest")


def plot_fbank(fbank, title=None):
    fig, axs = plt.subplots(1, 1)
    axs.set_title(title or "Filter bank")
    axs.imshow(fbank, aspect="auto")
    axs.set_ylabel("frequency bin")
    axs.set_xlabel("mel bin")
```

## Spectrogram

To get the frequency make-up of an audio signal as it varies with time, you can use `torchaudio.transforms.Spectrogram()`.

```py
# Load audio
SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH)

# Define transform
spectrogram = T.Spectrogram(n_fft=512)

# Perform transform
spec = spectrogram(SPEECH_WAVEFORM)
```

```py
fig, axs = plt.subplots(2, 1)
plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original waveform", ax=axs[0])
plot_spectrogram(spec[0], title="spectrogram", ax=axs[1])
fig.tight_layout()
```

![Original waveform, spectrogram](img/787b5dbf919118f579d77973c5a30652.png)

```py
Audio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE)
```

### The effect of the `n_fft` parameter

The core of the spectrogram computation is the (short-time) Fourier transform, and the `n_fft` parameter corresponds to $N$ in the following definition of the discrete Fourier transform.

$$ X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk} $$

(For details of the Fourier transform, please refer to [Wikipedia](https://en.wikipedia.org/wiki/Fast_Fourier_transform).)

The value of `n_fft` determines the resolution of the frequency axis: the spectrogram has `n_fft // 2 + 1` frequency bins. However, with a higher `n_fft` value, the energy is distributed among more bins, so the visualization may look blurrier even though the resolution is higher.

The following illustrates this.

Note: `hop_length` determines the resolution of the time axis. By default (i.e. `hop_length=None` and `win_length=None`), `win_length` is set to `n_fft` and `hop_length` to `win_length // 2`, so the value `n_fft // 2` is used. Here, we use the same `hop_length` value across different `n_fft` values so that the spectrograms have the same number of elements along the time axis.

```py
n_ffts = [32, 128, 512, 2048]
hop_length = 64

specs = []
for n_fft in n_ffts:
    spectrogram = T.Spectrogram(n_fft=n_fft, hop_length=hop_length)
    spec = spectrogram(SPEECH_WAVEFORM)
    specs.append(spec)
```

```py
fig, axs = plt.subplots(len(specs), 1, sharex=True)
for i, (spec, n_fft) in enumerate(zip(specs, n_ffts)):
    plot_spectrogram(spec[0], ylabel=f"n_fft={n_fft}", ax=axs[i])
    axs[i].set_xlabel(None)
fig.tight_layout()
```

![audio feature extractions tutorial](img/ac72de68cdabfdc2ad8f166dcb01c27c.png)

When comparing signals, it is desirable to use the same sampling rate. However, if you must use different sampling rates, take care when interpreting the meaning of `n_fft`. Recall that `n_fft` determines the resolution of the frequency axis for a given sampling rate. In other words, what each bin on the frequency axis represents depends on the sampling rate.

As we have seen above, changing the value of `n_fft` does not change the coverage of the frequency range for the same input signal.

Let's downsample the audio and apply the spectrogram with the same `n_fft` value.

```py
# Downsample to half of the original sample rate
speech2 = torchaudio.functional.resample(SPEECH_WAVEFORM, SAMPLE_RATE, SAMPLE_RATE // 2)
# Upsample to the original sample rate
speech3 = torchaudio.functional.resample(speech2, SAMPLE_RATE // 2, SAMPLE_RATE)
```

```py
# Apply the same spectrogram
spectrogram = T.Spectrogram(n_fft=512)

spec0 = spectrogram(SPEECH_WAVEFORM)
spec2 = spectrogram(speech2)
spec3 = spectrogram(speech3)
```

```py
# Visualize it
fig, axs = plt.subplots(3, 1)
plot_spectrogram(spec0[0], ylabel="Original", ax=axs[0])
# Mark the region of the original that the downsampled signal covers
axs[0].add_patch(Rectangle((0, 3), 212, 128, edgecolor="r", facecolor="none"))
plot_spectrogram(spec2[0], ylabel="Downsampled", ax=axs[1])
plot_spectrogram(spec3[0], ylabel="Upsampled", ax=axs[2])
fig.tight_layout()
```

![audio feature extractions tutorial](img/e78e0e1cf66866b1f0dd8dadbb0e8612.png)

In the visualization above, the second plot ("Downsampled") might give the impression that the spectrogram is stretched. This is because the meaning of the frequency bins differs from the original. Even though both spectrograms have the same number of bins, the frequency axis of the second plot covers only the lower half of the original frequency range, since the downsampled signal has half the sampling rate and therefore half the Nyquist frequency. This becomes clearer when the downsampled signal is resampled back to the original sampling rate, as in the third plot ("Upsampled").

## GriffinLim

To recover a waveform from a spectrogram, you can use `torchaudio.transforms.GriffinLim`.

The same set of parameters used for the spectrogram must be used. Because the power spectrogram carries no phase information, the phase has to be estimated, and the reconstructed waveform is an approximation of the original.

```py
# Define transforms
n_fft = 1024
spectrogram = T.Spectrogram(n_fft=n_fft)
griffin_lim = T.GriffinLim(n_fft=n_fft)

# Apply the transforms
spec = spectrogram(SPEECH_WAVEFORM)
reconstructed_waveform = griffin_lim(spec)
```

```py
_, axes = plt.subplots(2, 1, sharex=True, sharey=True)
plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original", ax=axes[0])
plot_waveform(reconstructed_waveform, SAMPLE_RATE, title="Reconstructed", ax=axes[1])
Audio(reconstructed_waveform, rate=SAMPLE_RATE)
```

![Original, Reconstructed](img/32c66b5b578711def753dbb923cb7f66.png)

## Mel Filter Bank

`torchaudio.functional.melscale_fbanks()` generates the filter bank for converting frequency bins to mel-scale bins.

Since this function does not require input audio or features, there is no equivalent transform in `torchaudio.transforms`.

```py
n_fft = 256
n_mels = 64
sample_rate = 6000

mel_filters = F.melscale_fbanks(
    int(n_fft // 2 + 1),
    n_mels=n_mels,
    f_min=0.0,
    f_max=sample_rate / 2.0,
    sample_rate=sample_rate,
    norm="slaney",
)
```

```py
plot_fbank(mel_filters, "Mel Filter Bank - torchaudio")
```

![Mel Filter Bank - torchaudio](img/bd8afdf50a081e6142ab13cdd7cdbd51.png)

### Comparison against librosa

For reference, here is the equivalent way to get the mel filter bank with `librosa`.

```py
mel_filters_librosa = librosa.filters.mel(
    sr=sample_rate,
    n_fft=n_fft,
    n_mels=n_mels,
    fmin=0.0,
    fmax=sample_rate / 2.0,
    norm="slaney",
    htk=True,
).T  # transpose to (n_freqs, n_mels) to match torchaudio
```

```py
plot_fbank(mel_filters_librosa, "Mel Filter Bank - librosa")

mse = torch.square(mel_filters - mel_filters_librosa).mean().item()
print("Mean Square Difference: ", mse)
```

![Mel Filter Bank - librosa](img/0cf17a8f91bb1c63d22591f5bf2b7ccb.png)

```py
Mean Square Difference:  3.934872696751886e-17
```

## MelSpectrogram

Generating a mel-scale spectrogram involves generating a spectrogram and performing mel-scale conversion. In `torchaudio`, `torchaudio.transforms.MelSpectrogram()` provides this functionality.

```py
n_fft = 1024
win_length = None
hop_length = 512
n_mels = 128

# NOTE: `sample_rate` is still the value (6000) defined in the mel filter
# bank section above, not the SAMPLE_RATE of the loaded audio.
mel_spectrogram = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
    norm="slaney",
    n_mels=n_mels,
    mel_scale="htk",
)

melspec = mel_spectrogram(SPEECH_WAVEFORM)
```

```py
plot_spectrogram(melspec[0], title="MelSpectrogram - torchaudio", ylabel="mel freq")
```

![MelSpectrogram - torchaudio](img/3292985cec53c36aa443a745edd38599.png)

### Comparison against librosa

For reference, here is the equivalent way to generate a mel-scale spectrogram with `librosa`.

```py
melspec_librosa = librosa.feature.melspectrogram(
    y=SPEECH_WAVEFORM.numpy()[0],
    sr=sample_rate,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
    n_mels=n_mels,
    norm="slaney",
    htk=True,
)
```

```py
plot_spectrogram(melspec_librosa, title="MelSpectrogram - librosa", ylabel="mel freq")

mse = torch.square(melspec - melspec_librosa).mean().item()
print("Mean Square Difference: ", mse)
```

![MelSpectrogram - librosa](img/a38262177175977c17a412eacff0306e.png)

```py
Mean Square Difference:  1.2895221557229775e-09
```

## MFCC

```py
n_fft = 2048
win_length = None
hop_length = 512
n_mels = 256
n_mfcc = 256

mfcc_transform = T.MFCC(
    sample_rate=sample_rate,
    n_mfcc=n_mfcc,
    melkwargs={
        "n_fft": n_fft,
        "n_mels": n_mels,
        "hop_length": hop_length,
        "mel_scale": "htk",
    },
)

mfcc = mfcc_transform(SPEECH_WAVEFORM)
```

```py
plot_spectrogram(mfcc[0], title="MFCC")
```

![MFCC](img/8453f28c3b04b95f3edf395b34622a94.png)

### Comparison against librosa

```py
melspec = librosa.feature.melspectrogram(
    y=SPEECH_WAVEFORM.numpy()[0],
    sr=sample_rate,
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    n_mels=n_mels,
    htk=True,
    norm=None,
)

mfcc_librosa = librosa.feature.mfcc(
    S=librosa.core.spectrum.power_to_db(melspec),
    n_mfcc=n_mfcc,
    dct_type=2,
    norm="ortho",
)
```

```py
plot_spectrogram(mfcc_librosa, title="MFCC (librosa)")

mse = torch.square(mfcc - mfcc_librosa).mean().item()
print("Mean Square Difference: ", mse)
```

![MFCC (librosa)](img/94712cdb94274c1cacc312e88a22632c.png)

```py
Mean Square Difference:  0.8104011416435242
```

## LFCC

```py
n_fft = 2048
win_length = None
hop_length = 512
n_lfcc = 256

lfcc_transform = T.LFCC(
    sample_rate=sample_rate,
    n_lfcc=n_lfcc,
    speckwargs={
        "n_fft": n_fft,
        "win_length": win_length,
        "hop_length": hop_length,
    },
)

lfcc = lfcc_transform(SPEECH_WAVEFORM)
plot_spectrogram(lfcc[0], title="LFCC")
```

![LFCC](img/099b6c76722336c25d2347c22bb1022a.png)

## Pitch

```py
pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE)
```

```py
def plot_pitch(waveform, sr, pitch):
    figure, axis = plt.subplots(1, 1)
    axis.set_title("Pitch Feature")
    axis.grid(True)

    # Waveform in the background, on the left axis
    end_time = waveform.shape[1] / sr
    time_axis = torch.linspace(0, end_time, waveform.shape[1])
    axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

    # Pitch estimate on a second y-axis sharing the same time axis
    axis2 = axis.twinx()
    time_axis = torch.linspace(0, end_time, pitch.shape[1])
    axis2.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green")

    axis2.legend(loc=0)


plot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch)
```

![Pitch Feature](img/1f2e9b4055fe894039bd0dc8f5a98bc2.png)
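
Returning to the `n_fft` and `hop_length` discussion in the spectrogram section, the arithmetic behind the frequency and time resolution can be sketched in plain Python. This is a minimal illustration, not part of the original tutorial: the helper names `spectrogram_shape` and `bin_width_hz` are hypothetical (they are not torchaudio APIs), and the 16 kHz sample rate and 54,400-sample length are assumed values. The sketch also assumes a one-sided STFT with `center=True`, which pads `n_fft // 2` samples on each side.

```python
from typing import Tuple


def spectrogram_shape(num_samples: int, n_fft: int, hop_length: int) -> Tuple[int, int]:
    """Return (num_freq_bins, num_frames) for a one-sided, center-padded STFT."""
    # A one-sided spectrum keeps the bins from 0 Hz up to the Nyquist frequency.
    num_freq_bins = n_fft // 2 + 1
    # center=True pads n_fft // 2 samples on each side (assumption stated above).
    padded = num_samples + 2 * (n_fft // 2)
    num_frames = 1 + (padded - n_fft) // hop_length
    return num_freq_bins, num_frames


def bin_width_hz(sample_rate: int, n_fft: int) -> float:
    """Width in Hz of one frequency bin: the same n_fft means different Hz at different rates."""
    return sample_rate / n_fft


# Assumed values for illustration only.
sample_rate = 16000
num_samples = 54400
hop_length = 64

for n_fft in [32, 128, 512, 2048]:
    bins, frames = spectrogram_shape(num_samples, n_fft, hop_length)
    width = bin_width_hz(sample_rate, n_fft)
    print(f"n_fft={n_fft}: {bins} bins x {frames} frames, {width:.2f} Hz per bin")

# Halving the sample rate with the same n_fft halves the width of each bin,
# so the same number of bins covers half the frequency range.
print(bin_width_hz(sample_rate, 512), bin_width_hz(sample_rate // 2, 512))
```

With a fixed `hop_length`, the number of frames stays the same for every even `n_fft`, while the number of frequency bins grows and each bin narrows. This matches the behavior shown in the `n_fft` comparison and the downsampling experiment.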