- en: References id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL zh: 参考文献 - en: 原文:[https://pytorch.org/audio/stable/references.html](https://pytorch.org/audio/stable/references.html) id: totrans-1 prefs: - PREF_BQ type: TYPE_NORMAL zh: '[https://pytorch.org/audio/stable/references.html](https://pytorch.org/audio/stable/references.html)' - en: '[Yes]' id: totrans-2 prefs: [] type: TYPE_NORMAL zh: '[Yes]' - en: 'Yesno. URL: [http://www.openslr.org/1/](http://www.openslr.org/1/).' id: totrans-3 prefs: [] type: TYPE_NORMAL zh: Yesno。网址:[http://www.openslr.org/1/](http://www.openslr.org/1/)。 - en: '[AB79]' id: totrans-4 prefs: [] type: TYPE_NORMAL zh: '[AB79]' - en: Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. *The Journal of the Acoustical Society of America*, 65(4):943–950, 1979. id: totrans-5 prefs: [] type: TYPE_NORMAL zh: Jont B Allen和David A Berkley。用于高效模拟小房间声学的图像方法。*美国声学学会杂志*,65(4):943-950,1979年。 - en: '[ABD+20]' id: totrans-6 prefs: [] type: TYPE_NORMAL zh: '[ABD+20]' - en: 'Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020\. [arXiv:1912.06670](https://arxiv.org/abs/1912.06670).' id: totrans-7 prefs: [] type: TYPE_NORMAL zh: Rosana Ardila,Megan Branson,Kelly Davis,Michael Henretty,Michael Kohler,Josh Meyer,Reuben Morais,Lindsay Saunders,Francis M. Tyers和Gregor Weber。Common voice:一个大规模多语言语音语料库。2020年。[arXiv:1912.06670](https://arxiv.org/abs/1912.06670)。 - en: '[BWT+21]' id: totrans-8 prefs: [] type: TYPE_NORMAL zh: '[BWT+21]' - en: 'Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, and others. Xls-r: self-supervised cross-lingual speech representation learning at scale. *arXiv preprint arXiv:2111.09296*, 2021.' id: totrans-9 prefs: [] type: TYPE_NORMAL zh: Arun Babu,王长翰,Andros Tjandra,Kushal Lakhotia,徐前通,Naman Goyal,Kritika Singh,Patrick von Platen,Yatharth Saraf,Juan Pino等人。Xls-r:规模化的自监督跨语言语音表示学习。*arXiv预印本arXiv:2111.09296*,2021年。 - en: '[BZMA20]' id: totrans-10 prefs: [] type: TYPE_NORMAL zh: '[BZMA20]' - en: 'Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020\. [arXiv:2006.11477](https://arxiv.org/abs/2006.11477).' id: totrans-11 prefs: [] type: TYPE_NORMAL zh: Alexei Baevski,Henry Zhou,Abdelrahman Mohamed和Michael Auli。Wav2vec 2.0:一种用于自监督学习语音表示的框架。2020年。[arXiv:2006.11477](https://arxiv.org/abs/2006.11477)。 - en: '[BBL+08]' id: totrans-12 prefs: [] type: TYPE_NORMAL zh: '[BBL+08]' - en: 'Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: interactive emotional dyadic motion capture database. *Language Resources and Evaluation*, 42:335–359, 12 2008\. [doi:10.1007/s10579-008-9076-6](https://doi.org/10.1007/s10579-008-9076-6).' id: totrans-13 prefs: [] type: TYPE_NORMAL zh: Carlos Busso,Murtaza Bulut,李志俊,Abe Kazemzadeh,Emily Mower Provost,Samuel Kim,Jeannette Chang,李成博,Shrikanth Narayanan。Iemocap:交互式情感二元动作捕捉数据库。*语言资源与评估*,42:335-359,2008年12月。[doi:10.1007/s10579-008-9076-6](https://doi.org/10.1007/s10579-008-9076-6)。 - en: '[Cap69]' id: totrans-14 prefs: [] type: TYPE_NORMAL zh: '[Cap69]' - en: Jack Capon. High-resolution frequency-wavenumber spectrum analysis. 
*Proceedings of the IEEE*, 57(8):1408–1418, 1969. id: totrans-15 prefs: [] type: TYPE_NORMAL zh: Jack Capon。高分辨率频率-波数谱分析。*IEEE会刊*,57(8):1408-1418,1969年。 - en: '[CDiGangiB+21]' id: totrans-16 prefs: [] type: TYPE_NORMAL zh: '[CDiGangiB+21]' - en: 'Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. Must-c: a multilingual corpus for end-to-end speech translation. *Computer Speech & Language*, 66:101155, 2021\. URL: [https://www.sciencedirect.com/science/article/pii/S0885230820300887](https://www.sciencedirect.com/science/article/pii/S0885230820300887), [doi:10.1016/j.csl.2020.101155](https://doi.org/10.1016/j.csl.2020.101155).' id: totrans-17 prefs: [] type: TYPE_NORMAL zh: Roldano Cattoni,Mattia Antonino Di Gangi,Luisa Bentivogli,Matteo Negri和Marco Turchi。Must-c:用于端到端语音翻译的多语言语料库。*计算机语音与语言*,66:101155,2021年。网址:[https://www.sciencedirect.com/science/article/pii/S0885230820300887](https://www.sciencedirect.com/science/article/pii/S0885230820300887),[doi:10.1016/j.csl.2020.101155](https://doi.org/10.1016/j.csl.2020.101155)。 - en: '[CCW+21]' id: totrans-18 prefs: [] type: TYPE_NORMAL zh: '[CCW+21]' - en: 'Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In *Proc. Interspeech 2021*. 2021.' id: totrans-19 prefs: [] type: TYPE_NORMAL zh: Guoguo Chen,柴树洲,王冠波,杜佳宇,张伟强,翁超,苏丹,Daniel Povey,Jan Trmal,张俊博,金明杰,Sanjeev Khudanpur,Shinji Watanabe,赵帅江,邹伟,李相刚,姚旭晨,王永庆,王玉军,尤赵,严志勇。Gigaspeech:一个不断发展的、多领域的带有10000小时转录音频的自动语音识别语料库。在*Interspeech 2021*会议上。2021年。 - en: '[CWC+22]' id: totrans-20 prefs: [] type: TYPE_NORMAL zh: '[CWC+22]' - en: 'Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and others. Wavlm: large-scale self-supervised pre-training for full stack speech processing. *IEEE Journal of Selected Topics in Signal Processing*, 16(6):1505–1518, 2022.' id: totrans-21 prefs: [] type: TYPE_NORMAL zh: 陈三元,王程毅,陈正阳,吴宇,刘树杰,陈卓,李金宇,神田直之,吉冈拓也,肖雄等人。Wavlm:用于全栈语音处理的大规模自监督预训练。*IEEE信号处理精选主题期刊*,16(6):1505-1518,2022年。 - en: '[CPS16]' id: totrans-22 prefs: [] type: TYPE_NORMAL zh: '[CPS16]' - en: 'Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016\. [arXiv:1609.03193](https://arxiv.org/abs/1609.03193).' id: totrans-23 prefs: [] type: TYPE_NORMAL zh: Ronan Collobert,Christian Puhrsch和Gabriel Synnaeve。Wav2letter:一种端到端的基于卷积神经网络的语音识别系统。2016年。[arXiv:1609.03193](https://arxiv.org/abs/1609.03193)。 - en: '[CBC+20]' id: totrans-24 prefs: [] type: TYPE_NORMAL zh: '[CBC+20]' - en: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2020\. [arXiv:2006.13979](https://arxiv.org/abs/2006.13979). id: totrans-25 prefs: [] type: TYPE_NORMAL zh: Alexis Conneau,Alexei Baevski,Ronan Collobert,Abdelrahman Mohamed和Michael Auli。用于语音识别的无监督跨语言表示学习。2020年。[arXiv:2006.13979](https://arxiv.org/abs/2006.13979)。 - en: '[CY21]' id: totrans-26 prefs: [] type: TYPE_NORMAL zh: '[CY21]' - en: Erica Cooper and Junichi Yamagishi.
How do voices from past speech synthesis challenges compare today? *arXiv preprint arXiv:2105.02373*, 2021. id: totrans-27 prefs: [] type: TYPE_NORMAL zh: Erica Cooper和山岸纯一。过去语音合成挑战中的声音如何与今天相比?*arXiv预印本arXiv:2105.02373*,2021年。 - en: '[CPC+20]' id: totrans-28 prefs: [] type: TYPE_NORMAL zh: '[CPC+20]' - en: 'Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. Librimix: an open-source dataset for generalizable speech separation. 2020\. [arXiv:2005.11262](https://arxiv.org/abs/2005.11262).' id: totrans-29 prefs: [] type: TYPE_NORMAL zh: Joris Cosentino,Manuel Pariente,Samuele Cornell,Antoine Deleforge和Emmanuel Vincent。Librimix:一个用于通用语音分离的开源数据集。2020年。[arXiv:2005.11262](https://arxiv.org/abs/2005.11262)。 - en: '[CSB+18]' id: totrans-30 prefs: [] type: TYPE_NORMAL zh: '[CSB+18]' - en: 'Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, and others. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. *arXiv preprint arXiv:1805.10190*, 2018.' id: totrans-31 prefs: [] type: TYPE_NORMAL zh: Alice Coucke,Alaa Saade,Adrien Ball,Théodore Bluche,Alexandre Caulier,David Leroy,Clément Doumouro,Thibault Gisselbrecht,Francesco Caltagirone,Thibaut Lavril等人。Snips语音平台:一种用于私密设计语音界面的嵌入式口语理解系统。*arXiv预印本arXiv:1805.10190*,2018年。 - en: '[DL82]' id: totrans-32 prefs: [] type: TYPE_NORMAL zh: '[DL82]' - en: DC Dowson and BV666017 Landau. The fréchet distance between multivariate normal distributions. *Journal of multivariate analysis*, 12(3):450–455, 1982. id: totrans-33 prefs: [] type: TYPE_NORMAL zh: DC道森和BV666017兰道。多元正态分布之间的弗雷歇距离。*多元分析杂志*,12(3):450-455,1982年。 - en: '[Defossez21]' id: totrans-34 prefs: [] type: TYPE_NORMAL zh: '[Defossez21]' - en: Alexandre Défossez. Hybrid spectrogram and waveform source separation. In *Proceedings of the ISMIR 2021 Workshop on Music Source Separation*. 2021. id: totrans-35 prefs: [] type: TYPE_NORMAL zh: 亚历山大·德福塞。混合谱图和波形源分离。在*ISMIR 2021音乐源分离研讨会论文集*中。2021年。 - en: '[GKRR14]' id: totrans-36 prefs: [] type: TYPE_NORMAL zh: '[GKRR14]' - en: 'Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In *SLTU*. 2014.' id: totrans-37 prefs: [] type: TYPE_NORMAL zh: 马克·约翰·弗朗西斯·盖尔斯、凯特·尼尔、安东·拉格尼和沙克提·普拉萨德·拉特。低资源语言的语音识别和关键词检测:剑桥大学babel项目研究。在*SLTU*中。2014年。 - en: '[Gra12]' id: totrans-38 prefs: [] type: TYPE_NORMAL zh: '[Gra12]' - en: Alex Graves. Sequence transduction with recurrent neural networks. 2012\. [arXiv:1211.3711](https://arxiv.org/abs/1211.3711). id: totrans-39 prefs: [] type: TYPE_NORMAL zh: 亚历克斯·格雷夫斯。使用递归神经网络进行序列转导。2012年。[arXiv:1211.3711](https://arxiv.org/abs/1211.3711)。 - en: '[GL83]' id: totrans-40 prefs: [] type: TYPE_NORMAL zh: '[GL83]' - en: D. Griffin and Jae Lim. Signal estimation from modified short-time fourier transform. In *ICASSP '83\. IEEE International Conference on Acoustics, Speech, and Signal Processing*, volume 8, 804–807\. 1983\. [doi:10.1109/ICASSP.1983.1172092](https://doi.org/10.1109/ICASSP.1983.1172092). 
id: totrans-41 prefs: [] type: TYPE_NORMAL zh: D.格里芬和林杰。从修改后的短时傅里叶变换中估计信号。在*ICASSP '83。IEEE国际声学、语音和信号处理会议*中,卷8,804-807。1983年。[doi:10.1109/ICASSP.1983.1172092](https://doi.org/10.1109/ICASSP.1983.1172092)。 - en: '[GQC+20]' id: totrans-42 prefs: [] type: TYPE_NORMAL zh: '[GQC+20]' - en: 'Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: convolution-augmented transformer for speech recognition. 2020\. [arXiv:2005.08100](https://arxiv.org/abs/2005.08100).' id: totrans-43 prefs: [] type: TYPE_NORMAL zh: 安莫尔·古拉蒂、詹姆斯·秦、邱中成、尼基·帕马尔、张宇、余佳辉、韩伟、王世博、张正东、吴永辉和庞若明。Conformer:用于语音识别的卷积增强Transformer。2020年。[arXiv:2005.08100](https://arxiv.org/abs/2005.08100)。 - en: '[HCC+14]' id: totrans-44 prefs: [] type: TYPE_NORMAL zh: '[HCC+14]' - en: 'Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014\. [arXiv:1412.5567](https://arxiv.org/abs/1412.5567).' id: totrans-45 prefs: [] type: TYPE_NORMAL zh: 奥尼·汉农、卡尔·凯斯、贾里德·卡斯珀、布莱恩·卡坦扎罗、格雷格·迪阿莫斯、埃里希·埃尔森、瑞安·普伦格、桑杰夫·萨蒂什、舒博·森古普塔、亚当·科茨和安德鲁·Y. 吴。深度语音:扩展端到端语音识别。2014年。[arXiv:1412.5567](https://arxiv.org/abs/1412.5567)。 - en: '[HCE+17]' id: totrans-46 prefs: [] type: TYPE_NORMAL zh: '[HCE+17]' - en: 'Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In *International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. 2017\. URL: [https://arxiv.org/abs/1609.09430](https://arxiv.org/abs/1609.09430).' id: totrans-47 prefs: [] type: TYPE_NORMAL zh: 肖恩·赫尔希、索里什·乔杜里、丹尼尔·P. W. 艾利斯、约特·F. 格梅克、阿伦·詹森、查宁·摩尔、马诺杰·普拉卡尔、德文·普拉特、里夫·A. 索罗斯、布莱恩·塞伯尔德、马尔科姆·斯兰尼、罗恩·韦斯和凯文·威尔逊。用于大规模音频分类的CNN架构。在*国际声学、语音和信号处理会议(ICASSP)*中。2017年。网址:[https://arxiv.org/abs/1609.09430](https://arxiv.org/abs/1609.09430)。 - en: '[HIA+17]' id: totrans-48 prefs: [] type: TYPE_NORMAL zh: '[HIA+17]' - en: Takuya Higuchi, Nobutaka Ito, Shoko Araki, Takuya Yoshioka, Marc Delcroix, and Tomohiro Nakatani. Online mvdr beamformer based on complex gaussian mixture model with spatial prior for noise robust asr. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 25(4):780–793, 2017. id: totrans-49 prefs: [] type: TYPE_NORMAL zh: 樋口拓也、伊藤伸孝、荒木祥子、吉冈拓也、马克·德尔克罗伊和中谷智博。基于具有空间先验的复高斯混合模型的在线mvdr波束形成器,用于噪声鲁棒asr。*IEEE/ACM音频、语音和语言处理汇刊*,25(4):780-793,2017年。 - en: '[HIYN16]' id: totrans-50 prefs: [] type: TYPE_NORMAL zh: '[HIYN16]' - en: Takuya Higuchi, Nobutaka Ito, Takuya Yoshioka, and Tomohiro Nakatani. Robust mvdr beamforming using time-frequency masks for online/offline asr in noise. In *2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 5210–5214\. IEEE, 2016. id: totrans-51 prefs: [] type: TYPE_NORMAL zh: 樋口拓也、伊藤伸孝、吉冈拓也和中谷智博。使用时频掩模进行在线/离线噪声下的鲁棒mvdr波束形成。在*2016年IEEE国际声学、语音和信号处理会议(ICASSP)*中,5210-5214。IEEE,2016年。 - en: '[HBT+21]' id: totrans-52 prefs: [] type: TYPE_NORMAL zh: '[HBT+21]' - en: 'Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: self-supervised speech representation learning by masked prediction of hidden units. 2021\. [arXiv:2106.07447](https://arxiv.org/abs/2106.07447).'
id: totrans-53 prefs: [] type: TYPE_NORMAL zh: 徐伟宁、本杰明·博尔特、蔡耀宏、库沙尔·拉克霍蒂亚、鲁斯兰·萨拉胡特迪诺夫和阿卜杜勒拉曼·穆罕默德。Hubert:通过隐藏单元的掩码预测进行自监督语音表示学习。2021年。[arXiv:2106.07447](https://arxiv.org/abs/2106.07447)。 - en: '[IJ17]' id: totrans-54 prefs: [] type: TYPE_NORMAL zh: '[IJ17]' - en: Keith Ito and Linda Johnson. The lj speech dataset. [https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/), 2017. id: totrans-55 prefs: [] type: TYPE_NORMAL zh: 基思伊托和琳达约翰逊。LJ语音数据集。[https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/),2017年。 - en: '[KPL+22]' id: totrans-56 prefs: [] type: TYPE_NORMAL zh: '[KPL+22]' - en: 'Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for machine learning. *arXiv preprint arXiv:2201.12465*, 2022.' id: totrans-57 prefs: [] type: TYPE_NORMAL zh: 雅各布·卡恩、维尼尔·普拉塔普、塔蒂亚娜·利霍马年科、钱通徐、奥尼·汉农、杰夫·凯、帕登·托马塞洛、安·李、埃杜瓦·格雷夫、吉拉德·阿维多夫等。Flashlight:为机器学习工具创新提供支持。*arXiv预印本arXiv:2201.12465*,2022年。 - en: '[KES+18a]' id: totrans-58 prefs: [] type: TYPE_NORMAL zh: '[KES+18a]' - en: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018\. [arXiv:1802.08435](https://arxiv.org/abs/1802.08435). id: totrans-59 prefs: [] type: TYPE_NORMAL zh: 纳尔·卡尔布伦纳、埃里希·埃尔森、卡伦·西蒙扬、塞布·努里、诺曼·卡萨格兰德、爱德华·洛克哈特、弗洛里安·斯蒂姆伯格、亚伦·范登·奥尔德、桑德·迪勒曼和科雷·卡武克乔格卢。高效的神经音频合成。2018年。[arXiv:1802.08435](https://arxiv.org/abs/1802.08435)。 - en: '[KES+18b]' id: totrans-60 prefs: [] type: TYPE_NORMAL zh: '[KES+18b]' - en: 'Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. *CoRR*, 2018\. URL: [http://arxiv.org/abs/1802.08435](http://arxiv.org/abs/1802.08435), [arXiv:1802.08435](https://arxiv.org/abs/1802.08435).' id: totrans-61 prefs: [] type: TYPE_NORMAL zh: 纳尔·卡尔布伦纳、埃里希·埃尔森、卡伦·西蒙扬、塞布·努里、诺曼·卡萨格兰德、爱德华·洛克哈特、弗洛里安·斯蒂姆伯格、阿伦·范登·奥尔德、桑德·迪勒曼和科雷·卡武克乔格卢。高效的神经音频合成。*CoRR*,2018年。网址:[http://arxiv.org/abs/1802.08435](http://arxiv.org/abs/1802.08435),[arXiv:1802.08435](https://arxiv.org/abs/1802.08435)。 - en: '[KPPK15]' id: totrans-62 prefs: [] type: TYPE_NORMAL zh: '[KPPK15]' - en: Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. In *Proc. Interspeech 2015*, 3586–3589\. 2015\. [doi:10.21437/Interspeech.2015-711](https://doi.org/10.21437/Interspeech.2015-711). id: totrans-63 prefs: [] type: TYPE_NORMAL zh: Tom Ko,Vijayaditya Peddinti,Daniel Povey和Sanjeev Khudanpur。用于语音识别的音频增强。在*Interspeech 2015会议论文集*中,3586-3589。2015年。[doi:10.21437/Interspeech.2015-711](https://doi.org/10.21437/Interspeech.2015-711)。 - en: '[KBV03]' id: totrans-64 prefs: [] type: TYPE_NORMAL zh: '[KBV03]' - en: John Kominek, Alan W Black, and Ver Ver. Cmu arctic databases for speech synthesis. Technical Report, 2003. id: totrans-65 prefs: [] type: TYPE_NORMAL zh: John Kominek,Alan W Black和Ver Ver。用于语音合成的CMU北极数据库。技术报告,2003年。 - en: '[KKB20]' id: totrans-66 prefs: [] type: TYPE_NORMAL zh: '[KKB20]' - en: 'Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. 
Lin, editors, *Advances in Neural Information Processing Systems*, volume 33, 17022–17033\. Curran Associates, Inc., 2020\. URL: [https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf](https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf).' id: totrans-67 prefs: [] type: TYPE_NORMAL zh: Jungil Kong,Jaehyeon Kim和Jaekyoung Bae。Hifi-gan:用于高效和高保真度语音合成的生成对抗网络。在H. Larochelle,M. Ranzato,R. Hadsell,M.F. Balcan和H. Lin编辑的*神经信息处理系统进展*中,卷33,17022-17033。Curran Associates, Inc.,2020年。网址:[https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf](https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf)。 - en: '[KTN+23]' id: totrans-68 prefs: [] type: TYPE_NORMAL zh: '[KTN+23]' - en: 'Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. *arXiv preprint arXiv:2304.01448*, 2023.' id: totrans-69 prefs: [] type: TYPE_NORMAL zh: Anurag Kumar,Ke Tan,Zhaoheng Ni,Pranay Manocha,Xiaohui Zhang,Ethan Henderson和Buye Xu。Torchaudio-squim:Torchaudio中无参考语音质量和可懂度测量。*arXiv预印本arXiv:2304.01448*,2023年。 - en: '[LRI+19]' id: totrans-70 prefs: [] type: TYPE_NORMAL zh: '[LRI+19]' - en: Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. Speech model pre-training for end-to-end spoken language understanding. In Gernot Kubin and Zdravko Kacic, editors, *Proc. of Interspeech*, 814–818\. 2019. id: totrans-71 prefs: [] type: TYPE_NORMAL zh: Loren Lugosch,Mirco Ravanelli,Patrick Ignoto,Vikrant Singh Tomar和Yoshua Bengio。端到端口语言理解的语音模型预训练。在Gernot Kubin和Zdravko Kacic编辑的*Interspeech会议论文集*中,814-818。2019年。 - en: '[LM19]' id: totrans-72 prefs: [] type: TYPE_NORMAL zh: '[LM19]' - en: 'Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 27(8):1256–1266, Aug 2019\. URL: [http://dx.doi.org/10.1109/TASLP.2019.2915167](http://dx.doi.org/10.1109/TASLP.2019.2915167), [doi:10.1109/taslp.2019.2915167](https://doi.org/10.1109/taslp.2019.2915167).' id: totrans-73 prefs: [] type: TYPE_NORMAL zh: Yi Luo和Nima Mesgarani。Conv-tasnet:超越理想的时频幅度屏蔽进行语音分离。*IEEE/ACM音频、语音和语言处理交易*,27(8):1256-1266,2019年8月。网址:[http://dx.doi.org/10.1109/TASLP.2019.2915167](http://dx.doi.org/10.1109/TASLP.2019.2915167),[doi:10.1109/taslp.2019.2915167](https://doi.org/10.1109/taslp.2019.2915167)。 - en: '[MK22]' id: totrans-74 prefs: [] type: TYPE_NORMAL zh: '[MK22]' - en: Pranay Manocha and Anurag Kumar. Speech quality assessment through mos using non-matching references. *arXiv preprint arXiv:2206.12285*, 2022. id: totrans-75 prefs: [] type: TYPE_NORMAL zh: Pranay Manocha和Anurag Kumar。使用非匹配参考进行MOS的语音质量评估。*arXiv预印本arXiv:2206.12285*,2022年。 - en: '[MRFB+15]' id: totrans-76 prefs: [] type: TYPE_NORMAL zh: '[MRFB+15]' - en: 'Xavier Anguera Miro, Luis Javier Rodriguez-Fuentes, Andi Buzo, Florian Metze, Igor Szoke, and Mikel Peñagarikano. Quesst2014: evaluating query-by-example speech search in a zero-resource setting with real-life queries. *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, pages 5833–5837, 2015.' 
id: totrans-77 prefs: [] type: TYPE_NORMAL zh: Xavier Anguera Miro,Luis Javier Rodriguez-Fuentes,Andi Buzo,Florian Metze,Igor Szoke和Mikel Peñagarikano。Quesst2014:在零资源环境中使用真实查询评估基于示例语音搜索。*2015年IEEE国际声学、语音和信号处理会议(ICASSP)*,2015年,页码5833-5837。 - en: '[MPG29]' id: totrans-78 prefs: [] type: TYPE_NORMAL zh: '[MPG29]' - en: RV Mises and Hilda Pollaczek-Geiringer. Praktische verfahren der gleichungsauflösung. *ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik*, 9(1):58–77, 1929. id: totrans-79 prefs: [] type: TYPE_NORMAL zh: RV Mises和Hilda Pollaczek-Geiringer。等式求解的实用方法。*ZAMM-应用数学和力学杂志/应用数学和力学杂志*,9(1):58-77,1929年。 - en: '[Mys14]' id: totrans-80 prefs: [] type: TYPE_NORMAL zh: '[Mys14]' - en: Gautham J Mysore. Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. *IEEE Signal Processing Letters*, 22(8):1006–1010, 2014. id: totrans-81 prefs: [] type: TYPE_NORMAL zh: Gautham J Mysore。我们能否自动将在真实环境中使用普通消费设备录制的语音转换为专业制作质量的语音?—数据集、见解和挑战。*IEEE信号处理通信*,22(8):1006-1010,2014年。 - en: '[NCZ17]' id: totrans-82 prefs: [] type: TYPE_NORMAL zh: '[NCZ17]' - en: 'Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. *arXiv preprint arXiv:1706.08612*, 2017.' id: totrans-83 prefs: [] type: TYPE_NORMAL zh: Arsha Nagrani,Joon Son Chung和Andrew Zisserman。Voxceleb:一个大规模的说话者识别数据集。*arXiv预印本arXiv:1706.08612*,2017年。 - en: '[PCPK15]' id: totrans-84 prefs: [] type: TYPE_NORMAL zh: '[PCPK15]' - en: 'Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 5206–5210\. 2015\. [doi:10.1109/ICASSP.2015.7178964](https://doi.org/10.1109/ICASSP.2015.7178964).' id: totrans-85 prefs: [] type: TYPE_NORMAL zh: Vassil Panayotov,Guoguo Chen,Daniel Povey和Sanjeev Khudanpur。Librispeech:基于公共领域有声书的ASR语料库。在*2015年IEEE国际声学、语音和信号处理会议(ICASSP)*中,卷,5206-5210。2015年。[doi:10.1109/ICASSP.2015.7178964](https://doi.org/10.1109/ICASSP.2015.7178964)。 - en: '[PCZ+19]' id: totrans-86 prefs: [] type: TYPE_NORMAL zh: '[PCZ+19]' - en: 'Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Specaugment: a simple data augmentation method for automatic speech recognition. *Interspeech 2019*, Sep 2019\. URL: [http://dx.doi.org/10.21437/Interspeech.2019-2680](http://dx.doi.org/10.21437/Interspeech.2019-2680), [doi:10.21437/interspeech.2019-2680](https://doi.org/10.21437/interspeech.2019-2680).' id: totrans-87 prefs: [] type: TYPE_NORMAL zh: Daniel S. Park,William Chan,Yu Zhang,Chung-Cheng Chiu,Barret Zoph,Ekin D. Cubuk和Quoc V. Le。Specaugment:一种用于自动语音识别的简单数据增强方法。*Interspeech 2019*,2019年9月。网址:[http://dx.doi.org/10.21437/Interspeech.2019-2680](http://dx.doi.org/10.21437/Interspeech.2019-2680),[doi:10.21437/interspeech.2019-2680](https://doi.org/10.21437/interspeech.2019-2680)。 - en: '[PBS13]' id: totrans-88 prefs: [] type: TYPE_NORMAL zh: '[PBS13]' - en: Nathanaël Perraudin, Peter Balazs, and Peter L. Søndergaard. A fast griffin-lim algorithm. In *2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics*, volume, 1–4\. 2013\. [doi:10.1109/WASPAA.2013.6701851](https://doi.org/10.1109/WASPAA.2013.6701851). id: totrans-89 prefs: [] type: TYPE_NORMAL zh: Nathanaël Perraudin,Peter Balazs和Peter L. 
Søndergaard。一种快速的Griffin-Lim算法。在*2013年IEEE信号处理应用研讨会*中,卷,1-4。2013年。[doi:10.1109/WASPAA.2013.6701851](https://doi.org/10.1109/WASPAA.2013.6701851)。 - en: '[PTS+23]' id: totrans-90 prefs: [] type: TYPE_NORMAL zh: '[PTS+23]' - en: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. 2023\. [arXiv:2305.13516](https://arxiv.org/abs/2305.13516). id: totrans-91 prefs: [] type: TYPE_NORMAL zh: Vineel Pratap,Andros Tjandra,Bowen Shi,Paden Tomasello,Arun Babu,Sayani Kundu,Ali Elkahky,Zhaoheng Ni,Apoorv Vyas,Maryam Fazel-Zarandi,Alexei Baevski,Yossi Adi,张晓辉,徐伟宁,Alexis Conneau和Michael Auli。将语音技术扩展到1000多种语言。2023年。arXiv:2305.13516。 - en: '[PXS+20]' id: totrans-92 prefs: [] type: TYPE_NORMAL zh: '[PXS+20]' - en: 'Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. *Interspeech 2020*, Oct 2020\. URL: [http://dx.doi.org/10.21437/Interspeech.2020-2826](http://dx.doi.org/10.21437/Interspeech.2020-2826), [doi:10.21437/interspeech.2020-2826](https://doi.org/10.21437/interspeech.2020-2826).' id: totrans-93 prefs: [] type: TYPE_NORMAL zh: Vineel Pratap,Qiantong Xu,Anuroop Sriram,Gabriel Synnaeve和Ronan Collobert。MLS:用于语音研究的大规模多语言数据集。Interspeech 2020,2020年10月。URL:http://dx.doi.org/10.21437/Interspeech.2020-2826,doi:10.21437/interspeech.2020-2826。 - en: '[RLStoter+19]' id: totrans-94 prefs: [] type: TYPE_NORMAL zh: '[RLStoter+19]' - en: 'Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of musdb18\. December 2019\. URL: [https://doi.org/10.5281/zenodo.3338373](https://doi.org/10.5281/zenodo.3338373), [doi:10.5281/zenodo.3338373](https://doi.org/10.5281/zenodo.3338373).' id: totrans-95 prefs: [] type: TYPE_NORMAL zh: Zafar Rafii,Antoine Liutkus,Fabian-Robert Stöter,Stylianos Ioannis Mimilakis和Rachel Bittner。MUSDB18-HQ - musdb18的未压缩版本。2019年12月。URL:https://doi.org/10.5281/zenodo.3338373,doi:10.5281/zenodo.3338373。 - en: '[RGC+20]' id: totrans-96 prefs: [] type: TYPE_NORMAL zh: '[RGC+20]' - en: 'Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, and others. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. *arXiv preprint arXiv:2005.13981*, 2020.' id: totrans-97 prefs: [] type: TYPE_NORMAL zh: Chandan KA Reddy,Vishak Gopal,Ross Cutler,Ebrahim Beyrami,Roger Cheng,Harishchandra Dubey,Sergiy Matusevych,Robert Aichner,Ashkan Aazami,Sebastian Braun等人。Interspeech 2020深度降噪挑战:数据集,主观测试框架和挑战结果。arXiv预印本arXiv:2005.13981,2020年。 - en: '[RDelegliseEsteve12]' id: totrans-98 prefs: [] type: TYPE_NORMAL zh: '[RDelegliseEsteve12]' - en: 'Anthony Rousseau, Paul Deléglise, and Yannick Estève. Ted-lium: an automatic speech recognition dedicated corpus. In *Conference on Language Resources and Evaluation (LREC)*, 125–129\. 2012.' id: totrans-99 prefs: [] type: TYPE_NORMAL zh: 安东尼·鲁索,保罗·德勒格利斯和亚尼克·埃斯特韦。Ted-lium:一种专用于自动语音识别的语料库。在语言资源和评估会议(LREC)中,125-129页。2012年。 - en: '[SY18]' id: totrans-100 prefs: [] type: TYPE_NORMAL zh: '[SY18]' - en: Seyyed Saeed Sarfjoo and Junichi Yamagishi. Device recorded vctk (small subset version). 2018. 
id: totrans-101 prefs: [] type: TYPE_NORMAL zh: Seyyed Saeed Sarfjoo和山岸淳一。设备录制的vctk(小型子集版本)。2018年。 - en: '[SBDokmanic18]' id: totrans-102 prefs: [] type: TYPE_NORMAL zh: '[SBDokmanic18]' - en: 'Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In *2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)*, 351–355\. IEEE, 2018.' id: totrans-103 prefs: [] type: TYPE_NORMAL zh: 罗宾·施伯勒,埃里克·贝扎姆和伊万·多克曼尼奇。Pyroomacoustics:用于音频房间模拟和阵列处理算法的Python软件包。在2018年IEEE国际声学、语音和信号处理会议(ICASSP)中,351-355页。IEEE,2018年。 - en: '[SPW+18]' id: totrans-104 prefs: [] type: TYPE_NORMAL zh: '[SPW+18]' - en: Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, and others. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In *2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 4779–4783\. IEEE, 2018. id: totrans-105 prefs: [] type: TYPE_NORMAL zh: 乔纳森·申,Ruoming Pang,Ron J Weiss,Mike Schuster,Navdeep Jaitly,Zongheng Yang,Zhifeng Chen,张宇,王宇轩,Rj Skerrv-Ryan等人。通过在mel频谱图预测上对wavenet进行条件化的自然tts合成。在2018年IEEE国际声学、语音和信号处理会议(ICASSP)中,4779-4783页。IEEE,2018年。 - en: '[SWW+21]' id: totrans-106 prefs: [] type: TYPE_NORMAL zh: '[SWW+21]' - en: 'Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In *ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 6783–6787\. 2021.' id: totrans-107 prefs: [] type: TYPE_NORMAL zh: 杨洋石,王永强,吴春阳,叶青峰,陈俊,张弗兰克,勒杜克和迈克·塞尔策。Emformer:用于低延迟流式语音识别的高效内存变压器基础声学模型。在ICASSP 2021 - 2021年IEEE国际声学、语音和信号处理会议(ICASSP)中,6783-6787页。2021年。 - en: '[SWW+22]' id: totrans-108 prefs: [] type: TYPE_NORMAL zh: '[SWW+22]' - en: Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, and Mike Seltzer. Streaming transformer transducer based speech recognition using non-causal convolution. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8277–8281\. 2022\. [doi:10.1109/ICASSP43922.2022.9747706](https://doi.org/10.1109/ICASSP43922.2022.9747706). id: totrans-109 prefs: [] type: TYPE_NORMAL zh: 杨洋石,春阳吴,迪林王,Alex Xiao,Jay Mahadeokar,张晓辉,刘春喜,李克,尚冠元,瓦伦·纳加拉贾,奥兹莱姆·卡林利和迈克·塞尔策。基于非因果卷积的流式变压器传导器语音识别。在ICASSP 2022 - 2022年IEEE国际声学、语音和信号处理会议(ICASSP)中,卷,8277-8281页。2022年。doi:10.1109/ICASSP43922.2022.9747706。 - en: '[Smi20]' id: totrans-110 prefs: [] type: TYPE_NORMAL zh: '[Smi20]' - en: 'Julius O. Smith. Digital audio resampling home page "theory of ideal bandlimited interpolation" section. September 2020\. URL: [https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html](https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html).' id: totrans-111 prefs: [] type: TYPE_NORMAL zh: 朱利叶斯·O·史密斯。数字音频重采样主页“理想带限插值理论”部分。2020年9月。URL:https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html。 - en: '[SCP15]' id: totrans-112 prefs: [] type: TYPE_NORMAL zh: '[SCP15]' - en: 'David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A Music, Speech, and Noise Corpus. 2015\. arXiv:1510.08484v1\. [arXiv:1510.08484](https://arxiv.org/abs/1510.08484).' 
id: totrans-113 prefs: [] type: TYPE_NORMAL zh: 大卫·斯奈德,陈国国和丹尼尔·波维。MUSAN:一个音乐、语音和噪声语料库。2015年。arXiv:1510.08484v1。arXiv:1510.08484。 - en: '[SBA09]' id: totrans-114 prefs: [] type: TYPE_NORMAL zh: '[SBA09]' - en: Mehrez Souden, Jacob Benesty, and Sofiene Affes. On optimal frequency-domain multichannel linear filtering for noise reduction. In *IEEE Transactions on audio, speech, and language processing*, volume 18, 260–276\. IEEE, 2009. id: totrans-115 prefs: [] type: TYPE_NORMAL zh: Mehrez Souden,Jacob Benesty和Sofiene Affes。关于噪声降低的最佳频域多通道线性滤波。在IEEE音频、语音和语言处理交易中,卷18,260-276页。IEEE,2009年。 - en: '[SWT+22]' id: totrans-116 prefs: [] type: TYPE_NORMAL zh: '[SWT+22]' - en: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, and Yatharth Saraf. Conformer-based self-supervised learning for non-speech audio tasks. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8862–8866\. 2022\. [doi:10.1109/ICASSP43922.2022.9746490](https://doi.org/10.1109/ICASSP43922.2022.9746490). id: totrans-117 prefs: [] type: TYPE_NORMAL zh: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, and Yatharth Saraf. Conformer-based self-supervised learning for non-speech audio tasks. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8862–8866\. 2022\. [doi:10.1109/ICASSP43922.2022.9746490](https://doi.org/10.1109/ICASSP43922.2022.9746490). - en: '[TEC01]' id: totrans-118 prefs: [] type: TYPE_NORMAL zh: '[TEC01]' - en: 'George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals. 2001\. URL: [http://ismir2001.ismir.net/pdf/tzanetakis.pdf](http://ismir2001.ismir.net/pdf/tzanetakis.pdf).' id: totrans-119 prefs: [] type: TYPE_NORMAL zh: 'George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals. 2001\. URL: [http://ismir2001.ismir.net/pdf/tzanetakis.pdf](http://ismir2001.ismir.net/pdf/tzanetakis.pdf).' - en: '[VAlumae21]' id: totrans-120 prefs: [] type: TYPE_NORMAL zh: '[VAlumae21]' - en: 'Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In *2021 IEEE Spoken Language Technology Workshop (SLT)*, 652–658\. IEEE, 2021.' id: totrans-121 prefs: [] type: TYPE_NORMAL zh: 'Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In *2021 IEEE Spoken Language Technology Workshop (SLT)*, 652–658\. IEEE, 2021.' - en: '[WRiviereL+21]' id: totrans-122 prefs: [] type: TYPE_NORMAL zh: '[WRiviereL+21]' - en: 'Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. *CoRR*, 2021\. URL: [https://arxiv.org/abs/2101.00390](https://arxiv.org/abs/2101.00390), [arXiv:2101.00390](https://arxiv.org/abs/2101.00390).' id: totrans-123 prefs: [] type: TYPE_NORMAL zh: 'Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. *CoRR*, 2021\. URL: [https://arxiv.org/abs/2101.00390](https://arxiv.org/abs/2101.00390), [arXiv:2101.00390](https://arxiv.org/abs/2101.00390).' 
- en: '[Wei98]' id: totrans-124 prefs: [] type: TYPE_NORMAL zh: '[Wei98]' - en: 'R.L. Weide. The carnegie mellon pronuncing dictionary. 1998\. URL: [http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).' id: totrans-125 prefs: [] type: TYPE_NORMAL zh: 'R.L. Weide. The carnegie mellon pronuncing dictionary. 1998\. URL: [http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).' - en: '[YVM19]' id: totrans-126 prefs: [] type: TYPE_NORMAL zh: '[YVM19]' - en: 'Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019\. [doi:10.7488/ds/2645](https://doi.org/10.7488/ds/2645).' id: totrans-127 prefs: [] type: TYPE_NORMAL zh: 'Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019\. [doi:10.7488/ds/2645](https://doi.org/10.7488/ds/2645).' - en: '[ZDC+19]' id: totrans-128 prefs: [] type: TYPE_NORMAL zh: '[ZDC+19]' - en: 'Heiga Zen, Viet-Trung Dang, Robert A. J. Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Z. Chen, and Yonghui Wu. Libritts: a corpus derived from librispeech for text-to-speech. *ArXiv*, 2019.' id: totrans-129 prefs: [] type: TYPE_NORMAL zh: 'Heiga Zen, Viet-Trung Dang, Robert A. J. Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Z. Chen, and Yonghui Wu. Libritts: a corpus derived from librispeech for text-to-speech. *ArXiv*, 2019.' - en: '[ZSN21]' id: totrans-130 prefs: [] type: TYPE_NORMAL zh: '[ZSN21]' - en: Albert Zeyer, Ralf Schlüter, and Hermann Ney. Why does ctc result in peaky behavior? 2021\. [arXiv:2105.14849](https://arxiv.org/abs/2105.14849). id: totrans-131 prefs: [] type: TYPE_NORMAL zh: Albert Zeyer, Ralf Schlüter, and Hermann Ney. Why does ctc result in peaky behavior? 2021\. [arXiv:2105.14849](https://arxiv.org/abs/2105.14849). - en: '[BrianMcFeeColinRaffelDawenLiang+15]' id: totrans-132 prefs: [] type: TYPE_NORMAL zh: '[BrianMcFeeColinRaffelDawenLiang+15]' - en: 'Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. Librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, *Proceedings of the 14th Python in Science Conference*, 18 – 24\. 2015\. [doi:10.25080/Majora-7b98e3ed-003](https://doi.org/10.25080/Majora-7b98e3ed-003).' id: totrans-133 prefs: [] type: TYPE_NORMAL zh: 'Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. Librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, *Proceedings of the 14th Python in Science Conference*, 18 – 24\. 2015\. [doi:10.25080/Majora-7b98e3ed-003](https://doi.org/10.25080/Majora-7b98e3ed-003).' - en: '[KahnRiviereZheng+20]' id: totrans-134 prefs: [] type: TYPE_NORMAL zh: '[KahnRiviereZheng+20]' - en: 'J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In *ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 7669–7673\. 2020\. [https://github.com/facebookresearch/libri-light](https://github.com/facebookresearch/libri-light).' id: totrans-135 prefs: [] type: TYPE_NORMAL zh: 'J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. 
Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In *ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 7669–7673\. 2020\. [https://github.com/facebookresearch/libri-light](https://github.com/facebookresearch/libri-light).' - en: '[Warden18]' id: totrans-136 prefs: [] type: TYPE_NORMAL zh: '[Warden18]' - en: 'P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. *ArXiv e-prints*, April 2018\. URL: [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209), [arXiv:1804.03209](https://arxiv.org/abs/1804.03209).' id: totrans-137 prefs: [] type: TYPE_NORMAL zh: 'P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. *ArXiv e-prints*, April 2018\. URL: [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209), [arXiv:1804.03209](https://arxiv.org/abs/1804.03209).' - en: '[Wikipediacontributors]' id: totrans-138 prefs: [] type: TYPE_NORMAL zh: '[Wikipediacontributors]' - en: 'Wikipedia contributors. Absorption (acoustics) — Wikipedia, the free encyclopedia. [Online]. URL: [https://en.wikipedia.org/wiki/Absorption_(acoustics)](https://en.wikipedia.org/wiki/Absorption_(acoustics)).' id: totrans-139 prefs: [] type: TYPE_NORMAL zh: 'Wikipedia contributors. Absorption (acoustics) — Wikipedia, the free encyclopedia. [Online]. URL: [https://en.wikipedia.org/wiki/Absorption_(acoustics)](https://en.wikipedia.org/wiki/Absorption_(acoustics)).'