- en: References id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL zh: 参考文献 - en: 原文:[https://pytorch.org/audio/stable/references.html](https://pytorch.org/audio/stable/references.html) id: totrans-1 prefs: - PREF_BQ type: TYPE_NORMAL zh: '[https://pytorch.org/audio/stable/references.html](https://pytorch.org/audio/stable/references.html)' - en: '[Yes]' id: totrans-2 prefs: [] type: TYPE_NORMAL zh: '[Yes]' - en: 'Yesno. URL: [http://www.openslr.org/1/](http://www.openslr.org/1/).' id: totrans-3 prefs: [] type: TYPE_NORMAL zh: Yesno。网址:[http://www.openslr.org/1/](http://www.openslr.org/1/)。 - en: '[AB79]' id: totrans-4 prefs: [] type: TYPE_NORMAL zh: '[AB79]' - en: Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. *The Journal of the Acoustical Society of America*, 65(4):943–950, 1979. id: totrans-5 prefs: [] type: TYPE_NORMAL zh: Jont B Allen和David A Berkley。用于高效模拟小房间声学的图像方法。*美国声学学会杂志*,65(4):943-950,1979年。 - en: '[ABD+20]' id: totrans-6 prefs: [] type: TYPE_NORMAL zh: '[ABD+20]' - en: 'Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: a massively-multilingual speech corpus. 2020\. [arXiv:1912.06670](https://arxiv.org/abs/1912.06670).' id: totrans-7 prefs: [] type: TYPE_NORMAL zh: Rosana Ardila,Megan Branson,Kelly Davis,Michael Henretty,Michael Kohler,Josh Meyer,Reuben Morais,Lindsay Saunders,Francis M. Tyers和Gregor Weber。Common voice:一个大规模多语言语音语料库。2020年。[arXiv:1912.06670](https://arxiv.org/abs/1912.06670)。 - en: '[BWT+21]' id: totrans-8 prefs: [] type: TYPE_NORMAL zh: '[BWT+21]' - en: 'Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, and others. Xls-r: self-supervised cross-lingual speech representation learning at scale. *arXiv preprint arXiv:2111.09296*, 2021.' id: totrans-9 prefs: [] type: TYPE_NORMAL zh: Arun Babu,王长翰,Andros Tjandra,Kushal Lakhotia,徐前通,Naman Goyal,Kritika Singh,Patrick von Platen,Yatharth Saraf,Juan Pino等人。Xls-r:规模化的自监督跨语言语音表示学习。*arXiv预印本arXiv:2111.09296*,2021年。 - en: '[BZMA20]' id: totrans-10 prefs: [] type: TYPE_NORMAL zh: '[BZMA20]' - en: 'Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020\. [arXiv:2006.11477](https://arxiv.org/abs/2006.11477).' id: totrans-11 prefs: [] type: TYPE_NORMAL zh: Alexei Baevski,Henry Zhou,Abdelrahman Mohamed和Michael Auli。Wav2vec 2.0:一种用于自监督学习语音表示的框架。2020年。[arXiv:2006.11477](https://arxiv.org/abs/2006.11477)。 - en: '[BBL+08]' id: totrans-12 prefs: [] type: TYPE_NORMAL zh: '[BBL+08]' - en: 'Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. Iemocap: interactive emotional dyadic motion capture database. *Language Resources and Evaluation*, 42:335–359, 12 2008\. [doi:10.1007/s10579-008-9076-6](https://doi.org/10.1007/s10579-008-9076-6).' id: totrans-13 prefs: [] type: TYPE_NORMAL zh: Carlos Busso,Murtaza Bulut,李志俊,Abe Kazemzadeh,Emily Mower Provost,Samuel Kim,Jeannette Chang,李成博,Shrikanth Narayanan。Iemocap:交互式情感二元动作捕捉数据库。*语言资源与评估*,42:335-359,2008年12月。[doi:10.1007/s10579-008-9076-6](https://doi.org/10.1007/s10579-008-9076-6)。 - en: '[Cap69]' id: totrans-14 prefs: [] type: TYPE_NORMAL zh: '[Cap69]' - en: Jack Capon. High-resolution frequency-wavenumber spectrum analysis. 
*Proceedings of the IEEE*, 57(8):1408–1418, 1969. id: totrans-15 prefs: [] type: TYPE_NORMAL zh: Jack Capon。高分辨率频率-波数谱分析。*IEEE会刊*,57(8):1408-1418,1969年。 - en: '[CDiGangiB+21]' id: totrans-16 prefs: [] type: TYPE_NORMAL zh: '[CDiGangiB+21]' - en: 'Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. Must-c: a multilingual corpus for end-to-end speech translation. *Computer Speech & Language*, 66:101155, 2021\. URL: [https://www.sciencedirect.com/science/article/pii/S0885230820300887](https://www.sciencedirect.com/science/article/pii/S0885230820300887), [doi:10.1016/j.csl.2020.101155](https://doi.org/10.1016/j.csl.2020.101155).' id: totrans-17 prefs: [] type: TYPE_NORMAL zh: Roldano Cattoni,Mattia Antonino Di Gangi,Luisa Bentivogli,Matteo Negri和Marco Turchi。Must-c:用于端到端语音翻译的多语言语料库。*计算机语音与语言*,66:101155,2021年。网址:[https://www.sciencedirect.com/science/article/pii/S0885230820300887](https://www.sciencedirect.com/science/article/pii/S0885230820300887),[doi:10.1016/j.csl.2020.101155](https://doi.org/10.1016/j.csl.2020.101155)。 - en: '[CCW+21]' id: totrans-18 prefs: [] type: TYPE_NORMAL zh: '[CCW+21]' - en: 'Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. Gigaspeech: an evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. In *Proc. Interspeech 2021*. 2021.' id: totrans-19 prefs: [] type: TYPE_NORMAL zh: Guoguo Chen,柴树洲,王冠波,杜佳宇,张伟强,翁超,苏丹,Daniel Povey,Jan Trmal,张俊博,金明杰,Sanjeev Khudanpur,Shinji Watanabe,赵帅江,邹伟,李相刚,姚旭晨,王永庆,王玉军,尤赵,严志勇。Gigaspeech:一个不断发展的、多领域的带有10000小时转录音频的自动语音识别语料库。在*Interspeech 2021*会议上。2021年。 - en: '[CWC+22]' id: totrans-20 prefs: [] type: TYPE_NORMAL zh: '[CWC+22]' - en: 'Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and others. Wavlm: large-scale self-supervised pre-training for full stack speech processing. *IEEE Journal of Selected Topics in Signal Processing*, 16(6):1505–1518, 2022.' id: totrans-21 prefs: [] type: TYPE_NORMAL zh: 陈三元,王程毅,陈正阳,吴宇,刘树杰,陈卓,李金宇,神田直之,吉冈拓也,肖雄等人。Wavlm:用于全栈语音处理的大规模自监督预训练。*IEEE信号处理精选主题期刊*,16(6):1505-1518,2022年。 - en: '[CPS16]' id: totrans-22 prefs: [] type: TYPE_NORMAL zh: '[CPS16]' - en: 'Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016\. [arXiv:1609.03193](https://arxiv.org/abs/1609.03193).' id: totrans-23 prefs: [] type: TYPE_NORMAL zh: Ronan Collobert,Christian Puhrsch和Gabriel Synnaeve。Wav2letter:一种端到端的基于卷积神经网络的语音识别系统。2016年。[arXiv:1609.03193](https://arxiv.org/abs/1609.03193)。 - en: '[CBC+20]' id: totrans-24 prefs: [] type: TYPE_NORMAL zh: '[CBC+20]' - en: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2020\. [arXiv:2006.13979](https://arxiv.org/abs/2006.13979). id: totrans-25 prefs: [] type: TYPE_NORMAL zh: Alexis Conneau,Alexei Baevski,Ronan Collobert,Abdelrahman Mohamed和Michael Auli。用于语音识别的无监督跨语言表示学习。2020年。[arXiv:2006.13979](https://arxiv.org/abs/2006.13979)。 - en: '[CY21]' id: totrans-26 prefs: [] type: TYPE_NORMAL zh: '[CY21]' - en: Erica Cooper and Junichi Yamagishi.
How do voices from past speech synthesis challenges compare today? *arXiv preprint arXiv:2105.02373*, 2021. id: totrans-27 prefs: [] type: TYPE_NORMAL zh: Erica Cooper和山岸纯一。过去语音合成挑战中的声音如何与今天相比?*arXiv预印本arXiv:2105.02373*,2021年。 - en: '[CPC+20]' id: totrans-28 prefs: [] type: TYPE_NORMAL zh: '[CPC+20]' - en: 'Joris Cosentino, Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. Librimix: an open-source dataset for generalizable speech separation. 2020\. [arXiv:2005.11262](https://arxiv.org/abs/2005.11262).' id: totrans-29 prefs: [] type: TYPE_NORMAL zh: Joris Cosentino,Manuel Pariente,Samuele Cornell,Antoine Deleforge和Emmanuel Vincent。Librimix:一个用于通用语音分离的开源数据集。2020年。[arXiv:2005.11262](https://arxiv.org/abs/2005.11262)。 - en: '[CSB+18]' id: totrans-30 prefs: [] type: TYPE_NORMAL zh: '[CSB+18]' - en: 'Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, and others. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. *arXiv preprint arXiv:1805.10190*, 2018.' id: totrans-31 prefs: [] type: TYPE_NORMAL zh: Alice Coucke,Alaa Saade,Adrien Ball,Théodore Bluche,Alexandre Caulier,David Leroy,Clément Doumouro,Thibault Gisselbrecht,Francesco Caltagirone,Thibaut Lavril等人。Snips语音平台:一种用于私密设计语音界面的嵌入式口语理解系统。*arXiv预印本arXiv:1805.10190*,2018年。 - en: '[DL82]' id: totrans-32 prefs: [] type: TYPE_NORMAL zh: '[DL82]' - en: DC Dowson and BV666017 Landau. The fréchet distance between multivariate normal distributions. *Journal of multivariate analysis*, 12(3):450–455, 1982. id: totrans-33 prefs: [] type: TYPE_NORMAL zh: DC道森和BV666017兰道。多元正态分布之间的弗雷歇距离。*多元分析杂志*,12(3):450-455,1982年。 - en: '[Defossez21]' id: totrans-34 prefs: [] type: TYPE_NORMAL zh: '[Defossez21]' - en: Alexandre Défossez. Hybrid spectrogram and waveform source separation. In *Proceedings of the ISMIR 2021 Workshop on Music Source Separation*. 2021. id: totrans-35 prefs: [] type: TYPE_NORMAL zh: 亚历山大·德福塞。混合谱图和波形源分离。在*ISMIR 2021音乐源分离研讨会论文集*中。2021年。 - en: '[GKRR14]' id: totrans-36 prefs: [] type: TYPE_NORMAL zh: '[GKRR14]' - en: 'Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: babel project research at cued. In *SLTU*. 2014.' id: totrans-37 prefs: [] type: TYPE_NORMAL zh: 马克·约翰·弗朗西斯·盖尔斯、凯特·尼尔、安东·拉格尼和沙克提·普拉萨德·拉特。低资源语言的语音识别和关键词检测:剑桥大学babel项目研究。在*SLTU*中。2014年。 - en: '[Gra12]' id: totrans-38 prefs: [] type: TYPE_NORMAL zh: '[Gra12]' - en: Alex Graves. Sequence transduction with recurrent neural networks. 2012\. [arXiv:1211.3711](https://arxiv.org/abs/1211.3711). id: totrans-39 prefs: [] type: TYPE_NORMAL zh: 亚历克斯·格雷夫斯。使用递归神经网络进行序列转导。2012年。[arXiv:1211.3711](https://arxiv.org/abs/1211.3711)。 - en: '[GL83]' id: totrans-40 prefs: [] type: TYPE_NORMAL zh: '[GL83]' - en: D. Griffin and Jae Lim. Signal estimation from modified short-time fourier transform. In *ICASSP '83\. IEEE International Conference on Acoustics, Speech, and Signal Processing*, volume 8, 804–807\. 1983\. [doi:10.1109/ICASSP.1983.1172092](https://doi.org/10.1109/ICASSP.1983.1172092). 
id: totrans-41 prefs: [] type: TYPE_NORMAL zh: D.格里芬和林杰。从修改后的短时傅里叶变换中估计信号。在*ICASSP '83。IEEE国际声学、语音和信号处理会议*中,卷8,804-807。1983年。[doi:10.1109/ICASSP.1983.1172092](https://doi.org/10.1109/ICASSP.1983.1172092)。 - en: '[GQC+20]' id: totrans-42 prefs: [] type: TYPE_NORMAL zh: '[GQC+20]' - en: 'Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: convolution-augmented transformer for speech recognition. 2020\. [arXiv:2005.08100](https://arxiv.org/abs/2005.08100).' id: totrans-43 prefs: [] type: TYPE_NORMAL zh: 安莫尔·古拉蒂、詹姆斯·秦、邱中成、尼基·帕马尔、张宇、余佳辉、韩伟、王世博、张正东、吴永辉和庞若明。Conformer:用于语音识别的卷积增强Transformer。2020年。[arXiv:2005.08100](https://arxiv.org/abs/2005.08100)。 - en: '[HCC+14]' id: totrans-44 prefs: [] type: TYPE_NORMAL zh: '[HCC+14]' - en: 'Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014\. [arXiv:1412.5567](https://arxiv.org/abs/1412.5567).' id: totrans-45 prefs: [] type: TYPE_NORMAL zh: 奥尼·汉农、卡尔·凯斯、贾里德·卡斯珀、布莱恩·卡坦扎罗、格雷格·迪阿莫斯、埃里希·埃尔森、瑞安·普伦格、桑杰夫·萨蒂什、舒博·森古普塔、亚当·科茨和安德鲁·Y. 吴。深度语音:扩展端到端语音识别。2014年。[arXiv:1412.5567](https://arxiv.org/abs/1412.5567)。 - en: '[HCE+17]' id: totrans-46 prefs: [] type: TYPE_NORMAL zh: '[HCE+17]' - en: 'Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In *International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. 2017\. URL: [https://arxiv.org/abs/1609.09430](https://arxiv.org/abs/1609.09430).' id: totrans-47 prefs: [] type: TYPE_NORMAL zh: 肖恩·赫尔希、索里什·乔杜里、丹尼尔·P. W. 艾利斯、约特·F. 格梅克、阿伦·詹森、查宁·摩尔、马诺杰·普拉卡尔、德文·普拉特、里夫·A. 索罗斯、布莱恩·塞伯尔德、马尔科姆·斯兰尼、罗恩·韦斯和凯文·威尔逊。用于大规模音频分类的CNN架构。在*国际声学、语音和信号处理会议(ICASSP)*中。2017年。网址:[https://arxiv.org/abs/1609.09430](https://arxiv.org/abs/1609.09430)。 - en: '[HIA+17]' id: totrans-48 prefs: [] type: TYPE_NORMAL zh: '[HIA+17]' - en: Takuya Higuchi, Nobutaka Ito, Shoko Araki, Takuya Yoshioka, Marc Delcroix, and Tomohiro Nakatani. Online mvdr beamformer based on complex gaussian mixture model with spatial prior for noise robust asr. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 25(4):780–793, 2017. id: totrans-49 prefs: [] type: TYPE_NORMAL zh: 樋口拓也、伊藤伸孝、荒木祥子、吉冈拓也、马克·德尔克罗伊和中谷智博。基于具有空间先验的复高斯混合模型的在线mvdr波束形成器,用于噪声鲁棒asr。*IEEE/ACM音频、语音和语言处理汇刊*,25(4):780-793,2017年。 - en: '[HIYN16]' id: totrans-50 prefs: [] type: TYPE_NORMAL zh: '[HIYN16]' - en: Takuya Higuchi, Nobutaka Ito, Takuya Yoshioka, and Tomohiro Nakatani. Robust mvdr beamforming using time-frequency masks for online/offline asr in noise. In *2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 5210–5214\. IEEE, 2016. id: totrans-51 prefs: [] type: TYPE_NORMAL zh: 樋口拓也、伊藤伸孝、吉冈拓也和中谷智博。使用时频掩模进行在线/离线噪声下的鲁棒mvdr波束形成。在*2016年IEEE国际声学、语音和信号处理会议(ICASSP)*中,5210-5214。IEEE,2016年。 - en: '[HBT+21]' id: totrans-52 prefs: [] type: TYPE_NORMAL zh: '[HBT+21]' - en: 'Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: self-supervised speech representation learning by masked prediction of hidden units. 2021\. [arXiv:2106.07447](https://arxiv.org/abs/2106.07447).'
id: totrans-53 prefs: [] type: TYPE_NORMAL zh: 徐伟宁、本杰明·博尔特、蔡耀宏、库沙尔·拉克霍蒂亚、鲁斯兰·萨拉胡特迪诺夫和阿卜杜勒拉曼·穆罕默德。Hubert:通过隐藏单元的掩码预测进行自监督语音表示学习。2021年。[arXiv:2106.07447](https://arxiv.org/abs/2106.07447)。 - en: '[IJ17]' id: totrans-54 prefs: [] type: TYPE_NORMAL zh: '[IJ17]' - en: Keith Ito and Linda Johnson. The lj speech dataset. [https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/), 2017. id: totrans-55 prefs: [] type: TYPE_NORMAL zh: 基思伊托和琳达约翰逊。LJ语音数据集。[https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/),2017年。 - en: '[KPL+22]' id: totrans-56 prefs: [] type: TYPE_NORMAL zh: '[KPL+22]' - en: 'Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for machine learning. *arXiv preprint arXiv:2201.12465*, 2022.' id: totrans-57 prefs: [] type: TYPE_NORMAL zh: 雅各布·卡恩、维尼尔·普拉塔普、塔蒂亚娜·利霍马年科、钱通徐、奥尼·汉农、杰夫·凯、帕登·托马塞洛、安·李、埃杜瓦·格雷夫、吉拉德·阿维多夫等。Flashlight:为机器学习工具创新提供支持。*arXiv预印本arXiv:2201.12465*,2022年。 - en: '[KES+18a]' id: totrans-58 prefs: [] type: TYPE_NORMAL zh: '[KES+18a]' - en: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018\. [arXiv:1802.08435](https://arxiv.org/abs/1802.08435). id: totrans-59 prefs: [] type: TYPE_NORMAL zh: 纳尔·卡尔布伦纳、埃里希·埃尔森、卡伦·西蒙扬、塞布·努里、诺曼·卡萨格兰德、爱德华·洛克哈特、弗洛里安·斯蒂姆伯格、亚伦·范登·奥尔德、桑德·迪勒曼和科雷·卡武克乔格卢。高效的神经音频合成。2018年。[arXiv:1802.08435](https://arxiv.org/abs/1802.08435)。 - en: '[KES+18b]' id: totrans-60 prefs: [] type: TYPE_NORMAL zh: '[KES+18b]' - en: 'Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. *CoRR*, 2018\. URL: [http://arxiv.org/abs/1802.08435](http://arxiv.org/abs/1802.08435), [arXiv:1802.08435](https://arxiv.org/abs/1802.08435).' id: totrans-61 prefs: [] type: TYPE_NORMAL zh: 纳尔·卡尔布伦纳、埃里希·埃尔森、卡伦·西蒙扬、塞布·努里、诺曼·卡萨格兰德、爱德华·洛克哈特、弗洛里安·斯蒂姆伯格、阿伦·范登·奥尔德、桑德·迪勒曼和科雷·卡武克乔格卢。高效的神经音频合成。*CoRR*,2018年。网址:[http://arxiv.org/abs/1802.08435](http://arxiv.org/abs/1802.08435),[arXiv:1802.08435](https://arxiv.org/abs/1802.08435)。 - en: '[KPPK15]' id: totrans-62 prefs: [] type: TYPE_NORMAL zh: '[KPPK15]' - en: Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. In *Proc. Interspeech 2015*, 3586–3589\. 2015\. [doi:10.21437/Interspeech.2015-711](https://doi.org/10.21437/Interspeech.2015-711). id: totrans-63 prefs: [] type: TYPE_NORMAL zh: Tom Ko,Vijayaditya Peddinti,Daniel Povey和Sanjeev Khudanpur。用于语音识别的音频增强。在*Interspeech 2015会议论文集*中,3586-3589。2015年。[doi:10.21437/Interspeech.2015-711](https://doi.org/10.21437/Interspeech.2015-711)。 - en: '[KBV03]' id: totrans-64 prefs: [] type: TYPE_NORMAL zh: '[KBV03]' - en: John Kominek, Alan W Black, and Ver Ver. Cmu arctic databases for speech synthesis. Technical Report, 2003. id: totrans-65 prefs: [] type: TYPE_NORMAL zh: John Kominek,Alan W Black和Ver Ver。用于语音合成的CMU北极数据库。技术报告,2003年。 - en: '[KKB20]' id: totrans-66 prefs: [] type: TYPE_NORMAL zh: '[KKB20]' - en: 'Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. 
Lin, editors, *Advances in Neural Information Processing Systems*, volume 33, 17022–17033\. Curran Associates, Inc., 2020\. URL: [https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf](https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf).' id: totrans-67 prefs: [] type: TYPE_NORMAL zh: Jungil Kong,Jaehyeon Kim和Jaekyoung Bae。Hifi-gan:用于高效和高保真度语音合成的生成对抗网络。在H. Larochelle,M. Ranzato,R. Hadsell,M.F. Balcan和H. Lin编辑的*神经信息处理系统进展*中,卷33,17022-17033。Curran Associates, Inc.,2020年。网址:[https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf](https://proceedings.neurips.cc/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf)。 - en: '[KTN+23]' id: totrans-68 prefs: [] type: TYPE_NORMAL zh: '[KTN+23]' - en: 'Anurag Kumar, Ke Tan, Zhaoheng Ni, Pranay Manocha, Xiaohui Zhang, Ethan Henderson, and Buye Xu. Torchaudio-squim: reference-less speech quality and intelligibility measures in torchaudio. *arXiv preprint arXiv:2304.01448*, 2023.' id: totrans-69 prefs: [] type: TYPE_NORMAL zh: Anurag Kumar,Ke Tan,Zhaoheng Ni,Pranay Manocha,Xiaohui Zhang,Ethan Henderson和Buye Xu。Torchaudio-squim:Torchaudio中无参考语音质量和可懂度测量。*arXiv预印本arXiv:2304.01448*,2023年。 - en: '[LRI+19]' id: totrans-70 prefs: [] type: TYPE_NORMAL zh: '[LRI+19]' - en: Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. Speech model pre-training for end-to-end spoken language understanding. In Gernot Kubin and Zdravko Kacic, editors, *Proc. of Interspeech*, 814–818\. 2019. id: totrans-71 prefs: [] type: TYPE_NORMAL zh: Loren Lugosch,Mirco Ravanelli,Patrick Ignoto,Vikrant Singh Tomar和Yoshua Bengio。端到端口语言理解的语音模型预训练。在Gernot Kubin和Zdravko Kacic编辑的*Interspeech会议论文集*中,814-818。2019年。 - en: '[LM19]' id: totrans-72 prefs: [] type: TYPE_NORMAL zh: '[LM19]' - en: 'Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 27(8):1256–1266, Aug 2019\. URL: [http://dx.doi.org/10.1109/TASLP.2019.2915167](http://dx.doi.org/10.1109/TASLP.2019.2915167), [doi:10.1109/taslp.2019.2915167](https://doi.org/10.1109/taslp.2019.2915167).' id: totrans-73 prefs: [] type: TYPE_NORMAL zh: Yi Luo和Nima Mesgarani。Conv-tasnet:超越理想的时频幅度屏蔽进行语音分离。*IEEE/ACM音频、语音和语言处理交易*,27(8):1256-1266,2019年8月。网址:[http://dx.doi.org/10.1109/TASLP.2019.2915167](http://dx.doi.org/10.1109/TASLP.2019.2915167),[doi:10.1109/taslp.2019.2915167](https://doi.org/10.1109/taslp.2019.2915167)。 - en: '[MK22]' id: totrans-74 prefs: [] type: TYPE_NORMAL zh: '[MK22]' - en: Pranay Manocha and Anurag Kumar. Speech quality assessment through mos using non-matching references. *arXiv preprint arXiv:2206.12285*, 2022. id: totrans-75 prefs: [] type: TYPE_NORMAL zh: Pranay Manocha和Anurag Kumar。使用非匹配参考进行MOS的语音质量评估。*arXiv预印本arXiv:2206.12285*,2022年。 - en: '[MRFB+15]' id: totrans-76 prefs: [] type: TYPE_NORMAL zh: '[MRFB+15]' - en: 'Xavier Anguera Miro, Luis Javier Rodriguez-Fuentes, Andi Buzo, Florian Metze, Igor Szoke, and Mikel Peñagarikano. Quesst2014: evaluating query-by-example speech search in a zero-resource setting with real-life queries. *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, pages 5833–5837, 2015.' 
id: totrans-77 prefs: [] type: TYPE_NORMAL zh: Xavier Anguera Miro,Luis Javier Rodriguez-Fuentes,Andi Buzo,Florian Metze,Igor Szoke和Mikel Peñagarikano。Quesst2014:在零资源环境中使用真实查询评估基于示例语音搜索。*2015年IEEE国际声学、语音和信号处理会议(ICASSP)*,2015年,页码5833-5837。 - en: '[MPG29]' id: totrans-78 prefs: [] type: TYPE_NORMAL zh: '[MPG29]' - en: RV Mises and Hilda Pollaczek-Geiringer. Praktische verfahren der gleichungsauflösung. *ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik*, 9(1):58–77, 1929. id: totrans-79 prefs: [] type: TYPE_NORMAL zh: RV Mises和Hilda Pollaczek-Geiringer。等式求解的实用方法。*ZAMM-应用数学和力学杂志/应用数学和力学杂志*,9(1):58-77,1929年。 - en: '[Mys14]' id: totrans-80 prefs: [] type: TYPE_NORMAL zh: '[Mys14]' - en: Gautham J Mysore. Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. *IEEE Signal Processing Letters*, 22(8):1006–1010, 2014. id: totrans-81 prefs: [] type: TYPE_NORMAL zh: Gautham J Mysore。我们能否自动将在真实环境中使用普通消费设备录制的语音转换为专业制作质量的语音?—数据集、见解和挑战。*IEEE信号处理通信*,22(8):1006-1010,2014年。 - en: '[NCZ17]' id: totrans-82 prefs: [] type: TYPE_NORMAL zh: '[NCZ17]' - en: 'Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. *arXiv preprint arXiv:1706.08612*, 2017.' id: totrans-83 prefs: [] type: TYPE_NORMAL zh: Arsha Nagrani,Joon Son Chung和Andrew Zisserman。Voxceleb:一个大规模的说话者识别数据集。*arXiv预印本arXiv:1706.08612*,2017年。 - en: '[PCPK15]' id: totrans-84 prefs: [] type: TYPE_NORMAL zh: '[PCPK15]' - en: 'Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 5206–5210\. 2015\. [doi:10.1109/ICASSP.2015.7178964](https://doi.org/10.1109/ICASSP.2015.7178964).' id: totrans-85 prefs: [] type: TYPE_NORMAL zh: Vassil Panayotov,Guoguo Chen,Daniel Povey和Sanjeev Khudanpur。Librispeech:基于公共领域有声书的ASR语料库。在*2015年IEEE国际声学、语音和信号处理会议(ICASSP)*中,卷,5206-5210。2015年。[doi:10.1109/ICASSP.2015.7178964](https://doi.org/10.1109/ICASSP.2015.7178964)。 - en: '[PCZ+19]' id: totrans-86 prefs: [] type: TYPE_NORMAL zh: '[PCZ+19]' - en: 'Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Specaugment: a simple data augmentation method for automatic speech recognition. *Interspeech 2019*, Sep 2019\. URL: [http://dx.doi.org/10.21437/Interspeech.2019-2680](http://dx.doi.org/10.21437/Interspeech.2019-2680), [doi:10.21437/interspeech.2019-2680](https://doi.org/10.21437/interspeech.2019-2680).' id: totrans-87 prefs: [] type: TYPE_NORMAL zh: Daniel S. Park,William Chan,Yu Zhang,Chung-Cheng Chiu,Barret Zoph,Ekin D. Cubuk和Quoc V. Le。Specaugment:一种用于自动语音识别的简单数据增强方法。*Interspeech 2019*,2019年9月。网址:[http://dx.doi.org/10.21437/Interspeech.2019-2680](http://dx.doi.org/10.21437/Interspeech.2019-2680),[doi:10.21437/interspeech.2019-2680](https://doi.org/10.21437/interspeech.2019-2680)。 - en: '[PBS13]' id: totrans-88 prefs: [] type: TYPE_NORMAL zh: '[PBS13]' - en: Nathanaël Perraudin, Peter Balazs, and Peter L. Søndergaard. A fast griffin-lim algorithm. In *2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics*, volume, 1–4\. 2013\. [doi:10.1109/WASPAA.2013.6701851](https://doi.org/10.1109/WASPAA.2013.6701851). id: totrans-89 prefs: [] type: TYPE_NORMAL zh: Nathanaël Perraudin,Peter Balazs和Peter L. 
Søndergaard。一种快速的Griffin-Lim算法。在*2013年IEEE信号处理应用研讨会*中,卷,1-4。2013年。[doi:10.1109/WASPAA.2013.6701851](https://doi.org/10.1109/WASPAA.2013.6701851)。 - en: '[PTS+23]' id: totrans-90 prefs: [] type: TYPE_NORMAL zh: '[PTS+23]' - en: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling speech technology to 1,000+ languages. 2023\. [arXiv:2305.13516](https://arxiv.org/abs/2305.13516). id: totrans-91 prefs: [] type: TYPE_NORMAL zh: Vineel Pratap,Andros Tjandra,Bowen Shi,Paden Tomasello,Arun Babu,Sayani Kundu,Ali Elkahky,Zhaoheng Ni,Apoorv Vyas,Maryam Fazel-Zarandi,Alexei Baevski,Yossi Adi,张晓辉,徐伟宁,Alexis Conneau和Michael Auli。将语音技术扩展到1000多种语言。2023年。arXiv:2305.13516。 - en: '[PXS+20]' id: totrans-92 prefs: [] type: TYPE_NORMAL zh: '[PXS+20]' - en: 'Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: a large-scale multilingual dataset for speech research. *Interspeech 2020*, Oct 2020\. URL: [http://dx.doi.org/10.21437/Interspeech.2020-2826](http://dx.doi.org/10.21437/Interspeech.2020-2826), [doi:10.21437/interspeech.2020-2826](https://doi.org/10.21437/interspeech.2020-2826).' id: totrans-93 prefs: [] type: TYPE_NORMAL zh: Vineel Pratap,Qiantong Xu,Anuroop Sriram,Gabriel Synnaeve和Ronan Collobert。MLS:用于语音研究的大规模多语言数据集。Interspeech 2020,2020年10月。URL:http://dx.doi.org/10.21437/Interspeech.2020-2826,doi:10.21437/interspeech.2020-2826。 - en: '[RLStoter+19]' id: totrans-94 prefs: [] type: TYPE_NORMAL zh: '[RLStoter+19]' - en: 'Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. MUSDB18-HQ - an uncompressed version of musdb18\. December 2019\. URL: [https://doi.org/10.5281/zenodo.3338373](https://doi.org/10.5281/zenodo.3338373), [doi:10.5281/zenodo.3338373](https://doi.org/10.5281/zenodo.3338373).' id: totrans-95 prefs: [] type: TYPE_NORMAL zh: Zafar Rafii,Antoine Liutkus,Fabian-Robert Stöter,Stylianos Ioannis Mimilakis和Rachel Bittner。MUSDB18-HQ - musdb18的未压缩版本。2019年12月。URL:https://doi.org/10.5281/zenodo.3338373,doi:10.5281/zenodo.3338373。 - en: '[RGC+20]' id: totrans-96 prefs: [] type: TYPE_NORMAL zh: '[RGC+20]' - en: 'Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, and others. The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results. *arXiv preprint arXiv:2005.13981*, 2020.' id: totrans-97 prefs: [] type: TYPE_NORMAL zh: Chandan KA Reddy,Vishak Gopal,Ross Cutler,Ebrahim Beyrami,Roger Cheng,Harishchandra Dubey,Sergiy Matusevych,Robert Aichner,Ashkan Aazami,Sebastian Braun等人。Interspeech 2020深度降噪挑战:数据集,主观测试框架和挑战结果。arXiv预印本arXiv:2005.13981,2020年。 - en: '[RDelegliseEsteve12]' id: totrans-98 prefs: [] type: TYPE_NORMAL zh: '[RDelegliseEsteve12]' - en: 'Anthony Rousseau, Paul Deléglise, and Yannick Estève. Ted-lium: an automatic speech recognition dedicated corpus. In *Conference on Language Resources and Evaluation (LREC)*, 125–129\. 2012.' id: totrans-99 prefs: [] type: TYPE_NORMAL zh: 安东尼·鲁索,保罗·德勒格利斯和亚尼克·埃斯特韦。Ted-lium:一种专用于自动语音识别的语料库。在语言资源和评估会议(LREC)中,125-129页。2012年。 - en: '[SY18]' id: totrans-100 prefs: [] type: TYPE_NORMAL zh: '[SY18]' - en: Seyyed Saeed Sarfjoo and Junichi Yamagishi. Device recorded vctk (small subset version). 2018. 
id: totrans-101 prefs: [] type: TYPE_NORMAL zh: Seyyed Saeed Sarfjoo和山岸淳一。设备录制的vctk(小型子集版本)。2018年。 - en: '[SBDokmanic18]' id: totrans-102 prefs: [] type: TYPE_NORMAL zh: '[SBDokmanic18]' - en: 'Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In *2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)*, 351–355\. IEEE, 2018.' id: totrans-103 prefs: [] type: TYPE_NORMAL zh: 罗宾·施伯勒,埃里克·贝扎姆和伊万·多克曼尼奇。Pyroomacoustics:用于音频房间模拟和阵列处理算法的Python软件包。在2018年IEEE国际声学、语音和信号处理会议(ICASSP)中,351-355页。IEEE,2018年。 - en: '[SPW+18]' id: totrans-104 prefs: [] type: TYPE_NORMAL zh: '[SPW+18]' - en: Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, and others. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In *2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 4779–4783\. IEEE, 2018. id: totrans-105 prefs: [] type: TYPE_NORMAL zh: 乔纳森·申,Ruoming Pang,Ron J Weiss,Mike Schuster,Navdeep Jaitly,Zongheng Yang,Zhifeng Chen,张宇,王宇轩,Rj Skerrv-Ryan等人。通过在mel频谱图预测上对wavenet进行条件化的自然tts合成。在2018年IEEE国际声学、语音和信号处理会议(ICASSP)中,4779-4783页。IEEE,2018年。 - en: '[SWW+21]' id: totrans-106 prefs: [] type: TYPE_NORMAL zh: '[SWW+21]' - en: 'Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In *ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 6783–6787\. 2021.' id: totrans-107 prefs: [] type: TYPE_NORMAL zh: 杨洋石,王永强,吴春阳,叶青峰,陈俊,张弗兰克,勒杜克和迈克·塞尔策。Emformer:用于低延迟流式语音识别的高效内存变压器基础声学模型。在ICASSP 2021 - 2021年IEEE国际声学、语音和信号处理会议(ICASSP)中,6783-6787页。2021年。 - en: '[SWW+22]' id: totrans-108 prefs: [] type: TYPE_NORMAL zh: '[SWW+22]' - en: Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, and Mike Seltzer. Streaming transformer transducer based speech recognition using non-causal convolution. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8277–8281\. 2022\. [doi:10.1109/ICASSP43922.2022.9747706](https://doi.org/10.1109/ICASSP43922.2022.9747706). id: totrans-109 prefs: [] type: TYPE_NORMAL zh: 杨洋石,春阳吴,迪林王,Alex Xiao,Jay Mahadeokar,张晓辉,刘春喜,李克,尚冠元,瓦伦·纳加拉贾,奥兹莱姆·卡林利和迈克·塞尔策。基于非因果卷积的流式变压器传导器语音识别。在ICASSP 2022 - 2022年IEEE国际声学、语音和信号处理会议(ICASSP)中,卷,8277-8281页。2022年。doi:10.1109/ICASSP43922.2022.9747706。 - en: '[Smi20]' id: totrans-110 prefs: [] type: TYPE_NORMAL zh: '[Smi20]' - en: 'Julius O. Smith. Digital audio resampling home page "theory of ideal bandlimited interpolation" section. September 2020\. URL: [https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html](https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html).' id: totrans-111 prefs: [] type: TYPE_NORMAL zh: 朱利叶斯·O·史密斯。数字音频重采样主页“理想带限插值理论”部分。2020年9月。URL:https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html。 - en: '[SCP15]' id: totrans-112 prefs: [] type: TYPE_NORMAL zh: '[SCP15]' - en: 'David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A Music, Speech, and Noise Corpus. 2015\. arXiv:1510.08484v1\. [arXiv:1510.08484](https://arxiv.org/abs/1510.08484).' 
id: totrans-113 prefs: [] type: TYPE_NORMAL zh: 大卫·斯奈德,陈国国和丹尼尔·波维。MUSAN:一个音乐、语音和噪声语料库。2015年。arXiv:1510.08484v1。arXiv:1510.08484。 - en: '[SBA09]' id: totrans-114 prefs: [] type: TYPE_NORMAL zh: '[SBA09]' - en: Mehrez Souden, Jacob Benesty, and Sofiene Affes. On optimal frequency-domain multichannel linear filtering for noise reduction. In *IEEE Transactions on audio, speech, and language processing*, volume 18, 260–276\. IEEE, 2009. id: totrans-115 prefs: [] type: TYPE_NORMAL zh: Mehrez Souden,Jacob Benesty和Sofiene Affes。关于噪声降低的最佳频域多通道线性滤波。在IEEE音频、语音和语言处理交易中,卷18,260-276页。IEEE,2009年。 - en: '[SWT+22]' id: totrans-116 prefs: [] type: TYPE_NORMAL zh: '[SWT+22]' - en: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, and Yatharth Saraf. Conformer-based self-supervised learning for non-speech audio tasks. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8862–8866\. 2022\. [doi:10.1109/ICASSP43922.2022.9746490](https://doi.org/10.1109/ICASSP43922.2022.9746490). id: totrans-117 prefs: [] type: TYPE_NORMAL zh: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, and Yatharth Saraf. Conformer-based self-supervised learning for non-speech audio tasks. In *ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, volume, 8862–8866\. 2022\. [doi:10.1109/ICASSP43922.2022.9746490](https://doi.org/10.1109/ICASSP43922.2022.9746490). - en: '[TEC01]' id: totrans-118 prefs: [] type: TYPE_NORMAL zh: '[TEC01]' - en: 'George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals. 2001\. URL: [http://ismir2001.ismir.net/pdf/tzanetakis.pdf](http://ismir2001.ismir.net/pdf/tzanetakis.pdf).' id: totrans-119 prefs: [] type: TYPE_NORMAL zh: 'George Tzanetakis, Georg Essl, and Perry Cook. Automatic musical genre classification of audio signals. 2001\. URL: [http://ismir2001.ismir.net/pdf/tzanetakis.pdf](http://ismir2001.ismir.net/pdf/tzanetakis.pdf).' - en: '[VAlumae21]' id: totrans-120 prefs: [] type: TYPE_NORMAL zh: '[VAlumae21]' - en: 'Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In *2021 IEEE Spoken Language Technology Workshop (SLT)*, 652–658\. IEEE, 2021.' id: totrans-121 prefs: [] type: TYPE_NORMAL zh: 'Jörgen Valk and Tanel Alumäe. Voxlingua107: a dataset for spoken language recognition. In *2021 IEEE Spoken Language Technology Workshop (SLT)*, 652–658\. IEEE, 2021.' - en: '[WRiviereL+21]' id: totrans-122 prefs: [] type: TYPE_NORMAL zh: '[WRiviereL+21]' - en: 'Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. *CoRR*, 2021\. URL: [https://arxiv.org/abs/2101.00390](https://arxiv.org/abs/2101.00390), [arXiv:2101.00390](https://arxiv.org/abs/2101.00390).' id: totrans-123 prefs: [] type: TYPE_NORMAL zh: 'Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. *CoRR*, 2021\. URL: [https://arxiv.org/abs/2101.00390](https://arxiv.org/abs/2101.00390), [arXiv:2101.00390](https://arxiv.org/abs/2101.00390).' 
- en: '[Wei98]' id: totrans-124 prefs: [] type: TYPE_NORMAL zh: '[Wei98]' - en: 'R.L. Weide. The carnegie mellon pronuncing dictionary. 1998\. URL: [http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).' id: totrans-125 prefs: [] type: TYPE_NORMAL zh: 'R.L. Weide. The carnegie mellon pronuncing dictionary. 1998\. URL: [http://www.speech.cs.cmu.edu/cgi-bin/cmudict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).' - en: '[YVM19]' id: totrans-126 prefs: [] type: TYPE_NORMAL zh: '[YVM19]' - en: 'Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019\. [doi:10.7488/ds/2645](https://doi.org/10.7488/ds/2645).' id: totrans-127 prefs: [] type: TYPE_NORMAL zh: 'Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019\. [doi:10.7488/ds/2645](https://doi.org/10.7488/ds/2645).' - en: '[ZDC+19]' id: totrans-128 prefs: [] type: TYPE_NORMAL zh: '[ZDC+19]' - en: 'Heiga Zen, Viet-Trung Dang, Robert A. J. Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Z. Chen, and Yonghui Wu. Libritts: a corpus derived from librispeech for text-to-speech. *ArXiv*, 2019.' id: totrans-129 prefs: [] type: TYPE_NORMAL zh: 'Heiga Zen, Viet-Trung Dang, Robert A. J. Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Z. Chen, and Yonghui Wu. Libritts: a corpus derived from librispeech for text-to-speech. *ArXiv*, 2019.' - en: '[ZSN21]' id: totrans-130 prefs: [] type: TYPE_NORMAL zh: '[ZSN21]' - en: Albert Zeyer, Ralf Schlüter, and Hermann Ney. Why does ctc result in peaky behavior? 2021\. [arXiv:2105.14849](https://arxiv.org/abs/2105.14849). id: totrans-131 prefs: [] type: TYPE_NORMAL zh: Albert Zeyer, Ralf Schlüter, and Hermann Ney. Why does ctc result in peaky behavior? 2021\. [arXiv:2105.14849](https://arxiv.org/abs/2105.14849). - en: '[BrianMcFeeColinRaffelDawenLiang+15]' id: totrans-132 prefs: [] type: TYPE_NORMAL zh: '[BrianMcFeeColinRaffelDawenLiang+15]' - en: 'Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. Librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, *Proceedings of the 14th Python in Science Conference*, 18 – 24\. 2015\. [doi:10.25080/Majora-7b98e3ed-003](https://doi.org/10.25080/Majora-7b98e3ed-003).' id: totrans-133 prefs: [] type: TYPE_NORMAL zh: 'Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. Librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, *Proceedings of the 14th Python in Science Conference*, 18 – 24\. 2015\. [doi:10.25080/Majora-7b98e3ed-003](https://doi.org/10.25080/Majora-7b98e3ed-003).' - en: '[KahnRiviereZheng+20]' id: totrans-134 prefs: [] type: TYPE_NORMAL zh: '[KahnRiviereZheng+20]' - en: 'J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In *ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 7669–7673\. 2020\. [https://github.com/facebookresearch/libri-light](https://github.com/facebookresearch/libri-light).' id: totrans-135 prefs: [] type: TYPE_NORMAL zh: 'J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. 
Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-light: a benchmark for asr with limited or no supervision. In *ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 7669–7673\. 2020\. [https://github.com/facebookresearch/libri-light](https://github.com/facebookresearch/libri-light).' - en: '[Warden18]' id: totrans-136 prefs: [] type: TYPE_NORMAL zh: '[Warden18]' - en: 'P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. *ArXiv e-prints*, April 2018\. URL: [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209), [arXiv:1804.03209](https://arxiv.org/abs/1804.03209).' id: totrans-137 prefs: [] type: TYPE_NORMAL zh: 'P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. *ArXiv e-prints*, April 2018\. URL: [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209), [arXiv:1804.03209](https://arxiv.org/abs/1804.03209).' - en: '[Wikipediacontributors]' id: totrans-138 prefs: [] type: TYPE_NORMAL zh: '[Wikipediacontributors]' - en: 'Wikipedia contributors. Absorption (acoustics) — Wikipedia, the free encyclopedia. [Online]. URL: [https://en.wikipedia.org/wiki/Absorption_(acoustics)](https://en.wikipedia.org/wiki/Absorption_(acoustics)).' id: totrans-139 prefs: [] type: TYPE_NORMAL zh: 'Wikipedia contributors. Absorption (acoustics) — Wikipedia, the free encyclopedia. [Online]. URL: [https://en.wikipedia.org/wiki/Absorption_(acoustics)](https://en.wikipedia.org/wiki/Absorption_(acoustics)).'