Paper Digest: Recent Papers on Speech Recognition
The Paper Digest Team extracted all recent Speech Recognition papers on our radar and generated highlight sentences for them. The results are sorted by relevance and date. In addition to this ‘static’ page, we also provide a real-time version of this article, which has broader coverage and is continuously updated to include the most recent work on this topic.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an AI-powered research platform that delivers personalized, comprehensive updates on the latest research in your field. It also helps you read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: Recent Papers on Speech Recognition
# | Paper | Author(s) | Source | Date
---|---|---|---|---
1 | Improving Named Entity Transcription with Contextual LLM-based Revision Highlight: In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. |
Viet Anh Trinh; Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-06-12 |
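The revision idea can be illustrated with a toy sketch (not the authors' implementation): fuzzy string matching against a local entity lexicon stands in for the LLM's reasoning step, and all function and variable names are illustrative.

```python
import difflib

def revise_named_entities(hypothesis, entity_lexicon, cutoff=0.7):
    """Replace ASR tokens that closely match a known named entity from
    local context (e.g. lecture notes). A real system would prompt an
    LLM with the lexicon; difflib fuzzy matching is a toy stand-in."""
    canonical = {e.lower(): e for e in entity_lexicon}
    revised = []
    for token in hypothesis.split():
        match = difflib.get_close_matches(token.lower(), canonical, n=1, cutoff=cutoff)
        revised.append(canonical[match[0]] if match else token)
    return " ".join(revised)

print(revise_named_entities("professor nuton taught gravity", ["Newton", "Leibniz"]))
# → professor Newton taught gravity
```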
2 | Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Highlight: This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems’ performance for aphasia speakers. |
Katelyn Xiaoying Mei; Anna Seo Gyeong Choi; Hilke Schellmann; Mona Sloane; Allison Koenecke; | arxiv-cs.CY | 2025-06-10 |
3 | SimClass: A Classroom Speech Dataset Generated Via Game Engine Simulation For Automatic Speech Recognition Research Highlight: In this paper, we introduce a scalable methodology for synthesizing classroom noise using game engines, a framework that extends to other domains. |
Ahmed Adel Attia; Jing Liu; Carl Espy-Wilson; | arxiv-cs.SD | 2025-06-10 |
4 | DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction Highlight: We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. |
SOLEE IM et al. | arxiv-cs.CL | 2025-06-09 |
5 | Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration Highlight: This mechanism presents several issues, including unnecessary energy consumption, communication latency, and the risk of a single point of failure. In this position paper, we propose a smart home concept based on offline speech recognition and IoT technology: 1) integrating offline keyword spotting (KWS) technologies into household appliances with resource-limited hardware to enable them to understand user voice commands; 2) designing a local IoT network with a decentralized architecture to manage and connect various devices, enhancing the robustness and scalability of the system. |
PENG HUANG et al. | arxiv-cs.SD | 2025-06-09 |
6 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Highlight: In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. |
Rui Hu; Xiaolong Lin; Jiawang Liu; Shixi Huang; Zhenpeng Zhan; | arxiv-cs.CL | 2025-06-09 |
7 | Technical Report: A Practical Guide to Kaldi ASR Optimization Highlight: This technical report introduces innovative optimizations for Kaldi-based Automatic Speech Recognition (ASR) systems, focusing on acoustic model enhancement, hyperparameter tuning, and language model efficiency. |
Mengze Hong; Di Jiang; | arxiv-cs.SD | 2025-06-08 |
8 | Speech Recognition on TV Series with Video-guided Post-Correction Highlight: Existing multimodal approaches fail to correct ASR outputs with the rich temporal and contextual information available in video. To address this limitation, we propose a novel multimodal post-correction framework that refines ASR transcriptions by leveraging contextual cues extracted from video. |
Haoyuan Yang; Yue Zhang; Liqiang Jing; | arxiv-cs.SD | 2025-06-08 |
9 | Automatic Speech Recognition of African American English: Lexical and Contextual Effects Highlight: This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. |
Hamid Mojarad; Kevin Tang; | arxiv-cs.CL | 2025-06-07 |
10 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Highlight: In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives and selects a final system output. |
CHRISTOPHE VAN GYSEL et al. | arxiv-cs.CL | 2025-06-06 |
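A rescorer of this kind can be sketched as a linear combination of candidate scores; the weights, score values, and field names below are illustrative assumptions, not the paper's actual model.

```python
def rescore(candidates, lam=0.6):
    """Pick the final output by combining an ASR model score with a
    phonetic-match score (both log-probability-like; higher = better)."""
    return max(candidates, key=lambda c: lam * c["asr"] + (1 - lam) * c["phonetic"])

best = rescore([
    {"text": "play songs by bay on say", "asr": -1.2, "phonetic": -3.0},
    {"text": "play songs by Beyonce", "asr": -1.5, "phonetic": -0.4},
])
print(best["text"])  # → play songs by Beyonce
```

Here the phonetic alternative wins despite a slightly worse ASR score, which is the point of adding phonetically motivated candidates the end-to-end system would otherwise miss.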
11 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition Highlight: This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. |
Thai-Binh Nguyen; Thi Van Nguyen; Quoc Truong Do; Chi Mai Luong; | arxiv-cs.CL | 2025-06-05 |
12 | LLM-based Phoneme-to-grapheme for Phoneme-based Speech Recognition Highlight: However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). |
Te Ma; Min Bi; Saierdaer Yusuyin; Hao Huang; Zhijian Ou; | arxiv-cs.SD | 2025-06-05 |
13 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Highlight: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. |
Wen Ding; Fan Qian; | arxiv-cs.CL | 2025-06-04 |
14 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR Via Whisper Fine-Tuning Highlight: We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. |
Ömer Tarik Özyilmaz; Matt Coler; Matias Valdenegro-Toro; | arxiv-cs.CL | 2025-06-03 |
15 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation Highlight: To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. |
Verena Blaschke; Miriam Winkler; Constantin Förster; Gabriele Wenger-Glemser; Barbara Plank; | arxiv-cs.CL | 2025-06-03 |
16 | Whale: Large-Scale Multilingual ASR Model with W2v-BERT and E-Branchformer with Large Speech Data Highlight: This paper reports on the development of a large-scale speech recognition model, Whale. |
Yosuke Kashiwagi; Hayato Futami; Emiru Tsunoo; Satoshi Asakawa; | arxiv-cs.CL | 2025-06-02 |
17 | HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation Highlight: Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. |
AMIR HUSSEIN et al. | arxiv-cs.CL | 2025-06-02 |
18 | Fine-Tuning ASR for Stuttered Speech: Personalized Vs. Generalized Approaches Highlight: In this paper, we investigate fine-tuning ASRs for stuttered speech, comparing generalized models (trained across multiple speakers) to personalized models tailored to individual speech characteristics. |
Dena Mujtaba; Nihar Mahapatra; | arxiv-cs.SD | 2025-06-01 |
19 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Highlight: Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data’s vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. |
Yu Nakagome; Michael Hentschel; | arxiv-cs.CL | 2025-06-01 |
20 | Causal Structure Discovery for Error Diagnostics of Children’s ASR Highlight: Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies, such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. |
Vishwanath Pratap Singh; Md. Sahidullah; Tomi Kinnunen; | arxiv-cs.CL | 2025-05-31 |
21 | Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC Highlight: This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. |
Qingzheng Wang; Jiancheng Sun; Yifan Peng; Shinji Watanabe; | arxiv-cs.SD | 2025-05-30 |
22 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Highlight: However, directly using LLMs can run into the hallucination problem, which may lead to modification of correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-task iterative correction, and (3) reasoning process verification. |
YANGUI FANG et al. | arxiv-cs.CL | 2025-05-30 |
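The three stages could be mocked up as follows; this is a minimal sketch of the control flow only, with a vocabulary check standing in for LLM-based detection and a lookup table standing in for chain-of-thought correction (all names are hypothetical).

```python
def detect_errors(text, vocab):
    """Stage 1 (toy): flag token positions outside a known vocabulary."""
    return [i for i, tok in enumerate(text.split()) if tok not in vocab]

def correct(text, flagged, fixes):
    """Stage 2 (toy): correct only flagged positions, mimicking
    constrained iterative correction."""
    toks = text.split()
    for i in flagged:
        toks[i] = fixes.get(toks[i], toks[i])
    return " ".join(toks)

def verify(original, corrected, flagged):
    """Stage 3 (toy): reject the edit if any unflagged token changed,
    guarding against hallucinated rewrites of already-correct text."""
    o, c = original.split(), corrected.split()
    unchanged = len(o) == len(c) and all(
        o[i] == c[i] for i in range(len(o)) if i not in flagged)
    return corrected if unchanged else original

hyp = "the kat sat on the mat"
flags = detect_errors(hyp, vocab={"the", "cat", "sat", "on", "mat"})
fixed = verify(hyp, correct(hyp, flags, {"kat": "cat"}), flags)
print(fixed)  # → the cat sat on the mat
```

The verification stage is what distinguishes this family of frameworks from naive LLM post-editing: corrections are only accepted when they stay within the pre-detected error spans.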
23 | Pseudo Labels-based Neural Speech Enhancement for The AVSR Task in The MISP-Meeting Challenge Highlight: This paper presents our system for the MISP-Meeting Challenge Track 2. |
Longjie Luo; Shenghui Lu; Lin Li; Qingyang Hong; | arxiv-cs.SD | 2025-05-30 |
24 | Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry Highlight: In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. |
SUJEET KUMAR et al. | arxiv-cs.CL | 2025-05-30 |
25 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR Highlight: In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). |
Dimitrios Damianos; Georgios Paraskevopoulos; Alexandros Potamianos; | arxiv-cs.CL | 2025-05-30 |
26 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection Highlight: To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. |
Griffin Dietz Smith; Dianna Yee; Jennifer King Chen; Leah Findlater; | arxiv-cs.LG | 2025-05-29 |
27 | Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition Highlight: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. |
YOUJUN CHEN et al. | arxiv-cs.SD | 2025-05-29 |
28 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation Highlight: In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. |
Zhennan Lin; Kaixun Huang; Wei Ren; Linju Yang; Lei Xie; | arxiv-cs.SD | 2025-05-29 |
29 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge Highlight: For the automatic speech recognition (ASR) system, we propose an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. |
SHANGKUN HUANG et al. | arxiv-cs.SD | 2025-05-28 |
30 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR Highlight: This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. |
MINGCHEN SHAO et al. | arxiv-cs.SD | 2025-05-28 |
31 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition Highlight: We firmly believe the AISHELL-5 dataset will significantly advance research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark. |
YUHANG DAI et al. | arxiv-cs.SD | 2025-05-28 |
32 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis Highlight: This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. |
Stefan Bleeck; | arxiv-cs.SD | 2025-05-28 |
33 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection Highlight: This paper proposes an LLM-driven ASR-SED multi-task learning framework that jointly optimizes the ASR and Stuttering Event Detection (SED) tasks. |
Shangkun Huang; Jing Deng; Jintao Kang; Rong Zheng; | arxiv-cs.SD | 2025-05-28 |
34 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Highlight: Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. |
TIANYI XU et al. | arxiv-cs.CL | 2025-05-27 |
35 | Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use Highlight: This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. |
Titouan Parcollet; Yuan Tseng; Shucong Zhang; Rogier van Dalen; | arxiv-cs.CL | 2025-05-27 |
36 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision Highlight: In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. |
ZHAOQING LI et al. | arxiv-cs.SD | 2025-05-27 |
37 | GMU Systems for The IWSLT 2025 Low-Resource Speech Translation Shared Task Highlight: This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. |
Chutong Meng; Antonios Anastasopoulos; | arxiv-cs.CL | 2025-05-27 |
38 | KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Highlight: This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. |
ZHAOLIN LI et al. | arxiv-cs.CL | 2025-05-26 |
39 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation Highlight: In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. |
Dancheng Liu; Amir Nassereldine; Chenhui Xu; Jinjun Xiong; | arxiv-cs.CL | 2025-05-26 |
40 | In-context Language Learning for Endangered Languages in Speech Recognition Highlight: Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). |
Zhaolin Li; Jan Niehues; | arxiv-cs.CL | 2025-05-26 |
41 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition Highlight: We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. |
Raphaël Bagat; Irina Illina; Emmanuel Vincent; | arxiv-cs.CL | 2025-05-26 |
42 | Exploring Generative Error Correction for Dysarthric Speech Recognition Highlight: In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). |
Moreno La Quatra; Alkis Koudounas; Valerio Mario Salerno; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-05-26 |
43 | BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM Highlight: While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion which negatively impacts the final performance. |
Xun Gong; Anqi Lv; Zhiming Wang; Huijia Zhu; Yanmin Qian; | arxiv-cs.SD | 2025-05-25 |
44 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models Highlight: Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of smaller models for better performance. To address these issues, we introduce an effective two-step representation-learning approach capable of producing several small-sized models from a single large model while ensuring considerably better performance in a limited number of epochs. |
Abdul Hannan; Alessio Brutti; Shah Nawaz; Mubashir Noman; | arxiv-cs.CV | 2025-05-22 |
45 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Highlight: This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. |
Tianduo Wang; Lu Xu; Wei Lu; Shanbo Cheng; | arxiv-cs.CL | 2025-05-22 |
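The pipeline shape of speech back-translation can be sketched as below; this is an assumption-laden toy (the `tts` callable and `keep` filter are placeholders for a real TTS model and a quality check), not the paper's implementation.

```python
def back_translate(corpus, tts, keep=None):
    """Turn a text corpus into (synthetic_audio, text) ASR training
    pairs via a TTS model. `tts` is any callable text -> waveform;
    `keep` is an optional quality filter (e.g. a round-trip ASR check)."""
    pairs = []
    for text in corpus:
        audio = tts(text)
        if keep is None or keep(audio, text):
            pairs.append((audio, text))
    return pairs

# fake "TTS" for illustration only: waveform length proportional to text
pairs = back_translate(["hello world", "goodbye"], tts=lambda t: [0.0] * len(t))
print(len(pairs))  # → 2
```

In a real pipeline the synthetic pairs would then be mixed with genuine transcribed speech for ASR training, which is where the "tens of hours to tens of thousands" scaling comes from.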
46 | Large Language Models Based ASR Error Correction for Child Conversations Highlight: In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. |
ANFENG XU et al. | arxiv-cs.CL | 2025-05-22 |
47 | Differentiable K-means for Fully-optimized Discrete Token-based ASR Highlight: This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. |
Kentaro Onda; Yosuke Kashiwagi; Emiru Tsunoo; Hayato Futami; Shinji Watanabe; | arxiv-cs.SD | 2025-05-22 |
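One common way to make k-means assignment differentiable (no claim that this matches the paper's exact formulation) is to replace the hard argmin with a softmax over negative distances:

```python
import math

def soft_kmeans_assign(x, centroids, temperature=1.0):
    """Soft (differentiable) cluster assignment: a softmax over negative
    squared distances instead of a hard argmin, so gradients can flow
    from the downstream task back into the tokenizer."""
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    logits = [-d / temperature for d in d2]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

weights = soft_kmeans_assign([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
print(round(weights[0], 3))  # nearest centroid gets most of the mass
```

Lowering the temperature sharpens the distribution toward the hard k-means assignment, a standard annealing trick for such relaxations.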
48 | Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: An Analytical Study Towards Accent-robust ASR Only with Native Speech Data Highlight: In this study, we gained insight that contributes to achieving accent-robust ASR using only native speech data. |
Kentaro Onda; Keisuke Imoto; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2025-05-21 |
49 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty Highlight: To overcome these, we propose SIMA, a selective invocation for multilingual ASR that adapts to the difficulty level of the input speech. |
Hongfei Xue; Yufeng Tang; Jun Zhang; Xuelong Geng; Lei Xie; | arxiv-cs.SD | 2025-05-21 |
50 | Prosodically Enhanced Foreign Accent Simulation By Discrete Token-based Resynthesis Only with Native Speech Corpora Highlight: In this paper, we integrate duration modification into the previous method to simulate foreign accents more accurately. |
Kentaro Onda; Keisuke Imoto; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2025-05-21 |
51 | PersonaTAB: Predicting Personality Traits Using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs Highlight: We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. |
Sho Inoue; Shai Wang; Haizhou Li; | arxiv-cs.SD | 2025-05-20 |
52 | Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Highlight: In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. |
HAOYANG ZHANG et al. | arxiv-cs.CL | 2025-05-20 |
53 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice … |
CHIN-JOU LI et al. | arxiv-cs.CL | 2025-05-20 |
54 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits Highlight: We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. |
TIANTIAN FENG et al. | arxiv-cs.SD | 2025-05-20 |
55 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition Highlight: In this work, we present our submission to the Speech Accessibility Project challenge for dysarthric speech recognition. |
DOMINIK WAGNER et al. | arxiv-cs.SD | 2025-05-19 |
56 | LegoSLM: Connecting LLM with Speech Encoder Using CTC Posteriors Highlight: In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. |
Rao Ma; Tongzhou Chen; Kartik Audhkhasi; Bhuvana Ramabhadran; | arxiv-cs.CL | 2025-05-16 |
57 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio Highlight: This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. |
Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-05-16 |
58 | ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems Highlight: In this work, we introduce the ASR-FAIRBENCH leaderboard, which is designed to assess both the accuracy and equity of ASR models in real time. |
Anand Rai; Satyam Rahangdale; Utkarsh Anand; Animesh Mukherjee; | arxiv-cs.SD | 2025-05-16 |
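Equity-oriented ASR evaluation of this kind generally starts from per-group error rates; a minimal sketch (the actual leaderboard's fairness-adjusted metric is more involved) is to compute word error rate (WER) separately per demographic group:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over words (one-row DP)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            # deletion, insertion, or substitution/match
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
            prev = cur
    return d[-1] / len(r)

def group_wer(samples):
    """Aggregate WER per group from (group, reference, hypothesis) triples,
    weighting each utterance by its reference length."""
    totals = {}
    for group, ref, hyp in samples:
        errs, words = totals.get(group, (0.0, 0))
        n = len(ref.split())
        totals[group] = (errs + wer(ref, hyp) * n, words + n)
    return {g: errs / words for g, (errs, words) in totals.items()}

scores = group_wer([
    ("A", "turn the lights on", "turn the lights on"),
    ("B", "turn the lights on", "turn a light on"),
])
print(scores)  # group B shows a higher WER than group A
```

A gap between per-group WERs is the raw signal such benchmarks then fold into a single fairness-aware score.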
59 | Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions Highlight: Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, including data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. |
SUKAIRAJ HAFIZ IMAM et al. | arxiv-cs.CL | 2025-05-16 |
60 | Multi-Stage Speaker Diarization for Noisy Classrooms Highlight: This study investigates the effectiveness of multi-stage diarization models using Nvidia’s NeMo diarization pipeline. |
Ali Sartaz Khan; Tolulope Ogunremi; Ahmed Adel Attia; Dorottya Demszky; | arxiv-cs.SD | 2025-05-16 |
61 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Highlight: This paper reports the construction of Teochew-Wild, a speech corpus of the Teochew dialect. |
Linrong Pan; Chenglong Jiang; Gaoze Hou; Ying Gao; | arxiv-cs.CL | 2025-05-08 |
62 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Highlight: However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. |
ZUWEI LONG et al. | arxiv-cs.CL | 2025-05-06 |
63 | Pisets: A Robust Speech Recognition System for Lectures and Interviews Highlight: This work presents a speech-to-text system, “Pisets”, for scientists and journalists, based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. |
IVAN BONDARENKO et al. | naacl | 2025-05-04 |
64 | Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions Highlight: We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. |
YANG LI et al. | naacl | 2025-05-04 |
65 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et al. | naacl | 2025-05-04 |
66 | AMPS: ASR with Multimodal Paraphrase Supervision Highlight: In this work, we present a new technique, AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. |
Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | naacl | 2025-05-04 |
67 | Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs Highlight: Wav2Prompt uses a straightforward training process with only the same data used to train an automatic speech recognition (ASR) model. |
Keqi Deng; Guangzhi Sun; Phil Woodland; | naacl | 2025-05-04 |
68 | Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction Highlight: In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). |
Máté Gedeon; | arxiv-cs.CL | 2025-04-30 |
69 | BERSting at The Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. |
PAIGE TUTTÖSÍ et. al. | arxiv-cs.CL | 2025-04-30 |
70 | Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset with Lip-reading and Presentation Slides Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. |
JINGHUA ZHAO et. al. | arxiv-cs.MM | 2025-04-21 |
71 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. |
MAHMOUD SALHAB et. al. | arxiv-cs.AI | 2025-04-16 |
72 | Dysarthric Speech Conformer: Adaptation for Sequence-to-Sequence Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a two-phase adaptation pipeline based on the Conformer architecture that leverages typical speech to transfer to individualized ASR models for dysarthric speakers. |
Q. Wang; | icassp | 2025-04-15 |
73 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a RAG-like technique for correcting speech recognition entity name errors. |
E. Pusateri; | icassp | 2025-04-15 |
74 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
J. Zhou; | icassp | 2025-04-15 |
75 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
H. Wang; | icassp | 2025-04-15 |
76 | From Characters to Subwords: Modeling Unit Conversion for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel low-resource ASR method that leverages the advantages of two different modeling units. |
Y. Wang; H. Zhang; H. Wang; L. Sun; M. Song; | icassp | 2025-04-15 |
77 | Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a Fast estimator for WER (Fe-WER) is introduced, utilizing average pooling over self-supervised learning representations for speech and text. |
C. Park; C. Lu; M. Chen; T. Hain; | icassp | 2025-04-15 |
78 | ASR Benchmarking: Need for A More Representative Conversational Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |
G. Maheshwari; D. Ivanov; T. Johannet; K. El Haddad; | icassp | 2025-04-15 |
79 | Advancing Streaming ASR with Chunk-wise Attention and Trans-chunk Selective State Spaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores enhancing streaming speech recognition through the integration of chunk-wise attention and selective state space models (SSMs). |
M. Mimura; T. Moriya; K. Matsuura; | icassp | 2025-04-15 |
80 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
C. -C. WANG et. al. | icassp | 2025-04-15 |
81 | Revise, Reason, and Recognize: LLM-Based Emotion Recognition Via Emotion-Specific Prompts and ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. |
Y. Li; Y. Gong; C. -H. H. Yang; P. Bell; C. Lai; | icassp | 2025-04-15 |
82 | Mamba for Streaming ASR Combined with Unimodal Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. |
Y. Fang; X. Li; | icassp | 2025-04-15 |
83 | ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenge of high annotation costs in text-to-speech (TTS) generation, this paper introduces a semi-supervised learning framework specifically designed for low-resource TTS scenarios. |
F. Li; S. Chen; H. Yang; S. Yuan; | icassp | 2025-04-15 |
84 | Improving Dialect Identification in Indian Languages Using Multimodal Features from Dialect Informed ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces a novel multimodal architecture that leverages speech and text features to enhance DID performance. |
icassp | 2025-04-15 | |
85 | Speech Recognition for Automatically Assessing Afrikaans and IsiXhosa Preschool Oral Narratives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. |
C. JACOBS et. al. | icassp | 2025-04-15 |
86 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
T. Parcollet; R. van Dalen; S. Zhang; S. Bhattacharya; | icassp | 2025-04-15 |
87 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition Using Agnostic Contrastive Mixup Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose a new domain adaptation method for low-resource ASR focused on contrastive mixup for joint-embedding architectures named AC-Mix (agnostic contrastive mixup). |
C. Carvalho; A. Abad; | icassp | 2025-04-15 |
88 | Self-Information Guided Speech Segmentation for Efficient Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel method that leverages self-information, a measure of the information contained within an utterance, as a supervisory signal for speech segmentation. |
W. S. Teo; Y. Minami; | icassp | 2025-04-15 |
89 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
Z. Zhuang; | icassp | 2025-04-15 |
90 | Contextual ASR with Retrieval Augmented Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose leveraging large language models (LLMs) and retrieval-augmented generation (RAG) to enhance the contextual capabilities of ASR systems. |
C. Xiao; Z. Hou; D. Garcia-Romero; K. J. Han; | icassp | 2025-04-15 |
91 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose StableQuant, a novel adaptive post-training quantization (PTQ) algorithm for widely used speech foundation models (SFMs). |
Y. Hong; H. Han; W. -J. Chung; H. -G. Kang; | icassp | 2025-04-15 |
92 | EFL-PEFT: A Communication Efficient Federated Learning Framework Using PEFT Sparsification for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we consolidate the use of PEFT for ASR with pre-trained models, demonstrating that it enables efficient FL by reducing the number of parameters to share relative to full fine-tuning. |
M. N. Ali; D. Falavigna; A. Brutti; | icassp | 2025-04-15 |
93 | Generating Targeted Universal Adversarial Perturbation Against Automatic Speech Recognition Via Phoneme Tailoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, to improve attack ability, we propose a Diverse Audio Composition Enrichment method, which enhances the utilization of audio features through phoneme-level slicing and recombination. |
Y. ZHANG et. al. | icassp | 2025-04-15 |
94 | Chain-of-Thought Prompting for Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
K. Hu; | icassp | 2025-04-15 |
95 | META-CAT: Speaker-Informed Speech Embeddings Via Meta Information Concatenation for Multi-talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. |
J. Wang; | icassp | 2025-04-15 |
96 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). |
N. Moritz; | icassp | 2025-04-15 |
97 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
L. Meng; | icassp | 2025-04-15 |
98 | Improved Recognition of The Speech of People with Parkinson’s Who Stutter Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a novel stuttered speech data augmentation approach to improve dysarthric speech recognition. |
J. Na; X. Zheng; B. Lee; M. Hasegawa-Johnson; | icassp | 2025-04-15 |
99 | A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions and long latency. |
Y. Jiang; B. Tian; | icassp | 2025-04-15 |
100 | PersoDA: Personalized Data Augmentation for Personalized ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes PersoDA, a DA method driven by the user's data to personalize ASR [1]–[3]. |
P. P. Parada; | icassp | 2025-04-15 |
101 | Continuously Learning New Words in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Typical errors include acronyms, named entities, and domain-specific special words for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach: Given the audio of a lecture talk with the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. |
C. Huber; A. Waibel; | icassp | 2025-04-15 |
102 | Harnessing The Zero-Shot Power of Instruction-Tuned Large Language Model for Guiding End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). |
Y. Higuchi; T. Ogawa; T. Kobayashi; | icassp | 2025-04-15 |
103 | Towards A Single ASR Model That Generalizes to Disordered Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study investigates the impact of integrating a dataset of disordered speech recordings (~1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. |
J. Tobin; K. Tomanek; S. Venugopalan; | icassp | 2025-04-15 |
104 | Joint Training Framework for Accent and Speech Recognition Based on Conformer Low-Rank Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces the Conformer Low-rank Adaptation for Joint Accent and Speech Recognition (CLAnSR), employing LoRA to augment both ASR and AR capabilities using a shared pre-trained base encoder. |
X. Zhuang; Y. Qian; S. Xu; M. Wang; | icassp | 2025-04-15 |
105 | Speech Recognition with LLMs Adapted to Disordered Speech Using Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. |
C. NAGPAL et. al. | icassp | 2025-04-15 |
106 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition Via Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. |
N. L. Kühne; | icassp | 2025-04-15 |
107 | Alignment-Free Training for Transducer-based Multi-Talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. |
T. Moriya; | icassp | 2025-04-15 |
108 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) scenario in Automatic Speech Recognition (ASR). |
F. ZHANG et. al. | icassp | 2025-04-15 |
109 | ValSub: Subsampling Validation Data to Mitigate Forgetting During ASR Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, such validation sets are large and impractical for mobile devices. Towards this, we propose a novel method to subsample a substantially large validation set into a smaller one while maintaining the ability to estimate forgetting. |
H. Mehmood; | icassp | 2025-04-15 |
110 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. |
J. HU et. al. | icassp | 2025-04-15 |
111 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. |
T. -B. Nguyen; A. Waibel; | icassp | 2025-04-15 |
112 | Zero-resource Speech Translation and Recognition with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
K. Mundnich; | icassp | 2025-04-15 |
113 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
T. Feng; | icassp | 2025-04-15 |
114 | MmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. |
S. Basak; A. Padarthi; M. Gowda; | icassp | 2025-04-15 |
115 | Using Corrected ASR Projection to Improve AD Recognition Performance from Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, Automatic Speech Recognition transcription errors, stemming from language impairments in AD and Mild Cognitive Impairment patients, can lead to information loss during feature extraction. To mitigate this, we introduce the Corrected ASR Projection (CAP) model. |
Y. ZHANG et. al. | icassp | 2025-04-15 |
116 | Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel contextual biasing approach that employs the Multi-label Synchronous Output CTC (MCTC) algorithm to enhance the synchronization between ASR and bias task outputs. |
M. Fang; | icassp | 2025-04-15 |
117 | Speech Enhancement with MAP-based Training for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, this study proposes a maximum a posteriori (MAP) algorithm for training SE models by incorporating the posterior probability of clean speech, given the enhanced speech, into the loss function. |
Y. -J. Li; R. Chao; B. Su; Y. Tsao; | icassp | 2025-04-15 |
118 | Audio Diffusion with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore an alternate approach to the popular method of using large language models (LLMs) as a second decoder for Automated Speech Recognition (ASR) and speech understanding tasks. |
Y. Huang; K. Kastner; K. Audhkhasi; B. Ramabhadran; A. Rosenberg; | icassp | 2025-04-15 |
119 | Speech Re-Painting for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce speech re-painting, a method for in-context augmented synthesis, using target training datasets to generate new utterances guided by speech and text on the fly in a zero-shot manner. |
K. Kastner; | icassp | 2025-04-15 |
120 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
T. Abu; Y. Shi; T. F. Zheng; D. Wang; | icassp | 2025-04-15 |
121 | Speech Emotion Recognition Based on Large-Scale Automatic Speech Recognizer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel speech emotion recognition (SER) method that fully leverages the architecture of Whisper, a large-scale automatic speech recognition (ASR) model. |
R. Fukuda; T. Kano; A. Ando; A. Ogawa; | icassp | 2025-04-15 |
122 | Learning Rich Speech Representations with Acoustic-Semantic Factorization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Second, the entanglement of acoustic and semantic information can undermine model robustness, particularly in varied acoustic environments. To address these issues, we propose a two-branch multitask finetuning strategy that integrates Automatic Speech Recognition and transcript-aligned audio reconstruction, designed to preserve and disentangle semantic and acoustic information in a final layer of a pretrained model. |
M. Niu; | icassp | 2025-04-15 |
123 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. |
A. A. Attia; D. Demszky; T. Ògúnremí; J. Liu; C. Espy-Wilson; | icassp | 2025-04-15 |
124 | Elevating Robust ASR By Decoupling Multi-Channel Speaker Separation and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to decouple the training of the multi-channel speaker separation frontend and the ASR backend, with the latter trained only on clean speech. |
Y. Yang; H. Taherian; V. A. Kalkhorani; D. Wang; | icassp | 2025-04-15 |
125 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
T. Hori; M. Kocour; A. Haider; E. McDermott; X. Zhuang; | icassp | 2025-04-15 |
126 | Identifying and Mitigating Mismatched Language Code in Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This language mismatch can significantly reduce ASR quality. We present a technique to identify and mitigate this issue. |
J. Kim; | icassp | 2025-04-15 |
127 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
N. B. Shankar; Z. Wang; E. Eren; A. Alwan; | icassp | 2025-04-15 |
128 | LLM Supervised Pre-training for Multimodal Emotion Recognition in Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. |
S. Dutta; S. Ganapathy; | icassp | 2025-04-15 |
129 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
B. Yan; V. Pratap; S. Watanabe; M. Auli; | icassp | 2025-04-15 |
130 | Speech Few-Shot Learning for Language Learners’ Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper reports how speech recognition accuracy can be improved using the speech few-shot in-context learning capabilities of a multimodal foundation model when applied to the speech of language learners. |
J. Cheng; S. Nguyen; | icassp | 2025-04-15 |
131 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
Z. Cui; | icassp | 2025-04-15 |
132 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. |
H. Yen; S. Ling; G. Ye; | icassp | 2025-04-15 |
133 | Bridging The Modality Gap for Speech-image Retrieval with Text Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to leverage text supervision to facilitate the alignment between speech and image feature spaces via an automatic speech recognition (ASR) auxiliary task. |
Y. Yang; L. Zhou; Y. Li; G. Ma; | icassp | 2025-04-15 |
134 | Chinese Speech Processing Via Chinese Character Feature Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper focuses on a basic structural feature of Chinese characters, semantic-phonetic compound characters, and innovatively proposes a Chinese speech-processing method based on character shape. |
R. Jiang; Z. Yang; W. Xi; X. Fu; J. Zhao; | icassp | 2025-04-15 |
135 | LLM Based Text Generation for Improved Low-resource Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Prompting a large language model (LLM) to paraphrase input text can generate novel text data that is constrained to be semantically similar to the source data. We leverage this capability of LLMs to improve the performance of low-resource ASR systems by increasing the limited text training data while keeping the same spoken style. |
T. Nagano; | icassp | 2025-04-15 |
136 | Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While various methods have enhanced recognition accuracy in SE-ASR systems, they often require fine-tuning or re-training of SE or ASR models, which is impractical in many real-world applications. In this paper, we propose a lightweight distortion suppression (DS) network that addresses these artifacts without modifying the SE or ASR models, treating them as fixed black boxes. |
W. Wang; S. Zhao; Y. Qian; | icassp | 2025-04-15 |
137 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). |
G. CHENG et. al. | icassp | 2025-04-15 |
138 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, training large ASR models from scratch remains costly. To address this issue, we introduce UME, a novel method that efficiently Upcycles pretrained dense ASR checkpoints into larger Mixture-of-Experts (MoE) architectures. |
L. FU et. al. | icassp | 2025-04-15 |
139 | Automatic Speech Recognition and Spoken Language Understanding of Maritime Radio Communications: A Case Study with Singapore Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents several contributions designed to improve the ASR and the SLU systems by releasing a dataset for ASR and SLU tasks in the maritime domain. |
P. Dat; J. M. Madhathil; T. Huy Dat; | icassp | 2025-04-15 |
140 | Enhancing Multilingual ASR for Unseen Languages Via Language Embedding Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, despite its success, Whisper struggles with unseen languages, which are not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose a method that exploits these relationships to improve ASR performance of Whisper in unseen languages. |
S. -S. Huang; K. -P. Huang; A. T. Liu; H. -Y. Lee; | icassp | 2025-04-15 |
141 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Z. Tang; D. Wang; S. Huang; S. Shang; | icassp | 2025-04-15 |
142 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. |
Y. Yang; | icassp | 2025-04-15 |
143 | Speech Retrieval-Augmented Generation Without Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. |
D. J. MIN et. al. | icassp | 2025-04-15 |
144 | Injecting Visual Features Into Whisper for Parameter-Efficient Noise-Robust Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This gap highlights the need for more efficient methods to leverage visual and acoustic information in AVSR tasks. To address this challenge, we propose AVWhisper, a parameter-efficient model that integrates visual and acoustic representations by injecting visual features from the AV-HuBERT encoder into the pre-trained Whisper model. |
Z. Yang; | icassp | 2025-04-15 |
145 | Token-Level Contextual Network with Ladder-Shaped Attention for End-to-End ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: More importantly, we propose a creative approach to address the challenge posed by the expanded token list, which grows considerably larger than the original phrase list. |
M. Fang; | icassp | 2025-04-15 |
146 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
E. Sarkar; M. Magimai.-Doss; | icassp | 2025-04-15 |
147 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. |
L. Grigoryan; N. Karpov; E. Albasiri; V. Lavrukhin; B. Ginsburg; | icassp | 2025-04-15 |
148 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. |
J. Zhao; | icassp | 2025-04-15 |
149 | Extending Whisper for Emotion Prediction Using Word-level Pseudo Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper extends Whisper’s automatic speech recognition (ASR) capabilities to perform speech-based emotion recognition (SER) by incorporating word-level emotion classification alongside ASR output. |
C. Y. KWOK et. al. | icassp | 2025-04-15 |
150 | Speech Recognition Rescoring with Large Speech-Text Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose novel techniques to use multi-modal LLM for ASR rescoring. |
P. G. Shivakumar; | icassp | 2025-04-15 |
151 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
M. BARAŃSKI et. al. | icassp | 2025-04-15 |
152 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. |
R. Nasretdinov; R. Korostik; A. Jukić; | icassp | 2025-04-15 |
153 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
A. S. Gyeong Choi; J. Park; M. Oh; | icassp | 2025-04-15 |
154 | Enhancing Low-Resource ASR Through Versatile TTS: Bridging The Data Gap Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. |
G. Yang; | icassp | 2025-04-15 |
155 | Adopting Whisper for Confidence Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. |
V. Aggarwal; S. S. Nair; Y. Verma; Y. Jogi; | icassp | 2025-04-15 |
156 | Uncovering The Visual Contribution in Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper assesses AVSR systems from a different perspective, by considering human speech perception. |
Z. Lin; N. Harte; | icassp | 2025-04-15 |
157 | Speech Slytherin: Examining The Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: It is too early to conclude that Mamba is a better alternative to transformers for speech without comparing the two in terms of both performance and efficiency across multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. |
X. Jiang; Y. A. Li; A. Nicolas Florea; C. Han; N. Mesgarani; | icassp | 2025-04-15 |
158 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
J. Zhou; | icassp | 2025-04-15 |
159 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
U. Cappellazzo; | icassp | 2025-04-15 |
160 | Dynamic Language Group-based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
H. Huang; | icassp | 2025-04-15 |
161 | Adaptive Decoding for Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an adaptive decoding method (ADD) to reduce the latency. |
X. Ma; | icassp | 2025-04-15 |
162 | Can Automated Speech Recognition Errors Provide Valuable Clues for Alzheimer’s Disease Detection? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Finally, we conduct an interpretability study, including linguistic and SHapley Additive exPlanations (SHAP) analyses. This study reveals that greater word distribution differences between AD and healthy control (HC) groups in ASR transcripts may be linked to these valuable clues. |
Y. -L. Liu; | icassp | 2025-04-15 |
163 | Visual-Aware Speech Recognition for Noisy Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. |
Lakshmipathi Balaji; Karan Singla; | arxiv-cs.CL | 2025-04-09 |
164 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect’s linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets — comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. |
Hedi Naouara; Jean-Pierre Lorré; Jérôme Louradour; | arxiv-cs.CL | 2025-04-03 |
165 | Chain of Correction for Full-text Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. |
ZHIYUAN TANG et. al. | arxiv-cs.CL | 2025-04-02 |
166 | Whispering Under The Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we concentrate on using adversarial examples to mitigate unauthorized disclosure of speech privacy thwarted by potential eavesdroppers in speech communications. |
WEIFEI JIN et. al. | arxiv-cs.CR | 2025-04-01 |
167 | The Impact of Code-switched Synthetic Data Quality Is Task Dependent: Insights from MT and ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. |
Injy Hamed; Ngoc Thang Vu; Nizar Habash; | arxiv-cs.CL | 2025-03-30 |
168 | VALLR: Visual ASR Language Model for Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. |
Marshall Thomas; Edward Fish; Richard Bowden; | arxiv-cs.CV | 2025-03-27 |
169 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. |
YANGYANG MENG et. al. | arxiv-cs.CL | 2025-03-26 |
170 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research. |
DAWIT KETEMA GETE et. al. | arxiv-cs.CL | 2025-03-24 |
171 | Elevating Robust Multi-Talker ASR By Decoupling Speaker Separation and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. |
Yufeng Yang; Hassan Taherian; Vahid Ahmadi Kalkhorani; DeLiang Wang; | arxiv-cs.SD | 2025-03-22 |
172 | Your Voice Is Your Voice: Supporting Self-expression Through Speech Generation and LLMs in Augmented and Alternative Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users’ expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). |
YIWEN XU et. al. | arxiv-cs.HC | 2025-03-21 |
173 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
174 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We conducted a user study with 75 participants to evaluate the feasibility and efficiency of this workflow. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
175 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. |
AOHAN ZENG et. al. | iclr | 2025-03-17 |
176 | Speech Robust Bench: A Robustness Benchmark For Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. |
Muhammad A Shah; David Solans Noguero; Mikko A. Heikkilä; Bhiksha Raj; Nicolas Kourtellis; | iclr | 2025-03-17 |
177 | CR-CTC: Consistency Regularization on CTC for Improved Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. |
ZENGWEI YAO et. al. | iclr | 2025-03-17 |
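The CR-CTC highlight above describes a concrete mechanism: enforcing consistency between two CTC output distributions computed from differently augmented views of the same utterance. A minimal numpy sketch of that general idea follows; the function names, the symmetric-KL choice, and the random logits are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax over the label dimension.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def consistency_loss(logits_a, logits_b):
    """Symmetric KL between per-frame label distributions from two augmented views."""
    log_p, log_q = log_softmax(logits_a), log_softmax(logits_b)
    p, q = np.exp(log_p), np.exp(log_q)
    kl_pq = (p * (log_p - log_q)).sum(axis=-1)
    kl_qp = (q * (log_q - log_p)).sum(axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

rng = np.random.default_rng(0)
frames, vocab = 50, 32  # per-frame CTC logits over a small label vocabulary
view_a = rng.normal(size=(frames, vocab))
view_b = view_a + rng.normal(scale=0.5, size=(frames, vocab))  # perturbed view
loss_same = consistency_loss(view_a, view_a)  # identical views agree exactly
loss_diff = consistency_loss(view_a, view_b)  # diverging views are penalized
print(loss_same, loss_diff)
```

In training, a loss of this shape would typically be added to the standard CTC objectives of both views with a weighting coefficient.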
178 | T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis Via Multitask Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce T2V2 (**T**ext to **V**oice and **V**oice to **T**ext), a unified non-autoregressive model capable of performing both automatic speech recognition (ASR) and text-to-speech (TTS) synthesis within the same framework. |
Nabarun Goswami; Hanqin Wang; Tatsuya Harada; | iclr | 2025-03-17 |
179 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches, examine their limitations on this dataset, and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively. |
Soumya Shamarao Jahagirdar; Jayasree Saha; C V Jawahar; | arxiv-cs.CV | 2025-03-11 |
180 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By detailing the performance of several of the most recent, widely-available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases. |
Michael McGuire; | arxiv-cs.CL | 2025-03-10 |
181 | Adaptive Audio-Visual Speech Recognition Via Matryoshka-Based Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. |
Umberto Cappellazzo; Minsu Kim; Stavros Petridis; | arxiv-cs.CV | 2025-03-08 |
182 | Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. |
Lucas Block Medin; Thomas Pellegrini; Lucile Gelin; | arxiv-cs.SD | 2025-03-06 |
183 | Direct Speech to Speech Translation: A Review Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our review examines the evolution of S2ST, comparing traditional cascade models which rely on automatic speech recognition (ASR), machine translation (MT), and text to speech (TTS) components with newer end to end and direct speech translation (DST) models that bypass intermediate text representations. |
Mohammad Sarim; Saim Shakeel; Laeeba Javed; Mohammad Nadeem; | arxiv-cs.CL | 2025-03-03 |
184 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study explores fine-tuning OpenAI’s Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. |
Samuel S. Sohn; Sten Knutsen; Karin Stromswold; | arxiv-cs.SD | 2025-03-03 |
185 | Unveiling Biases While Embracing Sustainability: Assessing The Dual Challenges of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a bias and sustainability focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performances. |
Ajinkya Kulkarni; Atharva Kulkarni; Miguel Couceiro; Isabel Trancoso; | arxiv-cs.CL | 2025-03-02 |
186 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. |
Keisuke Kamahori; Jungo Kasai; Noriyuki Kojima; Baris Kasikci; | arxiv-cs.LG | 2025-02-27 |
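The LiteASR highlight above names low-rank approximation of ASR encoder weights as the compression mechanism. Below is a generic truncated-SVD factorization sketch showing how a dense layer can be split into two smaller factors; the shapes, rank, and synthetic weight matrix are illustrative assumptions, not LiteASR's calibration procedure.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace a dense weight matrix W with two low-rank factors A @ B via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # shape (out_dim, rank), singular values folded in
    B = Vt[:rank, :]            # shape (rank, in_dim)
    return A, B

rng = np.random.default_rng(1)
# Synthetic 256x256 weight with a decaying spectrum, as often seen in trained layers.
W = rng.normal(size=(256, 256)) @ np.diag(np.linspace(1.0, 0.01, 256))
A, B = low_rank_factorize(W, rank=64)
params_before = W.size                 # 65536 parameters
params_after = A.size + B.size         # 32768 parameters: a 2x reduction
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(params_after, rel_err)
```

At inference, `x @ W.T` is replaced by `(x @ B.T) @ A.T`, trading a small approximation error for fewer parameters and multiply-accumulates.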
187 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents the development of ASR models fine-tuned specifically for Southeast Asian accents using a newly created dataset. |
MARCUS YU ZHE WEE et. al. | arxiv-cs.LG | 2025-02-27 |
188 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2025-02-26 |
189 | Speed Master: Quick or Slow Play to Attack Speaker Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our paper presents a novel attack methodology named Speed Master, which undermines deep neural networks by manipulating the speed of speech samples. |
ZHE YE et. al. | aaai | 2025-02-25 |
190 | Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. |
YIHAN WU et. al. | aaai | 2025-02-25 |
191 | Exploring Gender Disparities in Automatic Speech Recognition Technology Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study investigates factors influencing Automatic Speech Recognition (ASR) systems’ fairness and performance across genders, beyond the conventional examination of demographics. |
Hend ElGhazaly; Bahman Mirheidari; Nafise Sadat Moosavi; Heidi Christensen; | arxiv-cs.CL | 2025-02-25 |
192 | Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | aaai | 2025-02-25 |
193 | Uncertainty-Aware Self-Training for CTC-Based Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the current sequence-level uncertainty estimation method for connectionist temporal classification (CTC) based ASR models drops the output probability information and depends only on the textual distance of decoded predictions. In this study, we argue that this results in limited performance improvement and propose a novel output probability-based sequence-level uncertainty estimation method. |
Eungbeom Kim; Kyogu Lee; | aaai | 2025-02-25 |
194 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). |
KULUHAN BINICI et. al. | aaai | 2025-02-25 |
195 | Silent Speech Sentence Recognition with Six-Axis Accelerometers Using Conformer and CTC Algorithm Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A novel silent speech sentence recognition method is proposed to convert the facial motion signals collected by six-axis accelerometers into transcribed words and sentences. |
YUDONG XIE et. al. | arxiv-cs.HC | 2025-02-24 |
196 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, specifically focusing on enhancing rare word recognition performance. |
Haoxuan Wang; | arxiv-cs.CL | 2025-02-22 |
197 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. |
JENALEA RAJAB et. al. | arxiv-cs.CL | 2025-02-21 |
198 | On The Robust Approximation of ASR Metrics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. |
Abdul Waheed; Hanin Atwany; Rita Singh; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
199 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. |
Hanin Atwany; Abdul Waheed; Rita Singh; Monojit Choudhury; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
200 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. |
XIANGYU LU et. al. | arxiv-cs.CL | 2025-02-16 |
201 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We analyze three main architectural approaches: Network Speech Recognition (NSR), Distributed Speech Recognition (DSR), and Embedded Speech Recognition (ESR). |
Inaam F. Qutaiba I. Ali; | arxiv-cs.NI | 2025-02-10 |
202 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et. al. | arxiv-cs.CL | 2025-02-06 |
203 | Integrating Automatic Speech Recognition Into Remote Healthcare Interpreting: A Pilot Study of Its Impact on Interpreting Quality Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. |
Shiyi Tan; Constantin Orăsan; Sabine Braun; | arxiv-cs.CL | 2025-02-05 |
204 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC’s scaling issues. |
MARTIJN BARTELDS et. al. | arxiv-cs.LG | 2025-02-03 |
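The CTC-DRO highlight above describes smoothing the group weight update so that consistently high-loss language groups are not overemphasized. The sketch below shows one plausible smoothed variant of the exponentiated-gradient group-DRO step (an exponential moving average over group losses); the smoothing rule, step sizes, and loss values are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def group_dro_step(weights, ema_losses, losses, eta=0.1, beta=0.9):
    """One smoothed group-DRO update: EMA the per-group losses, then reweight."""
    # Smooth each group's loss before the update (illustrative smoothing only),
    # so a single consistently high loss does not dominate immediately.
    ema_losses = beta * ema_losses + (1 - beta) * losses
    weights = weights * np.exp(eta * ema_losses)  # exponentiated-gradient step
    weights = weights / weights.sum()             # project back to the simplex
    return weights, ema_losses

w = np.full(3, 1 / 3)          # start with uniform weights over 3 language groups
ema = np.zeros(3)
losses = np.array([2.0, 1.0, 0.5])  # group 0 is the hardest
for _ in range(20):
    w, ema = group_dro_step(w, ema, losses)
print(w)
```

The hardest group's weight grows fastest, but the EMA damps how quickly it does so relative to the unsmoothed update.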
205 | When End-to-End Is Overkill: Rethinking Cascaded Speech-to-Text Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. |
Anna Min; Chenxu Hu; Yi Ren; Hang Zhao; | arxiv-cs.CL | 2025-02-01 |
206 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
Turi Abu; Ying Shi; Thomas Fang Zheng; Dong Wang; | arxiv-cs.CL | 2025-02-01 |
207 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
Anna Seo Gyeong Choi; Jonghyeon Park; Myungwoo Oh; | arxiv-cs.CL | 2025-02-01 |
208 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. |
Zhengdong Yang; Qianying Liu; Sheng Li; Fei Cheng; Chenhui Chu; | arxiv-cs.CL | 2025-01-29 |
209 | AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we introduce the AVE speech dataset, a comprehensive multi-modal benchmark for speech recognition tasks. |
DONGLIANG ZHOU et. al. | arxiv-cs.SD | 2025-01-28 |
210 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study investigates the prevalence and impact of ASR errors in medical transcription in Nigeria, the United Kingdom, and the United States. |
Ayo Adedeji; Mardhiyah Sanni; Emmanuel Ayodele; Sarita Joshi; Tobi Olatunji; | arxiv-cs.CL | 2025-01-25 |
211 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. |
XUELONG GENG et. al. | arxiv-cs.SD | 2025-01-22 |
212 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. |
Moreno La Quatra; Valerio Mario Salerno; Yu Tsao; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-01-22 |
213 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
MATEUSZ BARAŃSKI et. al. | arxiv-cs.SD | 2025-01-20 |
214 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
Takaaki Hori; Martin Kocour; Adnan Haider; Erik McDermott; Xiaodan Zhuang; | arxiv-cs.CL | 2025-01-15 |
215 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
Natarajan Balaji Shankar; Zilai Wang; Eray Eren; Abeer Alwan; | arxiv-cs.CL | 2025-01-14 |
216 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. |
JILIANG HU et. al. | arxiv-cs.SD | 2025-01-13 |
217 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. |
Fabian David Schmidt; Ivan Vulić; Goran Glavaš; David Ifeoluwa Adelani; | arxiv-cs.CL | 2025-01-10 |
218 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
Eklavya Sarkar; Mathew Magimai.-Doss; | arxiv-cs.LG | 2025-01-10 |
219 | Universal-2-TF: Robust All-Neural Text Formatting for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). |
Yash Khare; Taufiquzzaman Peyash; Andrea Vanzo; Takuya Yoshioka; | arxiv-cs.CL | 2025-01-10 |
220 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. |
Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya; | arxiv-cs.CL | 2025-01-10 |
221 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state space models (SSMs). |
Syed Abdul Gaffar Shakhadri; Kruthika KR; Kartik Basavaraj Angadi; | arxiv-cs.CL | 2025-01-06 |
222 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
ZHONGJIAN CUI et. al. | arxiv-cs.SD | 2025-01-05 |
223 | Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of “listening and seeing again”. |
Rui Liu; Hongyu Yuan; Haizhou Li; | arxiv-cs.MM | 2025-01-03 |
224 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. |
Or Haim Anidjar; Revital Marbel; Roi Yozevitch; | arxiv-cs.CL | 2024-12-31 |
225 | Zero-resource Speech Translation and Recognition with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
KAREL MUNDNICH et. al. | arxiv-cs.CL | 2024-12-24 |
226 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. |
JIAHUI ZHAO et. al. | arxiv-cs.CL | 2024-12-21 |
227 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
HE WANG et. al. | arxiv-cs.SD | 2024-12-17 |
228 | Style-agnostic Evaluation of ASR Using Multiple Reference Transcripts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. |
QUINTEN MCNAMARA et. al. | arxiv-cs.CL | 2024-12-10 |
229 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. |
Yerin Choi; Jeehyun Lee; Myoung-Wan Koo; | arxiv-cs.SD | 2024-12-04 |
230 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We apply LLMs to ASR error correction in three paradigms. |
Victor Junqiu Wei; Weicheng Wang; Di Jiang; Yuanfeng Song; Lu Wang; | arxiv-cs.CL | 2024-12-04 |
231 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. |
AOHAN ZENG et. al. | arxiv-cs.CL | 2024-12-03 |
232 | Advancing CTC Models for Better Speech Alignment: A Topological Approach Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) systems often face challenges in alignment quality, particularly with the Connectionist Temporal Classification (CTC) approach, which frequently … |
Zeyu Zhao; Peter Bell; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
233 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, in this work, we aim to explore the capability of LLMs in low resource ASR and Mandarin-English code switching ASR. |
Zheshu Song; Ziyang Ma; Yifan Yang; Jianheng Zhuo; Xie Chen; | arxiv-cs.AI | 2024-12-01 |
234 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). |
GEOFFREY TYNDALL et. al. | arxiv-cs.CL | 2024-11-27 |
235 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. |
Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.CL | 2024-11-27 |
236 | How to Learn A New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. |
SHIH-HENG WANG et. al. | arxiv-cs.SD | 2024-11-27 |
237 | Aligning Pre-trained Models for Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). |
Šimon Sedláček; Santosh Kesiraju; Alexander Polok; Jan Černocký; | arxiv-cs.CL | 2024-11-27 |
238 | Comparative Analysis of ASR Methods for Speech Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. |
DAVIDE SALVI et. al. | arxiv-cs.SD | 2024-11-26 |
239 | High-precision Medical Speech Recognition Through Synthetic Data and Semantic Correction: UNITED-MEDASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) systems in the clinical domain face significant challenges, notably the need to recognise specialised medical vocabulary accurately and meet … |
Sourav Banerjee; Ayushi Agarwal; Promila Ghosh; | ArXiv | 2024-11-24 |
240 | CAFE A Novel Code Switching Dataset for Algerian Dialect French and English Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The paper introduces and publicly releases (Data download link available after acceptance) CAFE — the first code-switching dataset between Algerian dialect, French, and English. |
HOUSSAM EDDINE-OTHMAN LACHEMAT et. al. | arxiv-cs.SD | 2024-11-20 |
241 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR Using Zero-Shot TTS and LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. |
JIAWEI YU et. al. | arxiv-cs.CL | 2024-11-20 |
242 | Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on The Edge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. |
RUIYANG QIN et. al. | arxiv-cs.SD | 2024-11-20 |
243 | BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. |
MD. NAZMUS SADAT SAMIN et. al. | arxiv-cs.CL | 2024-11-16 |
244 | Interactive Cycle Model: The Linkage Combination Among Automatic Speech Recognition, Large Language Models and Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This research proposes the interaction loop model ASR-LLMs-Smart Glasses, which combines automatic speech recognition, large language models, and smart glasses to facilitate seamless human-computer interaction. |
Libo Wang; | arxiv-cs.HC | 2024-11-15 |
245 | Everyone Deserves Their Voice to Be Heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. |
Rik Raes; Saskia Lensink; Mykola Pechenizkiy; | arxiv-cs.CL | 2024-11-14 |
246 | Optimizing Entity Resolution in Voice Interfaces: An ASR-Aware Entity Reference Expansion Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Navigating the equilibrium between accuracy and online retrieval’s speed requirement proves challenging, particularly when limited data links the failed mentions to resolved entities. In this paper, we propose an entity reference expansion system, injecting pairs of failed mentions and resolved entity names into the knowledge graph, enhancing its awareness of unresolved mentions. |
Jiangning Chen; Ziyun Zhang; Qianli Hu; | emnlp | 2024-11-11 |
247 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | emnlp | 2024-11-11 |
248 | Task Arithmetic Can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods suffer in performance when fine-tuning an automatic speech recognition (ASR) model on synthetic data, due to the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task arithmetic is effective at mitigating this gap. |
Hsuan Su; Hua Farn; Fan-Yun Sun; Shang-Tse Chen; Hung-yi Lee; | emnlp | 2024-11-11 |
249 | BLSP-Emo: Towards Empathetic Large Speech-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. |
CHEN WANG et. al. | emnlp | 2024-11-11 |
250 | Advancing Test-Time Adaptation in Wild Acoustic Test Settings Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. |
Hongfu Liu; Hengguan Huang; Ye Wang; | emnlp | 2024-11-11 |
251 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel approach effectively utilizing audio-related image information and set up VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. |
JILIANG HU et. al. | emnlp | 2024-11-11 |
252 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. |
YEONJOON JUNG et. al. | emnlp | 2024-11-11 |
253 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; | emnlp | 2024-11-11 |
254 | Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. |
Giuseppe Attanasio; Beatrice Savoldi; Dennis Fucci; Dirk Hovy; | emnlp | 2024-11-11 |
255 | Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). |
Yoshiki Masuyama; Koichi Miyazaki; Masato Murata; | arxiv-cs.SD | 2024-11-11 |
256 | Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Following this framework, we introduce Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR. |
Guan-Ting Lin; Wei Ping Huang; Hung-yi Lee; | emnlp | 2024-11-11 |
257 | Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this approach we aim to build ASR models for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. |
Leena G Pillai; Kavya Manohar; Basil K Raju; Elizabeth Sherly; | arxiv-cs.CL | 2024-11-07 |
258 | Dialectal Coverage And Generalization in Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. |
Amirbek Djanibekov; Hawau Olamide Toyin; Raghad Alshalan; Abdullah Alitr; Hanan Aldarmaki; | arxiv-cs.CL | 2024-11-07 |
259 | Enhancing AAC Software for Dysarthric Speakers in E-Health Settings: An Evaluation Using TORGO Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Prompt-overlap is a well-known issue with this dataset where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. |
Macarious Hui; Jinda Zhang; Aanchan Mohan; | arxiv-cs.CL | 2024-11-01 |
260 | Evaluation of Speech Translation Subtitles Generated By ASR with Unnecessary Word Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study addresses the problem of generating understandable speech translation subtitles for spontaneous speech, such as lectures and talks, which often contain disfluencies … |
Makoto Hotta; Chee Siang Leow; N. Kitaoka; Hiromitsu Nishizaki; | 2024 IEEE 13th Global Conference on Consumer Electronics … | 2024-10-29 |
261 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach that first refines all available transcriptions to ensure data reliability. |
Enshi Zhang; Christian Poellabauer; | arxiv-cs.CL | 2024-10-27 |
262 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. |
Hawau Olamide Toyin; Hao Li; Hanan Aldarmaki; | arxiv-cs.CL | 2024-10-24 |
263 | Evaluating Automatic Speech Recognition Systems for Korean Meteorological Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. |
ChaeHun Park; Hojun Cho; Jaegul Choo; | arxiv-cs.CL | 2024-10-24 |
264 | MmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. |
Suryoday Basak; Abhijeeth Padarthi; Mahanth Gowda; | arxiv-cs.SD | 2024-10-22 |
265 | DENOASR: Debiasing ASRs Through Selective Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel framework DENOASR, which is a selective denoising technique to reduce the disparity in the word error rates between the two gender groups, male and female. |
Anand Kumar Rai; Siddharth D Jaiswal; Shubham Prakash; Bendi Pragnya Sree; Animesh Mukherjee; | arxiv-cs.SD | 2024-10-22 |
266 | VoiceBench: Benchmarking LLM-Based Voice Assistants Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field. |
YIMING CHEN et. al. | arxiv-cs.CL | 2024-10-22 |
267 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. |
Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | arxiv-cs.CL | 2024-10-17 |
268 | Investigation of Speaker Representation for Target-Speaker Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network for embedding target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper aims to address a fundamental question: what is the preferred speaker embedding for TS tasks? |
TAKANORI ASHIHARA et. al. | arxiv-cs.SD | 2024-10-14 |
269 | Automatic Speech Recognition with BERT and CTC Transformers: A Review IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: All in all, this review provides valuable insights for researchers and practitioners who are interested in ASR with BERT and CTC transformers. |
Noussaiba Djeffal; Hamza Kheddar; Djamel Addou; Ahmed Cherif Mazari; Yassine Himeur; | arxiv-cs.CL | 2024-10-12 |
270 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, as well as compiling a dataset comprising Indonesian speech with variabilities to facilitate our study. |
AULIA ADILA et. al. | arxiv-cs.CL | 2024-10-11 |
271 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we hypothesize that incorporating speaker representations during speech recognition can enhance model robustness to noise. |
Sagarika Alavilli; Annesya Banerjee; Gasser Elbanna; Annika Magaro; | arxiv-cs.SD | 2024-10-07 |
272 | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et. al. | nips | 2024-10-07 |
273 | REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. |
LIANG-HSUAN TSENG et. al. | nips | 2024-10-07 |
274 | Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for The Polish Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. |
Michał Junczyk; | nips | 2024-10-07 |
275 | Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Large language models (LLMs) have started to play a vital role in modelling speech and text. |
Pavel Stepachev; Pinzhen Chen; Barry Haddow; | arxiv-cs.CL | 2024-10-04 |
276 | Reverb: Open-Source ASR and Diarization from Rev Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as … |
NISHCHAL BHANDARI et. al. | arxiv-cs.CL | 2024-10-04 |
277 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The following paper presents an alternative approach to generating compressed spectrogram representations, based on Convolutional Variational Autoencoders (VAEs). |
Olga Iakovenko; Ivan Bondarenko; | arxiv-cs.SD | 2024-10-03 |
278 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. |
Olga Iakovenko; Ivan Bondarenko; Mariya Borovikova; Daniil Vodolazsky; | arxiv-cs.CL | 2024-10-03 |
279 | Automatic Speech Recognition for The Ika Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a cost-effective approach for developing Automatic Speech Recognition (ASR) models for low-resource languages like Ika. |
Uchenna Nzenwata; Daniel Ogbuigwe; | arxiv-cs.CL | 2024-10-01 |
280 | Recent Advances in Speech Language Models: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. |
WENQIAN CUI et. al. | arxiv-cs.CL | 2024-10-01 |
281 | AfriHuBERT: A Self-supervised Speech Representation Model for African Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. |
Jesujoba O. Alabi; Xuechen Liu; Dietrich Klakow; Junichi Yamagishi; | arxiv-cs.CL | 2024-09-30 |
282 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2024-09-27 |
283 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
Brian Yan; Vineel Pratap; Shinji Watanabe; Michael Auli; | arxiv-cs.CL | 2024-09-26 |
284 | Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. |
Robin Shing-Hei Yuen; Timothy Tin-Long Tse; Jian Zhu; | arxiv-cs.CL | 2024-09-25 |
285 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models within the context of continual multilingual learning. |
Andrés Piñeiro-Martín; Carmen García-Mateo; Laura Docío-Fernández; María del Carmen López-Pérez; Georg Rehm; | arxiv-cs.CL | 2024-09-25 |
286 | Spelling Correction Through Rewriting of Non-Autoregressive ASR Lattices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. |
LEONID VELIKOVICH et. al. | arxiv-cs.CL | 2024-09-24 |
287 | Evaluating The Robustness of ASR Systems in Adverse Acoustic Conditions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The effectiveness of automatic speech recognition (ASR) systems in environments with acoustic challenges directly influences their utility in a range of voice-activated … |
Sergei Katkov; Antonio Liotta; A. Vietti; | 2024 Fifth International Conference on Intelligent Data … | 2024-09-24 |
288 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). |
FENGRUN ZHANG et. al. | arxiv-cs.SD | 2024-09-24 |
289 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel training approach to enhance LLM performance in ASR tasks. |
Yang Yuhang; Peng Yizhou; Eng Siong Chng; Xionghu Zhong; | arxiv-cs.CL | 2024-09-24 |
290 | MultiMed: Multilingual Medical Speech Recognition Via Attention Encoder Decoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. |
KHAI LE-DUC et. al. | arxiv-cs.CL | 2024-09-21 |
291 | Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation with Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). |
IULIIA THORBECKE et. al. | arxiv-cs.CL | 2024-09-20 |
292 | A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. |
Georgios Sidiropoulos; Evangelos Kanoulas; | arxiv-cs.CL | 2024-09-20 |
293 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. |
Sebastião Quintas; Isabelle Ferrané; Thomas Pellegrini; | arxiv-cs.SD | 2024-09-19 |
294 | Personalized Speech Recognition for Children with Test-Time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We devised a novel ASR pipeline to apply unsupervised test-time adaptation (TTA) methods for child speech recognition, so that ASR models pre-trained on adult speech can be continuously adapted to each child speaker at test time without further human annotations. |
Zhonghao Shi; Harshvardhan Srivastava; Xuan Shi; Shrikanth Narayanan; Maja J. Matarić; | arxiv-cs.LG | 2024-09-19 |
295 | ASR Benchmarking: Need for A More Representative Conversational Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |
Gaurav Maheshwari; Dmitry Ivanov; Théo Johannet; Kevin El Haddad; | arxiv-cs.CL | 2024-09-18 |
296 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
UMBERTO CAPPELLAZZO et. al. | arxiv-cs.CV | 2024-09-18 |
297 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2024-09-18 |
298 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
CHIEN-CHUN WANG et. al. | arxiv-cs.SD | 2024-09-18 |
299 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. |
Haopeng Geng; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2024-09-18 |
300 | WER We Stand: Benchmarking Urdu ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. |
SAMEE ARIF et. al. | arxiv-cs.CL | 2024-09-17 |
301 | Chain-of-Thought Prompting for Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
KE HU et. al. | arxiv-cs.CL | 2024-09-17 |
302 | Speech Recognition for Analysis of Police Radio Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. |
Tejes Srivastava; Ju-Chieh Chou; Priyank Shroff; Karen Livescu; Christopher Graziul; | arxiv-cs.SD | 2024-09-16 |
303 | Augmenting Automatic Speech Recognition Models with Disfluency Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. |
Robin Amann; Zhaolin Li; Barbara Bruno; Jan Niehues; | arxiv-cs.CL | 2024-09-16 |
304 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. |
CHAO-HAN HUCK YANG et. al. | arxiv-cs.CL | 2024-09-15 |
305 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. |
SHAOJUN LI et. al. | arxiv-cs.SD | 2024-09-13 |
306 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
LINGWEI MENG et. al. | arxiv-cs.CL | 2024-09-13 |
307 | Exploring SSL Discrete Tokens for Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. |
MINGYU CUI et. al. | arxiv-cs.CL | 2024-09-13 |
308 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. |
Ahmed Adel Attia; Dorottya Demszky; Tolulope Ogunremi; Jing Liu; Carol Espy-Wilson; | arxiv-cs.CL | 2024-09-13 |
309 | M$^{3}$V: A Multi-modal Multi-view Approach for Device-Directed Speech Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task, introducing unimodal views and a text-audio alignment view into the network alongside the multi-modal view. |
ANNA WANG et. al. | arxiv-cs.SD | 2024-09-13 |
310 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-09-12 |
311 | The Faetar Benchmark: Speech Recognition in A Very Under-Resourced Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. |
MICHAEL ONG et. al. | arxiv-cs.CL | 2024-09-12 |
312 | WhisperNER: Unified Open Named Entity and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. |
GIL AYACHE et. al. | arxiv-cs.CL | 2024-09-12 |
313 | Enhancing CTC-Based Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). |
Hendrik Laux; Anke Schmeink; | arxiv-cs.CV | 2024-09-11 |
314 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Batthacharya; | arxiv-cs.SD | 2024-09-11 |
315 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. |
Jihyun Lee; Solee Im; Wonjun Lee; Gary Geunbae Lee; | arxiv-cs.CL | 2024-09-10 |
316 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; Elizabeth Sherly; | arxiv-cs.CL | 2024-09-04 |
317 | Quantification of Stylistic Differences in Human- and ASR-produced Transcripts of African American English Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). |
ANNIKA HEUSER et. al. | arxiv-cs.CL | 2024-09-04 |
318 | Comparing Discrete and Continuous Space LLMs for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. |
Yaoxun Xu; Shi-Xiong Zhang; Jianwei Yu; Zhiyong Wu; Dong Yu; | arxiv-cs.CL | 2024-09-01 |
319 | Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. |
Hao Shi; Yuan Gao; Zhaoheng Ni; Tatsuya Kawahara; | arxiv-cs.SD | 2024-09-01 |
320 | LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. |
ZENGRUI JIN et. al. | arxiv-cs.SD | 2024-09-01 |
321 | ProGRes: Prompted Generative Rescoring on ASR N-Best Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. |
Ada Defne Tur; Adel Moumen; Mirco Ravanelli; | arxiv-cs.CL | 2024-08-30 |
322 | Measuring The Accuracy of Automatic Speech Recognition Solutions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. |
Korbinian Kuhn; Verena Kersken; Benedikt Reuter; Niklas Egger; Gottfried Zimmermann; | arxiv-cs.CL | 2024-08-29 |
323 | Speech Recognition Transformers: Topological-lingualism Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The paper presents a comprehensive survey of transformer techniques oriented in speech modality. |
Shruti Singh; Muskaan Singh; Virender Kadyan; | arxiv-cs.CL | 2024-08-27 |
324 | Self-supervised Speech Representations Still Struggle with African American Vernacular English Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. |
KALVIN CHANG et. al. | arxiv-cs.CL | 2024-08-26 |
325 | Developing Vocal System Impaired Patient-aimed Voice Quality Assessment Approach Using ASR Representation-included Multiple Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. |
SHAOXIANG DANG et. al. | arxiv-cs.SD | 2024-08-22 |
326 | Towards Measuring Fairness in Speech Recognition: Fair-Speech Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. |
IRINA-ELENA VELICHE et. al. | arxiv-cs.AI | 2024-08-22 |
327 | The State of Commercial Automatic French Legal Speech Recognition Systems and Their Impact on Court Reporters Et Al Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We benchmark three ASR models, including commercial and open-source options, on their ability to recognize French legal speech using a curated dataset. Our study evaluates the performance of these systems using the Word Error Rate (WER) metric and introduces the Sonnex Distance to account for phonetic accuracy. |
Nicolas Garneau; Olivier Bolduc; | arxiv-cs.CL | 2024-08-21 |
328 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. |
Prashant Serai; Peidong Wang; Eric Fosler-Lussier; | arxiv-cs.AI | 2024-08-20 |
329 | Error-preserving Automatic Speech Recognition of Young English Learners' Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the mistakes made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their mistakes. |
JANICK MICHOT et. al. | acl | 2024-08-20 |
330 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | acl | 2024-08-20 |
331 | CopyNE: Better Contextual ASR By Copying Named Entities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. |
SHILIN ZHOU et. al. | acl | 2024-08-20 |
332 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. |
SHAOLEI ZHANG et. al. | acl | 2024-08-20 |
333 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | acl | 2024-08-20 |
334 | XCB: An Effective Contextual Biasing Approach to Bias Cross-lingual Phrases in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing(XCB) module. |
Xucheng Wan; Naijun Zheng; Kai Liu; Huan Zhou; | arxiv-cs.CL | 2024-08-20 |
335 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Non-autoregressive (NAR) decoding approach to solve the above problems. |
YANGZE LI et. al. | arxiv-cs.SD | 2024-08-18 |
336 | Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and … |
Yinghao Aaron Li; Xilin Jiang; Jordan Darefsky; Ge Zhu; N. Mesgarani; | ArXiv | 2024-08-13 |
337 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness Via Noise Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. |
Wonjun Lee; San Kim; Gary Geunbae Lee; | arxiv-cs.CL | 2024-08-12 |
338 | Audio Enhancement for Computer Audition — An Iterative Training Paradigm Using Sample Importance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. |
Manuel Milling; Shuo Liu; Andreas Triantafyllopoulos; Ilhan Aslan; Björn W. Schuller; | arxiv-cs.SD | 2024-08-12 |
339 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. |
Eunseop Yoon; Hee Suk Yoon; John Harvill; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2024-08-11 |
340 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present MooER, a LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads. |
JUNHAO XU et. al. | arxiv-cs.CL | 2024-08-09 |
341 | Robust Speech Enhancement Using Daubechies Wavelet Based Adaptive Wavelet Thresholding for The Development of Robust Automatic Speech Recognition: A Comprehensive Review Related Papers Related Patents Related Grants Related Venues Related Experts View |
Mahadevaswamy Shanthamallappa; | Wirel. Pers. Commun. | 2024-08-06 |
342 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. |
JAEYOUNG KIM et. al. | arxiv-cs.SD | 2024-08-05 |
343 | Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce a novel framework that diverges from typical second-pass rescoring methods. |
Yixuan Tang; Anthony K. H. Tung; | ijcai | 2024-08-03 |
344 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms Using Linguistic Features Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. |
PENG CHENG et. al. | arxiv-cs.CR | 2024-08-03 |
345 | MECOS: A Bilingual Manipuri-English Spontaneous Code-switching Speech Corpus for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
Naorem Karline Singh; Y. J. Chanu; Hoomexsun Pangsatabam; | Comput. Speech Lang. | 2024-08-01 |
346 | On The Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use the comparison of five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training. |
Nick Rossenbach; Ralf Schlüter; Sakriani Sakti; | arxiv-cs.CL | 2024-07-31 |
347 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. |
KOHEI MATSUURA et. al. | arxiv-cs.CL | 2024-07-31 |
348 | Improving Noisy Student Training for Low-resource Languages in End-to-End ASR Using CycleGAN and Inter-domain Losses Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial … |
C. Li; Ngoc Thang Vu; | ArXiv | 2024-07-26 |
349 | Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
HUKAI HUANG et. al. | arxiv-cs.CL | 2024-07-26 |
350 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite these advancements, they still struggle to accurately recognize domain specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. |
Jiwon Suh; Injae Na; Woohwan Jung; | arxiv-cs.CL | 2024-07-25 |
351 | On The Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). |
Benedikt Hilmes; Nick Rossenbach; Ralf Schlüter; | arxiv-cs.CL | 2024-07-25 |
352 | Coupling Speech Encoders with Downstream Text Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. |
Ciprian Chelba; Johan Schalkwyk; | arxiv-cs.CL | 2024-07-24 |
353 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Building upon the strength of modern large language models (LLMs), generative error correction (GEC) has emerged as a promising paradigm that can elevate the performance of modern … |
Rithik Sachdev; Zhong-Qiu Wang; Chao-Han Huck Yang; | arxiv-cs.CL | 2024-07-23 |
354 | Quantifying The Role of Textual Predictability in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic–phonetic modelling. |
Sean Robertson; Gerald Penn; Ewan Dunbar; | arxiv-cs.CL | 2024-07-23 |
355 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. |
Hazim Bukhari; Soham Deshmukh; Hira Dhamyal; Bhiksha Raj; Rita Singh; | arxiv-cs.SD | 2024-07-21 |
356 | Low-Resourced Speech Recognition for Iu Mien Language Via Weakly-Supervised Phoneme-based Multilingual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares three approaches for Iu Mien speech recognition. |
LUKUAN DONG et. al. | arxiv-cs.SD | 2024-07-18 |
357 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding By Provenance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic speech recognition (ASR) models trained on large amounts of audio data are now widely used to convert speech to written text in a variety of applications from video captioning to automated assistants used in healthcare and other domains. |
Changye Li; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-07-18 |
358 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present novel approaches that use a generative pretrained transformer (GPT) to identify paraphasias from transcripts as well as two end-to-end approaches that focus on modeling both automatic speech recognition (ASR) and paraphasia classification as multiple sequences vs. a single sequence. |
Matthew Perez; Aneesha Sampath; Minxue Niu; Emily Mower Provost; | arxiv-cs.CL | 2024-07-15 |
359 | Textless Dependency Parsing By Labeled Sequence Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although their effectiveness is shown in capturing acoustic features, it is unclear in capturing lexical knowledge. This paper proposes a textless method for dependency parsing, examining its effectiveness and limitations. |
Shunsuke Kando; Yusuke Miyao; Jason Naradowsky; Shinnosuke Takamichi; | arxiv-cs.CL | 2024-07-14 |
360 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer Based Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CUSIDE-T, which successfully adapts the CUSIDE method over the recurrent neural network transducer (RNN-T) ASR architecture, instead of being based on the CTC architecture. |
Wenbo Zhao; Ziwei Li; Chuan Yu; Zhijian Ou; | arxiv-cs.SD | 2024-07-14 |
361 | Empowering Whisper As A Joint Multi-Talker and Target-Talker Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2024-07-13 |
362 | HebDB: A Weakly Supervised Dataset for Hebrew Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. |
ARNON TURETZKY et. al. | arxiv-cs.CL | 2024-07-10 |
363 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study yields numerous significant findings that we discuss in this paper. |
Salima Mdhaffar; Haroun Elleuch; Fethi Bougares; Yannick Estève; | arxiv-cs.CL | 2024-07-05 |
364 | LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. |
HAECHAN KIM et. al. | arxiv-cs.CL | 2024-07-05 |
365 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. |
Vyas Raina; Mark Gales; | arxiv-cs.SD | 2024-07-05 |
366 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | arxiv-cs.CL | 2024-07-05 |
367 | Romanization Encoding For Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. |
WEN DING et. al. | arxiv-cs.CL | 2024-07-05 |
368 | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). |
KEYU AN et. al. | arxiv-cs.SD | 2024-07-04 |
369 | Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. |
Cong-Thanh Do; Shuhei Imai; Rama Doddipatla; Thomas Hain; | arxiv-cs.CL | 2024-07-04 |
370 | Improving Self-supervised Pre-training Using Accent-Specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. |
Darshan Prabhu; Abhishek Gupta; Omkar Nitsure; Preethi Jyothi; Sriram Ganapathy; | arxiv-cs.CL | 2024-07-04 |
371 | Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. |
Tiia Sildam; Andra Velve; Tanel Alumäe; | arxiv-cs.CL | 2024-07-04 |
372 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. |
Jinming Chen; Jingyi Fang; Yuanzhong Zheng; Yaoxuan Wang; Haojun Fei; | arxiv-cs.SD | 2024-07-03 |
373 | Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via … |
SHUJIE HU et. al. | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-07-03 |
374 | Natural Language Processing for Recognizing Bangla Speech with Regular and Regional Dialects: A Survey of Algorithms and Approaches Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Natural Language Processing (NLP) is one of the fundamental domains of Artificial Intelligence (AI). In this paper, we present a systematic review of NLP based research for … |
PARAMITA BASAK UPAMA et. al. | 2024 IEEE 48th Annual Computers, Software, and Applications … | 2024-07-02 |
375 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-07-01 |
376 | PaSCoNT – Parallel Speech Corpus of Northern-central Thai for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
SUPAWAT TAERUNGRUANG et. al. | Comput. Speech Lang. | 2024-07-01 |
377 | Less Is More: Accurate Speech Recognition & Translation Without Web-Scale Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that state-of-the art accuracy can be reached without relying on web-scale data. |
KRISHNA C. PUVVADA et. al. | arxiv-cs.CL | 2024-06-28 |
378 | Enhanced ASR Robustness to Packet Loss with A Front-End Adaptation Network Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose using a front-end adaptation network connected to a frozen ASR model. |
Yehoshua Dissen; Shiry Yonash; Israel Cohen; Joseph Keshet; | arxiv-cs.SD | 2024-06-27 |
379 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. |
ZHENG FANG et. al. | arxiv-cs.CR | 2024-06-27 |
380 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. |
Ahmed Heakl; Youssef Zaghloul; Mennatullah Ali; Rania Hossam; Walid Gomaa; | arxiv-cs.CL | 2024-06-26 |
381 | Automatic Speech Recognition for Hindi Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations. |
Anish Saha; A. G. Ramakrishnan; | arxiv-cs.CL | 2024-06-26 |
382 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. |
Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie; | arxiv-cs.SD | 2024-06-26 |
383 | Dynamic Data Pruning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. |
QIAO XIAO et. al. | arxiv-cs.CL | 2024-06-26 |
384 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. |
ADRIANA FERNANDEZ-LOPEZ et. al. | arxiv-cs.CV | 2024-06-25 |
385 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. |
VAN TUNG PHAM et. al. | arxiv-cs.LG | 2024-06-25 |
386 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Switch-Conformer-based MoE system named SC-MoE for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). We design a streaming MoE layer consisting of three language experts, corresponding to Mandarin, English, and blank, respectively, equipped with a language identification (LID) network with a Connectionist Temporal Classification (CTC) loss as a router in the encoder of SC-MoE, to achieve a real-time streaming CS ASR system. |
Shuaishuai Ye; Shunfei Chen; Xinhui Hu; Xinkang Xu; | arxiv-cs.SD | 2024-06-25 |
387 | FASA: A Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children’s speech data from many of the existing noisy children’s speech data. |
Dancheng Liu; Jinjun Xiong; | arxiv-cs.CL | 2024-06-25 |
388 | Sequential Editing for Lifelong Training of Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. |
Devang Kulshreshtha; Saket Dingliwal; Brady Houston; Nikolaos Pappas; Srikanth Ronanki; | arxiv-cs.CL | 2024-06-25 |
389 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech … |
SAI KONERU et. al. | ArXiv | 2024-06-24 |
390 | Exploring The Capability of Mamba in Speech Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. |
Koichi Miyazaki; Yoshiki Masuyama; Masato Murata; | arxiv-cs.SD | 2024-06-24 |
391 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present KIT’s offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. |
SAI KONERU et. al. | arxiv-cs.CL | 2024-06-24 |
392 | PI-Whisper: Designing An Adaptive and Incremental Automatic Speech Recognition System for Edge Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we show how the design of PI-Whisper allows for incremental adaptation of new characteristics without the need for repetitive retraining, enhances recognition capabilities, and improves equity and fairness across diverse speaker groups. |
AMIR NASSERELDINE et. al. | arxiv-cs.CL | 2024-06-21 |
393 | Perception of Phonological Assimilation By Neural Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). |
Charlotte Pouw; Marianne de Heer Kloots; Afra Alishahi; Willem Zuidema; | arxiv-cs.CL | 2024-06-21 |
394 | PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: As edge-based automatic speech recognition (ASR) technologies become increasingly prevalent for the development of intelligent and personalized assistants, three important … |
Amir Nassereldine; Dancheng Liu; Chenhui Xu; Jinjun Xiong; | ArXiv | 2024-06-21 |
395 | Massive End-to-end Speech Recognition Models with Time Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. |
WEIRAN WANG et. al. | naacl | 2024-06-20 |
396 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | naacl | 2024-06-20 |
397 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. |
Suyoung Kim; Jiyeon Hwang; Ho-Young Jung; | naacl | 2024-06-20 |
398 | Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. |
Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng; | arxiv-cs.AI | 2024-06-20 |
399 | Joint Vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. |
Alexander Blatt; Aravind Krishnan; Dietrich Klakow; | arxiv-cs.CL | 2024-06-19 |
400 | Children’s Speech Recognition Through Discrete Token Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate the integration of discrete speech tokens into children’s speech recognition systems as input without significantly degrading the ASR performance. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2024-06-19 |
401 | ManWav: The First Manchu ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. |
Jean Seo; Minha Kang; Sungjoo Byun; Sangah Lee; | arxiv-cs.CL | 2024-06-19 |
402 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | arxiv-cs.CL | 2024-06-18 |
403 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. |
Hayato Futami; Siddhant Arora; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe; | arxiv-cs.CL | 2024-06-18 |
404 | Bridging The Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. |
KUAN-CHEN WANG et. al. | arxiv-cs.SD | 2024-06-18 |
405 | Automatic Speech Recognition for Biomedical Data in Bengali Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR … |
Shariar Kabir; Nazmun Nahar; Shyamasree Saha; Mamunur Rashid; | ArXiv | 2024-06-16 |
406 | CoSTA: Code-Switched Speech Translation Using Aligned Speech-Text Interleaving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. |
Bhavani Shankar; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2024-06-16 |
407 | Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. |
Wenhan Yao; Jiangkun Yang; Yongqiang He; Jia Liu; Weiping Wen; | arxiv-cs.SD | 2024-06-16 |
408 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. |
Haoyu Wang; Guoqiang Hu; Guodong Lin; Wei-Qiang Zhang; Jian Li; | arxiv-cs.SD | 2024-06-14 |
409 | An Efficient Text Augmentation Approach for Contextualized Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. |
Naijun Zheng; Xucheng Wan; Kai Liu; Ziqing Du; Zhou Huan; | arxiv-cs.SD | 2024-06-14 |
410 | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the … |
FRANK SEIDE et. al. | ArXiv | 2024-06-13 |
411 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a transcription-free method for joint training using only audio signals. |
WILLIAM RAVENSCROFT et. al. | arxiv-cs.SD | 2024-06-13 |
412 | LASER: Learning By Aligning Self-supervised Representations of Speech for Improving Content-related Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named LASER: Learning by Aligning Self-supervised Representations is presented. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-06-13 |
413 | Speech ReaLLM — Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. |
FRANK SEIDE et. al. | arxiv-cs.CL | 2024-06-13 |
414 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
ZIYANG ZHUANG et. al. | arxiv-cs.SD | 2024-06-13 |
415 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | arxiv-cs.CL | 2024-06-13 |
416 | The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of … |
SHAREEF BABU KALLURI et. al. | ArXiv | 2024-06-13 |
417 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. |
JIATONG SHI et. al. | arxiv-cs.SD | 2024-06-12 |
418 | Training Data Augmentation for Dysarthric Automatic Speech Recognition By Text-to-Dysarthric-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. |
Wing-Zin Leung; Mattias Cross; Anton Ragni; Stefan Goetze; | arxiv-cs.SD | 2024-06-12 |
419 | Improving Child Speech Recognition with Augmented Child-like Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied … |
Yuanyuan Zhang; Zhengjun Yue; T. Patel; O. Scharenborg; | ArXiv | 2024-06-12 |
420 | Towards Unsupervised Speech Recognition Without Pronunciation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. |
JUNRUI NI et. al. | arxiv-cs.CL | 2024-06-12 |
421 | PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. |
TRANG LE et. al. | arxiv-cs.CL | 2024-06-11 |
422 | AS-70: A Mandarin Stuttered Speech Dataset for Automatic Speech Recognition and Stuttering Event Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. |
RONG GONG et. al. | arxiv-cs.SD | 2024-06-11 |
423 | Reading Miscue Detection in Primary School Through Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We found that Hubert Large finetuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). |
Lingyun Gao; Cristian Tejedor-Garcia; Helmer Strik; Catia Cucchiarini; | arxiv-cs.CL | 2024-06-11 |
424 | MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling Methods for Learning Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. |
Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah; | arxiv-cs.CL | 2024-06-09 |
425 | LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of … |
ZHESHU SONG et. al. | ArXiv | 2024-06-07 |
426 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2024-06-06 |
427 | Hypernetworks for Personalizing ASR to Atypical Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. |
Max Müller-Eberstein; Dianna Yee; Karren Yang; Gautam Varma Mantena; Colin Lea; | arxiv-cs.LG | 2024-06-06 |
428 | Error-preserving Automatic Speech Recognition of Young English Learners’ Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. |
JANICK MICHOT et. al. | arxiv-cs.CL | 2024-06-05 |
429 | Text Injection for Neural Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes contextual text injection (CTI) to enhance contextual ASR. |
ZHONG MENG et. al. | arxiv-cs.CL | 2024-06-05 |
430 | Discrete Multimodal Transformers with A Pretrained Large Language Model for Mixed-Supervision Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). |
VIET ANH TRINH et. al. | arxiv-cs.CL | 2024-06-04 |
431 | Efficiently Train ASR Models That Memorize Less and Perform Better with Per-core Clipping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. |
LUN WANG et. al. | arxiv-cs.CR | 2024-06-04 |
432 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition Via Weakly Phonetic Supervision Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper explores the approach of pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. |
Saierdaer Yusuyin; Te Ma; Hao Huang; Wenbo Zhao; Zhijian Ou; | arxiv-cs.SD | 2024-06-04 |
433 | Speaking of Accent: A Content Analysis of Accent Misconceptions in ASR Research Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) researchers are working to address the differing transcription performance of ASR by accent or dialect. However, research often has a limited … |
Kerri Prinos; Neal Patwari; Cathleen A. Power; | Proceedings of the 2024 ACM Conference on Fairness, … | 2024-06-03 |
434 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. |
Ara Yeroyan; Nikolay Karpov; | arxiv-cs.CL | 2024-06-03 |
435 | Pass The Butter: A Study on Desktop-classic Multitasking Robotic Arm Based on Advanced YOLOv7 and BERT Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to meet the current societal demand for service robot technology, this study proposes using a miniaturized desktop-level robot (by ROS) as a carrier, locally deploying a natural language model (NLP-BERT), and integrating visual recognition (CV-YOLO) and speech recognition technology (ASR-Whisper) as inputs to achieve autonomous decision-making and rational action by the desktop robot. |
HAOHUA QUE et. al. | arxiv-cs.RO | 2024-05-27 |
436 | Denoising LM: Pushing The Limits of Error Correction Models for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. |
ZIJIN GU et. al. | arxiv-cs.LG | 2024-05-24 |
437 | Let’s Fuse Step By Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Generative Fusion Decoding (GFD), a novel shallow fusion framework designed to integrate large language models (LLMs) into cross-modal text recognition systems for automatic speech recognition (ASR) and optical character recognition (OCR). |
CHAN-JAN HSU et. al. | arxiv-cs.CL | 2024-05-23 |
438 | You Don’t Understand Me!: Comparing ASR Results for L1 and L2 Speakers of Swedish Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. |
Ronald Cumbal; Birger Moell; Jose Lopes; Olof Engwall; | arxiv-cs.CL | 2024-05-22 |
439 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This prevents human users from interrupting the robot, which limits speech-based human-robot interaction. To enable a more natural interaction which allows for such interruptions, we propose an audio processing pipeline for filtering out the robot’s ego speech using only a single-channel microphone. |
Yue Li; Florian A. Kunneman; Koen V. Hindriks; | arxiv-cs.HC | 2024-05-22 |
440 | Linguistic Analysis of Human-computer Interaction Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This article reviews recent literature investigating speech variation in production and comprehension during spoken language communication between humans and devices. Human speech … |
Georgia Zellou; Nicole Holliday; | Frontiers Comput. Sci. | 2024-05-21 |
441 | Non-autoregressive Real-time Accent Conversion Model with Voice Cloning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We have developed the non-autoregressive model for real-time accent conversion with voice cloning. |
Vladimir Nechaev; Sergey Kosyakov; | arxiv-cs.SD | 2024-05-21 |
442 | A Study on Speech Recognition By A Neural Network Based on English Speech Feature Parameters Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this study, from the perspective of English speech feature parameters, two feature parameters, the mel-frequency cepstral coefficient (MFCC) and filter bank (Fbank), were … |
Congmin Mao; Sujing Liu; | J. Adv. Comput. Intell. Intell. Informatics | 2024-05-20 |
443 | Listen Again and Choose The Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. |
YUCHEN HU et. al. | arxiv-cs.CL | 2024-05-16 |
444 | Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. … |
Ahmed Adel Attia; Dorottya Demszky; Tolúlopé Ògúnrèmí; Jing Liu; Carol Y. Espy-Wilson; | ArXiv | 2024-05-15 |
445 | Towards Evaluating The Robustness of Automatic Speech Recognition Systems Via Audio Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an attack on ASR systems based on user-customized style transfer. |
WEIFEI JIN et. al. | arxiv-cs.SD | 2024-05-15 |
446 | I Know What You Mean: Context-Aware Recognition to Enhance Speech-Based Games Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent advances in language processing and speech recognition open up a large opportunity for video game companies to embrace voice interaction as an intuitive feature and … |
Nima Zargham; Mohamed Lamine Fetni; Laura Spillner; Thomas Muender; Rainer Malaka; | Proceedings of the CHI Conference on Human Factors in … | 2024-05-11 |
447 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a simple yet effective method to learn a universal acoustic realization of Whisper’s <|endoftext|> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively ‘muting’ the model. |
Vyas Raina; Rao Ma; Charles McGhee; Kate Knill; Mark Gales; | arxiv-cs.CL | 2024-05-09 |
448 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | arxiv-cs.CL | 2024-05-09 |
449 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. |
JINGGUANG TIAN et. al. | arxiv-cs.SD | 2024-05-08 |
450 | Open Implementation and Study of BEST-RQ for Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. |
Ryan Whetten; Titouan Parcollet; Marco Dinarelli; Yannick Estève; | arxiv-cs.CL | 2024-05-07 |
451 | Mixat: A Data Set of Bilingual Emirati-English Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. |
Maryam Al Ali; Hanan Aldarmaki; | arxiv-cs.CL | 2024-05-04 |
452 | Unveiling The Potential of LLM-Based ASR on Chinese Open-Source Datasets Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. |
XUELONG GENG et. al. | arxiv-cs.SD | 2024-05-03 |
453 | Integrated End-to-End Automatic Speech Recognition for Agglutinative Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new … |
A. Bekarystankyzy; O. Mamyrbayev; Tolganay Anarbekova; | ACM Transactions on Asian and Low-Resource Language … | 2024-05-03 |
454 | Towards Fair and Inclusive Speech Recognition for Stuttering: Community-led Chinese Stuttered Speech Dataset Creation and Benchmarking Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Despite the widespread adoption of Automatic Speech Recognition (ASR) models in voice-operated products and conversational AI agents, current ASR models perform poorly for people … |
Qisheng Li; Shaomei Wu; | Extended Abstracts of the CHI Conference on Human Factors … | 2024-05-02 |
455 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. |
FRANCISCO TEIXEIRA et. al. | arxiv-cs.LG | 2024-05-02 |
456 | Low-resource Speech Recognition and Dialect Identification of Irish in A Multi-task Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). |
Liam Lonergan; Mengjie Qian; Neasa Ní Chiaráin; Christer Gobl; Ailbhe Ní Chasaide; | arxiv-cs.CL | 2024-05-02 |
457 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, current methods require much time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called AFTER, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. |
Dongyuan Li; Ying Zhang; Yusong Wang; Funakoshi Kataro; Manabu Okumura; | arxiv-cs.SD | 2024-05-01 |
458 | Efficient Compression of Multitask Multilingual Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. |
Thomas Palmeira Ferraz; | arxiv-cs.CL | 2024-05-01 |
459 | Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Confidence scores of automatic speech recognition (ASR) outputs are often inadequately communicated, preventing its seamless integration into analytical workflows. In this paper, we introduce ConFides, a visual analytic system developed in collaboration with intelligence analysts to address this issue. |
Sunwoo Ha; Chaehun Lim; R. Jordan Crouser; Alvitta Ottley; | arxiv-cs.HC | 2024-04-30 |
460 | Toward Robust ASR System Against Audio Adversarial Examples Using Agitated Logit Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples, which aim at deceiving ASR systems by adding perturbations to benign speech signals. These … |
N. Park; Jong Kim; | ACM Transactions on Privacy and Security | 2024-04-26 |
461 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. |
RUBEN JANSSENS et. al. | arxiv-cs.CL | 2024-04-26 |
462 | Automatic Speech Recognition System-Independent Word Error Rate Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. |
Chanho Park; Mingjie Chen; Thomas Hain; | arxiv-cs.CL | 2024-04-25 |
463 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. |
Chihiro Taguchi; Jefferson Saransig; Dayana Velásquez; David Chiang; | arxiv-cs.CL | 2024-04-23 |
464 | Analyzing The Impact of HF-Specific Signal Degradation on Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Analog radio still constitutes an important fraction of military communications. In the context of a fully digitized radio frequency (RF) reconnaissance chain, this leads to … |
Fabian Fritz; Alessia Cornaggia; Lukas Henneke; Frank Kurth; Kevin Wilkinghoff; | 2024 International Conference on Military Communication and … | 2024-04-23 |
465 | Enhancing ASR Performance Through Relative Word Frequency in OCR and Normal Word Frequency Analysis Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With the growing interest in Conversational AI, a system that enables machines to engage in human-like dialogues, there has been an increased focus on Automatic Speech Recognition … |
KYUDAN JUNG et. al. | 2024 IEEE 6th International Conference on AI Circuits and … | 2024-04-22 |
466 | Semantically Corrected Amharic Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. |
Samuael Adnew; Paul Pu Liang; | arxiv-cs.CL | 2024-04-20 |
467 | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. |
Ye Bai; Chenxing Li; Hao Li; Yuanyuan Zhao; Xiaorui Wang; | arxiv-cs.SD | 2024-04-17 |
468 | Keep Decoding Parallel With Effective Knowledge Distillation From Language Models To End-To-End Speech Recognisers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
M. Hentschel; Y. Nishikawa; T. Komatsu; Y. Fujita; | icassp | 2024-04-15 |
469 | Unsupervised Speech Recognition with N-skipgram and Positional Unigram Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. |
L. Wang; M. Hasegawa-Johnson; C. D. Yoo; | icassp | 2024-04-15 |
470 | Automatic Channel Selection and Spatial Feature Integration for Multi-Channel Speech Recognition Across Various Array Topologies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
B. MU et. al. | icassp | 2024-04-15 |
471 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning with Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
K. Everson; | icassp | 2024-04-15 |
472 | Automatic Speech Recognition Tuned for Child Speech in The Classroom Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By tuning OpenAI’s Whisper model we achieve a 38% relative reduction in word error rate (WER) to 9.2% on the public MyST dataset of child speech – the lowest yet reported – and a 7% relative reduction to reach 54% WER on a more challenging classroom speech dataset (ISAT). |
R. Southwell; | icassp | 2024-04-15 |
473 | Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes two novel variants of neural reprogramming to enhance wake word recognition in streaming end-to-end ASR models without updating model weights. |
P. -J. Ku; | icassp | 2024-04-15 |
474 | SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
C. -J. Lin; | icassp | 2024-04-15 |
475 | Synthetic Conversations Improve Multi-Talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a novel methodology called Systematic Synthetic Conversations (SSC), which leverages conventional ASR datasets to help an end-to-end (E2E) multi-talker ASR model establish new state-of-the-art results across synthetic and authentic multi-talker datasets. |
T. -B. Nguyen; A. Waibel; | icassp | 2024-04-15 |
476 | Contextual Biasing Methods for Improving Rare Word Detection in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Typically, such words have limited occurrences in training data, making it impractical to retrain the ASR system. This paper explores innovative word-boosting techniques to improve the detection rate of such rare words in the ASR hypotheses for the ATC domain. |
M. Bhattacharjee; | icassp | 2024-04-15 |
477 | Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the learning targets of automatic speech recognition (ASR) and other attribute recognition are apparently in conflict. Therefore, we propose to employ different adaptation methods for different tasks in multiple finetuning stages. |
Y. Gao; H. Shi; C. Chu; T. Kawahara; | icassp | 2024-04-15 |
478 | Leveraging Data Collection and Unsupervised Learning for Code-Switched Tunisian Arabic Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the aforementioned ASR challenge, focusing on the Tunisian dialect. |
A. A. B. Abdallah; A. Kabboudi; A. Kanoun; S. Zaiem; | icassp | 2024-04-15 |
479 | Stable Distillation: Regularizing Continued Pre-Training for Low-Resource Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes Stable Distillation, a simple and novel approach for SSL-based continued pre-training that boosts ASR performance in the target domain where both labeled and unlabeled data are limited. |
A. Seth; S. Ghosh; S. Umesh; D. Manocha; | icassp | 2024-04-15 |
480 | Multilingual Distilwhisper: Efficient Distillation of Multi-Task Speech Models Via Language-Specific Experts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. |
T. P. Ferraz; M. Zanon Boito; C. Brun; V. Nikoulina; | icassp | 2024-04-15 |
481 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
S. Shen; Y. Gao; F. Liu; H. Wang; A. Zhou; | icassp | 2024-04-15 |
482 | Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. |
S. Papi; | icassp | 2024-04-15 |
483 | Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore an alternative approach by adapting a pretrained LLM to speech. |
S. Ling; | icassp | 2024-04-15 |
484 | Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). |
Y. Sudo; M. Shakeel; Y. Fukumoto; Y. Peng; S. Watanabe; | icassp | 2024-04-15 |
485 | Large Language Models As A Proxy For Human Evaluation In Assessing The Comprehensibility Of Disordered Speech Transcription Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. In this work, we tune and evaluate large language models for this task and find them to be a much better proxy for human evaluators than other metrics commonly used. |
K. Tomanek; | icassp | 2024-04-15 |
486 | Noise Masking Attacks and Defenses for Pretrained Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
M. Jagielski; O. Thakkar; L. Wang; | icassp | 2024-04-15 |
487 | BRAVEn: Improving Self-supervised Pre-training for Visual and Auditory Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
A. Haliassos; A. Zinonos; R. Mira; S. Petridis; M. Pantic; | icassp | 2024-04-15 |
488 | Augmenting Conformers With Structured State-Space Sequence Models For Online Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. |
H. SHAN et al. | icassp | 2024-04-15 |
489 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel bilevel optimization-based training approach to training acoustic models for automatic speech recognition (ASR) tasks that we term bi-level joint unsupervised and supervised training (BL-JUST). |
A. F. M. SAIF et al. | icassp | 2024-04-15 |
490 | Corpus Synthesis for Zero-Shot ASR Domain Adaptation Using Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. |
H. Su; | icassp | 2024-04-15 |
491 | Speech Collage: Code-Switched Audio Generation By Collaging Monolingual Corpora Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. |
A. Hussein; | icassp | 2024-04-15 |
492 | Sparsely Shared LoRA on Whisper for Child Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter-composition types of PEFT approaches, such as LoRA and BitFit, are investigated, as they do not incur extra inference costs. |
W. Liu; Y. Qin; Z. Peng; T. Lee; | icassp | 2024-04-15 |
493 | Leveraging Large Language Models for Exploiting ASR Uncertainty Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. |
P. Dighe; | icassp | 2024-04-15 |
494 | Cross-Modal Parallel Training for Improving End-to-end Accented Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a Cross-modal Parallel Training (CPT) approach for improving the accent robustness of state-of-the-art Conformer-Transducer (Conformer-T) ASR system. |
R. Dong; Y. Li; D. Xu; Y. Long; | icassp | 2024-04-15 |
495 | Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A common solution has been to formulate long-form speech processing into a streaming problem, only using limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. |
W. Chen; T. Kano; A. Ogawa; M. Delcroix; S. Watanabe; | icassp | 2024-04-15 |
496 | FastInject: Injecting Unpaired Text Data Into CTC-Based ASR Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, E2E ASR models trained on paired speech-text data often suffer from domain shifts from training to testing. To alleviate this issue, this paper proposes a flat-start joint training method, named FastInject, which efficiently injects multi-domain unpaired text data into CTC-based ASR training. |
K. Deng; P. C. Woodland; | icassp | 2024-04-15 |
497 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
H. Wang; P. Guo; P. Zhou; L. Xie; | icassp | 2024-04-15 |
498 | Exploring Adapters with Conformers for Children’s Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we explore an alternative approach known as Adapter transfer. |
T. Rolland; A. Abad; | icassp | 2024-04-15 |
499 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. Ronny Huang; | icassp | 2024-04-15 |
500 | Significant ASR Error Detection for Conversational Voice Assistants Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a system that can determine, to a high degree of accuracy, whether the semantics of a predicted and reference transcript are significantly different. |
J. Harvill; | icassp | 2024-04-15 |
501 | Towards Automatic Data Augmentation for Disordered Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. |
Z. Jin; | icassp | 2024-04-15 |
502 | Dementia Assessment Using Mandarin Speech with An Attention-Based Speech Recognition Encoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. |
Z. -J. LIN et al. | icassp | 2024-04-15 |
503 | Task Vector Algebra for ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose two novel applications of task vectors to ASR. |
G. Ramesh; K. Audhkhasi; B. Ramabhadran; | icassp | 2024-04-15 |
504 | Improving Attention-Based End-to-End Speech Recognition By Monotonic Alignment Attention Matrix Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the contrary, some studies have shown that for non-streaming attention-based models, monotonic alignment is beneficial to model performance. Based on this motivation, we propose the enhanced Gaussian Monotonic Alignment (e-GMA), which reduces the difficulty of learning monotonic alignment, and the reconstructed attention matrix leads to an improved accuracy in ASR tasks. |
Z. Zhuang; | icassp | 2024-04-15 |
505 | LITEVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
H. LAUX et al. | icassp | 2024-04-15 |
506 | Improved Children’s Automatic Speech Recognition Combining Adapters and Synthetic Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we use Adapters to handle the domain mismatch when fine-tuning with TTS data. |
T. Rolland; A. Abad; | icassp | 2024-04-15 |
507 | LCB-Net: Long-Context Biasing for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
F. Yu; H. Wang; X. Shi; S. Zhang; | icassp | 2024-04-15 |
508 | Connecting Speech Encoder and Large Language Model for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a comparative study of three commonly used structures as connectors, including fully connected layers, multi-head cross-attention, and Q-Former. |
W. Yu; | icassp | 2024-04-15 |
509 | AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as suppression of cross-talk speech from non-target directions and noise. When ASR work is part of a broader system-development process, one may be faced with changes to microphone geometries as system development progresses. This paper aims to make multi-channel ASR insensitive to limited variations of microphone-array geometry. |
J. Lin; | icassp | 2024-04-15 |
510 | Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). |
J. Xie; | icassp | 2024-04-15 |
511 | Task Oriented Dialogue As A Catalyst for Self-Supervised Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. |
D. M. Chan; S. Ghosh; H. Tulsiani; A. Rastrow; B. Hoffmeister; | icassp | 2024-04-15 |
512 | Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. |
Y. Yang; | icassp | 2024-04-15 |
513 | Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
H. Wang; | icassp | 2024-04-15 |
514 | Loss Masking Is Not Needed In Decoder-Only Transformer For Discrete-Token-Based ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to model speech tokens in an autoregressive way, similar to text. |
Q. Chen; | icassp | 2024-04-15 |
515 | Prompting Large Language Models with Speech Recognition Abilities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we extend the capabilities of LLM by directly attaching a small audio encoder allowing it to perform speech recognition. |
Y. Fathullah; | icassp | 2024-04-15 |
516 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
J. He; X. Shi; X. Li; T. Toda; | icassp | 2024-04-15 |
517 | VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. |
S. MAITI et al. | icassp | 2024-04-15 |
518 | Multi-Modality Speech Recognition Driven By Background Visual Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we combined the AVNS dataset (providing background sound) with the largest benchmark LRS3 dataset (providing target speech) to create adverse noise conditions for the AVSR model. |
C. Luo; Y. Liu; W. Sun; Z. Sun; | icassp | 2024-04-15 |
519 | Build A 50+ Hours Chinese Mandarin Corpus for Children’s Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The purpose of our research is to establish a children’s speech corpus for children’s speech recognition. |
H. Xu; J. Yang; J. Wang; W. Hu; | icassp | 2024-04-15 |
520 | Concealing Medical Condition By Node Toggling in ASR for Dementia Patients Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on learning ASR for dementia patients without revealing their medical condition. |
W. -T. Hsu; C. -P. Chen; C. -C. Lee; | icassp | 2024-04-15 |
521 | Boosting End-to-End Multilingual Phoneme Recognition Through Exploiting Universal Speech Attributes Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. |
H. Yen; S. M. Siniscalchi; C. -H. Lee; | icassp | 2024-04-15 |
522 | Enhancing Code-Switching Speech Recognition With Interactive Language Biases Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The interaction between various resolutions of language biases is subsequently explored in this work. |
H. Liu; L. P. Garcia; X. Zhang; A. W. H. Khong; S. Khudanpur; | icassp | 2024-04-15 |
523 | Extending Whisper with Prompt Tuning to Target-Speaker ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
H. Ma; Z. Peng; M. Shao; J. Li; J. Liu; | icassp | 2024-04-15 |
524 | Personalization of CTC-Based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. |
Z. Lei; | icassp | 2024-04-15 |
525 | Inappropriate Pause Detection in Dysarthric Speech Using Large-Scale Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
J. Lee; Y. Choi; T. -J. Song; M. -W. Koo; | icassp | 2024-04-15 |
526 | SCORE: Self-Supervised Correspondence Fine-Tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
A. Meghanani; T. Hain; | icassp | 2024-04-15 |
527 | Are Soft Prompts Good Zero-Shot Learners for Speech Recognition? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). |
D. Ng; | icassp | 2024-04-15 |
528 | ViLaS: Exploring The Effects of Vision and Language Context in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition. |
Z. Ni; | icassp | 2024-04-15 |
529 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an approach, that builds on a pre-trained ASR model and extends it with an adaptive upstream module, that fuses audio and visual information. |
C. Simic; T. Bocklet; | icassp | 2024-04-15 |
530 | Correction Focused Language Model Training For Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. |
Y. Ma; Z. Liu; O. Kalinli; | icassp | 2024-04-15 |
531 | Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, when the target speaker is not accurately separated, ASR models face limitations in reaching their peak performance. To address this issue, we propose a speaker-adaptive ASR framework that possesses more implicit target speaker enhancement capability by efficiently jointly optimizing speaker recognition (SR) and ASR models. |
M. Wu; | icassp | 2024-04-15 |
532 | PromptASR for Contextualized ASR with Controllable Style Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. |
X. Yang; | icassp | 2024-04-15 |
533 | USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. |
S. Ding; | icassp | 2024-04-15 |
534 | Improving Kinyarwanda Speech Recognition Via Semi-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we empirically show that using self-supervised pretraining, following a curriculum schedule during fine-tuning and using semi-supervised learning improve speech recognition for Kinyarwanda. |
A. Nzeyimana; | icassp | 2024-04-15 |
535 | Small-Footprint Automatic Speech Recognition System Using Two-Stage Transfer Learning Based Symmetrized Ternary Weight Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional automatic speech recognition (ASR) models face challenges when deployed on edge devices due to their high computational requirements and storage demands. To address this issue, we present a novel ASR system specifically designed for edge applications, encompassing both keyword spotting (KWS) and speaker verification (SV) functionalities with on-chip learning for speaker registration. |
X. Zhang; H. Kou; C. Xia; H. Cai; B. Liu; | icassp | 2024-04-15 |
536 | AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Adaptive Maximum Entropy Regularization (AdaMER), a technique that can modulate the impact of entropy regularization throughout the training process. |
S. EOM et al. | icassp | 2024-04-15 |
537 | Extending Large Language Models for Speech and Audio Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, automatic speech recognition (ASR) and automatic audio captioning (AAC) are often achieved with separate systems, resulting in incomplete auditory perception abilities. To fill in these gaps, in this paper, we present the first study that achieves both ASR and AAC by connecting an LLM with auditory encoders. |
C. Tang; | icassp | 2024-04-15 |
538 | SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. |
H. WANG et al. | icassp | 2024-04-15 |
539 | Accent-Specific Vector Quantization for Joint Unsupervised and Supervised Training in Accent Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: How to effectively use limited supervised accent data to improve the accented ASR is of paramount importance. In this work, we propose an accent-specific quantization for joint unsupervised and supervised training (AQ-JUST) of end-to-end ASR models to address this issue. |
L. Li; Y. Li; D. Xu; H. Wei; Y. Long; | icassp | 2024-04-15 |
540 | SALM: Speech-Augmented Language Model with In-Context Learning for Speech Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. |
Z. Chen; | icassp | 2024-04-15 |
541 | A Study on The Adverse Impact of Synthetic Speech on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, synthetic speech is prone to being mixed with real human speech as part of the noise and recorded by the microphone, which degrades speech recognition performance. To address this issue, we propose different methods to study the adverse impact of synthetic speech on speech recognition, thereby enhancing its robustness. |
J. Huang; Y. Bai; Y. Cai; W. Bian; | icassp | 2024-04-15 |
542 | Can Whisper Perform Speech-Based In-Context Learning? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. |
S. Wang; C. -H. Yang; J. Wu; C. Zhang; | icassp | 2024-04-15 |
543 | Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, there is still substantial room for improvement, owing to the complex computation of visual modules and the ineffective fusion of audio-visual modalities. To eliminate these drawbacks, we propose a down-up sampling-based AVSR model (Hourglass-AVSR) that enjoys high efficiency and performance, whose time length is scaled during the intermediate processing, resembling an hourglass. |
F. Yu; H. Wang; Z. Ma; S. Zhang; | icassp | 2024-04-15 |
544 | One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). |
S. Cornell; J. -W. Jung; S. Watanabe; S. Squartini; | icassp | 2024-04-15 |
545 | End-to-End Speech Translation with Mutual Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we find that triple-task MTL (ST+MT+ASR) suffers from a knowledge transfer limitation that leads to performance stagnation compared with dual-task MTL (ST+MT or ST+ASR). |
H. Wang; Z. Xue; Y. Lei; D. Xiong; | icassp | 2024-04-15 |
546 | Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the E2E architecture doesn’t explicitly address the modeling of overlapping speech areas, potentially limiting the model’s ability to generalize. To tackle this issue, we introduce two approaches: overlap-aware encoding method and monotonic attention loss. |
T. LI et al. | icassp | 2024-04-15 |
547 | Attention-Guided Adaptation for Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. |
B. Aditya; M. Rohmatillah; L. -H. Tai; J. -T. Chien; | icassp | 2024-04-15 |
548 | Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we attempt to build an SLU system by integrating information from two modalities, i.e., speech and text, and concurrently optimizing the associated tasks. |
E. Kim; Y. Tang; T. Ki; D. Neelagiri; V. R. Apsingek; | icassp | 2024-04-15 |
549 | Learning Speech Representation from Contrastive Token-Acoustic Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named Contrastive Token-Acoustic Pretraining (CTAP), which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. |
C. Qiang; | icassp | 2024-04-15 |
550 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
T. -T. Yang; H. -W. Wang; Y. -C. Wang; C. -H. Lin; B. Chen; | icassp | 2024-04-15 |
551 | CSNet: Contrastive Siamese Network for Robust SLU Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a siamese network with contrastive learning to enhance SLU effects. |
H. Yang; M. Zhang; D. Wei; J. Guo; | icassp | 2024-04-15 |
552 | Text-Only Unsupervised Domain Adaptation for Neural Transducer-Based ASR Personalization Using Synthesized Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we explore the problem of personalization from a domain adaptation perspective and highlight the potential risk of overfitting associated with synthesized speech. |
D. -H. Kim; J. -H. Lee; J. -H. Chang; | icassp | 2024-04-15 |
553 | Folding Attention: Memory and Power Optimization for On-Device Transformer-Based Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. |
Y. Li; | icassp | 2024-04-15 |
554 | Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, generalized LoRA is used to refine the state-of-the-art cascaded conformer transducer model. |
A. Baby; G. Joseph; S. Singh; | icassp | 2024-04-15 |
555 | How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. |
K. Iwamoto; | icassp | 2024-04-15 |
556 | Updated Corpora and Benchmarks for Long-Form Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we re-release three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli) with updated transcription and alignments to enable their use for long-form ASR research. |
J. D. FOX et al. | icassp | 2024-04-15 |
557 | Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a multitask alternative to the joint training approach. |
S. Kumar; | icassp | 2024-04-15 |
558 | Investigating End-to-End ASR Architectures for Long Form Audio Transcription Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audio. |
N. R. Koluguri; | icassp | 2024-04-15 |
559 | Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: For voice assistant systems, personalizing automated speech recognition (ASR) to a customer is the proverbial holy grail. Careful selection of hyper-parameters will be necessary … |
Arun Baby; George Joseph; Shatrughan Singh; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
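Generalized low-rank adaptation builds on the LoRA idea: the pretrained weight matrix is frozen and only a low-rank product B·A is learned as an additive update, so per-speaker personalisation touches very few parameters. A minimal pure-Python sketch of the effective adapted weight (shapes, names, and values are illustrative assumptions, not taken from the paper):

```python
def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha=1.0):
    """Effective weight of a LoRA-adapted layer: W + alpha * (B @ A).

    W: frozen pretrained weight (d_out x d_in)
    A: trainable (r x d_in), B: trainable (d_out x r), with r << min(d_out, d_in)
    """
    BA = matmul(B, A)
    return [[w + alpha * d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]

# With B initialised to zero (the usual LoRA choice), the adapted
# layer starts out identical to the frozen base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]           # rank r = 1
B = [[0.0], [0.0]]
assert lora_weight(W, A, B) == W
```

Zero-initialising B is what makes this form of personalisation safe to deploy: before any speaker data is seen, the model behaves exactly like the base ASR system.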
560 | Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper proposes two novel variants of neural reprogramming to enhance wake word recognition in streaming end-to-end ASR models without updating model weights. The first, … |
PIN-JUI KU et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
561 | Synthetic Conversations Improve Multi-Talker ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In recent times, automatic speech recognition (ASR) has seen remarkable progress, particularly in recognizing dominant speakers. Nevertheless, the challenge of multi-talker … |
Thai-Binh Nguyen; Alexander Waibel; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
562 | End-to-End Speech Translation with Mutual Knowledge Distillation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-task learning (MTL) is widely used to improve end-to-end speech translation (ST), which implicitly transfers knowledge from auxiliary automatic speech recognition (ASR) … |
Hao Wang; Zhengshan Xue; Yikun Lei; Deyi Xiong; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
563 | Task Vector Algebra for ASR Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Vector representations of text and speech signals such as word2vec and wav2vec are used commonly in automatic speech recognition (ASR) and spoken language understanding systems. … |
Gowtham Ramesh; Kartik Audhkhasi; B. Ramabhadran; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
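Task-vector algebra treats a finetuned model as base weights plus a task vector, the element-wise weight difference, which can then be scaled, summed, or negated to compose or remove capabilities without retraining. A minimal sketch over flat weight dictionaries (an illustrative assumption; real ASR checkpoints hold tensors per layer, and the model names below are hypothetical):

```python
def task_vector(finetuned, base):
    """Task vector: element-wise difference between finetuned and base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Add a scaled sum of task vectors to the base weights."""
    merged = dict(base)
    for tv in vectors:
        for k, v in tv.items():
            merged[k] = merged[k] + scale * v
    return merged

base = {"w": 1.0}
ft_noise = {"w": 3.0}    # hypothetical model finetuned on noisy speech
ft_accent = {"w": 0.0}   # hypothetical model finetuned on accented speech
tv_noise = task_vector(ft_noise, base)     # {"w": 2.0}
tv_accent = task_vector(ft_accent, base)   # {"w": -1.0}
combined = apply_task_vectors(base, [tv_noise, tv_accent], scale=0.5)
assert combined == {"w": 1.5}
```

The appeal for ASR is that a single cheap vector addition stands in for a full finetuning run when combining adaptations.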
564 | Large Language Models As A Proxy For Human Evaluation In Assessing The Comprehensibility Of Disordered Speech Transcription Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement particularly in the recognition of disordered … |
KATRIN TOMANEK et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
565 | A Study on The Adverse Impact of Synthetic Speech on Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: High-quality synthetic speech generated by TTS has been widely used in human-computer interaction, giving users a better experience. However, synthetic speech is prone to … |
Jian Huang; Yancheng Bai; Yang Cai; Wei Bian; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
566 | Automatic Speech Recognition Tuned for Child Speech in The Classroom Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: K-12 school classrooms have proven to be a challenging environment for Automatic Speech Recognition (ASR) systems, both due to background noise and conversation, and differences … |
ROSY SOUTHWELL et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
567 | Extending Large Language Models for Speech and Audio Captioning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multimodal large language models (LLMs) have shown promising visual perception abilities by connecting with image encoders, but their performance on auditory tasks has not yet … |
CHANGLI TANG et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
568 | The Fosafer System for The ICASSP2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents Fosafer's submissions to the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), which includes both the Automatic Speech … |
Shangkun Huang; Yuxuan Du; Yankai Wang; Jing Deng; Rong Zheng; | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
569 | Improved Children’s Automatic Speech Recognition Combining Adapters and Synthetic Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Children’s automatic speech recognition (ASR) poses a significant challenge due to the high variability nature of children’s speech. The limited availability of training datasets … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
570 | Exploring Adapters with Conformers for Children’s Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The high variability in acoustic, pronunciation, and linguistic characteristics of children’s speech makes children’s automatic speech recognition (ASR) a complex task. … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
571 | Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been independently trained to generate comprehensive transcripts accompanied by … |
SHASHI KUMAR et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
572 | Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study investigates the effective finetuning of a pretrained model using adapters for speech emotion recognition (SER). Since emotion is related to linguistic and prosodic … |
Yuan Gao; Hao Shi; Chenhui Chu; Tatsuya Kawahara; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
573 | Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging … |
William Chen; Takatomo Kano; A. Ogawa; Marc Delcroix; Shinji Watanabe; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
574 | SIR-Progressive Audio-Visual TF-Gridnet with ASR-Aware Selector for Target Speaker Extraction in MISP 2023 Challenge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: TF-GridNet has demonstrated its effectiveness in speech separation and enhancement. In this paper, we extend its capabilities for progressive audio-visual speech enhancement by … |
ZHONGSHU HOU et al. | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
575 | Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) multi-speaker speech recognition with the serialized output training (SOT) strategy demonstrates good performance in modeling diverse speaker scenarios. However, … |
TAO LI et al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
576 | Automatic Speech Recognition Advancements for Indigenous Languages of The Americas Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. |
Monica Romero; Sandra Gomez; Ivan G. Torre; | arxiv-cs.CL | 2024-04-12 |
577 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. |
Tien-Hong Lo; Fu-An Chao; Tzu-I Wu; Yao-Ting Sung; Berlin Chen; | arxiv-cs.SD | 2024-04-11 |
578 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in The Medical Domain Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present VietMed – a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. |
Khai Le-Duc; | arxiv-cs.CL | 2024-04-08 |
579 | Mai Ho’omāuna I Ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. |
Kaavya Chaparala; Guido Zarrella; Bruce Torres Fischer; Larry Kimura; Oiwi Parker Jones; | arxiv-cs.CL | 2024-04-03 |
580 | Noise Masking Attacks and Defenses for Pretrained Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
Matthew Jagielski; Om Thakkar; Lun Wang; | arxiv-cs.LG | 2024-04-02 |
581 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
Alexandros Haliassos; Andreas Zinonos; Rodrigo Mira; Stavros Petridis; Maja Pantic; | arxiv-cs.CV | 2024-04-02 |
582 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
Siyuan Shen; Yu Gao; Feng Liu; Hanyang Wang; Aimin Zhou; | arxiv-cs.SD | 2024-03-28 |
583 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. |
YASH JAIN et al. | arxiv-cs.CL | 2024-03-28 |
584 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information to facilitate mitigation of phonetic confusion for NEC on ASR transcription. |
Yi-Cheng Wang; Hsin-Wei Wang; Bi-Cheng Yan; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-03-26 |
585 | More Than Words: Advancements and Challenges in Speech Recognition for Singing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. |
Anna Kruspe; | arxiv-cs.SD | 2024-03-14 |
586 | A Review on Gujarati Language Based Automatic Speech Recognition (ASR) Systems Related Papers Related Patents Related Grants Related Venues Related Experts View |
Mohit Dua; Bhavesh Bhagat; Shelza Dua; N. Chakravarty; | Int. J. Speech Technol. | 2024-03-12 |
587 | Automatic Speech Recognition (ASR) for The Diagnosis of Pronunciation of Speech Sound Disorders in Korean Children Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. |
TAEKYUNG AHN et al. | arxiv-cs.CL | 2024-03-12 |
588 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. |
Jiayu Du; Jinpeng Li; Guoguo Chen; Wei-Qiang Zhang; | arxiv-cs.CL | 2024-03-12 |
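Evaluation platforms like this one typically report word error rate (WER): the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the reference length. A minimal sketch (whitespace tokenisation is an assumption; production scoring pipelines also normalise case and punctuation):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert abs(wer("the cat sat", "the bat sat") - 1 / 3) < 1e-9  # one substitution
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason leaderboards pin down normalisation and scoring conventions precisely.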
589 | Dataset and Evaluation of Automatic Speech Recognition for Multi-lingual Intent Recognition on Social Robots Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While Automatic Speech Recognition (ASR) systems excel in controlled environments, challenges arise in robot-specifc setups due to unique microphone requirements and added noise … |
Antonio Andriella; Raquel Ros; Yoav Ellinson; Sharon Gannot; S. Lemaignan; | 2024 19th ACM/IEEE International Conference on Human-Robot … | 2024-03-11 |
590 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-03-10 |
591 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. |
Yufeng Yang; Ashutosh Pandey; DeLiang Wang; | arxiv-cs.SD | 2024-03-10 |
592 | A New Benchmark for Evaluating Automatic Speech Recognition in The Arabic Call Domain Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. |
QUSAI ABO OBAIDAH et al. | arxiv-cs.AI | 2024-03-07 |
593 | Kirigami Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) … |
Sudershan Boovaraghavan; Haozhe Zhou; Mayank Goel; Yuvraj Agarwal; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2024-03-06 |
594 | JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent … |
Chang Sun; Hong Yang; Bo Qin; | ArXiv | 2024-03-04 |
595 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) … |
Joonas Kalda; Clément Pagés; R. Marxer; Tanel Alumäe; Hervé Bredin; | The Speaker and Language Recognition Workshop | 2024-03-04 |
596 | Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This survey offers a comprehensive review of DTL, FL, and RL-based ASR frameworks, aiming to provide insights into the latest developments and aid researchers and professionals in understanding the current challenges. |
Hamza Kheddar; Mustapha Hemis; Yassine Himeur; | arxiv-cs.SD | 2024-03-02 |
597 | A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio … |
Tyler Benster; G. Wilson; Reshef Elisha; Francis R. Willett; S. Druckmann; | ArXiv | 2024-03-02 |
598 | Towards Inclusive Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View |
Siyuan Feng; B. Halpern; O. Kudina; O. Scharenborg; | Comput. Speech Lang. | 2024-03-01 |
599 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. |
Heyang Liu; Yu Wang; Yanfeng Wang; | arxiv-cs.CL | 2024-03-01 |
600 | Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
Jeehyun Lee; Yerin Choi; Tae-Jin Song; Myoung-Wan Koo; | arxiv-cs.CL | 2024-02-29 |
601 | Probing The Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Following much research on neural network interpretability, we propose in this article a protocol that aims to determine what information is encoded, and where it is located, in an ASR acoustic model (AM). |
Quentin Raymondaud; Mickael Rouvier; Richard Dufour; | arxiv-cs.SD | 2024-02-29 |
602 | Exploration of Adapter for Noise Robust Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study thoroughly investigates adapter-based ASR adaptation in noisy environments. |
Hao Shi; Tatsuya Kawahara; | arxiv-cs.SD | 2024-02-28 |
603 | A Multitask Co-training Framework for Improving Speech Translation By Leveraging Speech Recognition and Machine Translation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View |
Yue Zhou; Yuxuan Yuan; Xiaodong Shi; | Neural Comput. Appl. | 2024-02-27 |
604 | Large Language Models Are Efficient Learners of Noise-Robust Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER, just as robust ASR models do, where one solution is introducing noise information as a conditioner into the LLM. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcriptions by efficient LLM finetuning, which shows great effectiveness but lacks specificity for noise-robust ASR. |
YUCHEN HU et al. | iclr | 2024-02-26 |
605 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
Tzu-Ting Yang; Hsin-Wei Wang; Yi-Cheng Wang; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-02-26 |
606 | It’s Never Too Late: Fusing Acoustic Information Into Large Language Models for Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). |
CHEN CHEN et al. | iclr | 2024-02-26 |
607 | LipVoicer: Generating Speech from Silent Videos Guided By Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. |
Yochai Yemini; Aviv Shamsian; Lior Bracha; Sharon Gannot; Ethan Fetaya; | iclr | 2024-02-26 |
608 | Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. |
Qiushi Zhu; Jie Zhang; Yu Gu; Yuchen Hu; Lirong Dai; | aaai | 2024-02-20 |
609 | OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). |
Yifan Peng; Yui Sudo; Muhammad Shakeel; Shinji Watanabe; | arxiv-cs.CL | 2024-02-19 |
610 | Phantom in The Opera: Adversarial Music Attack for Robot Dialogue System Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study explores the vulnerability of robot dialogue systems’ automatic speech recognition (ASR) module to adversarial music attacks. Specifically, we explore music as a … |
Sheng Li; Jiyi Li; Yang Cao; | Frontiers Comput. Sci. | 2024-02-15 |
611 | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et al. | arxiv-cs.CL | 2024-02-13 |
612 | The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems’ performance in multilingual settings. |
Ajinkya Kulkarni; Anna Tokareva; Rameez Qureshi; Miguel Couceiro; | arxiv-cs.CL | 2024-02-12 |
613 | Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et al. | arxiv-cs.CL | 2024-02-08 |
614 | A Comprehensive Study of The Current State-of-the-Art in Nepali Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). |
Rupak Raj Ghimire; Bal Krishna Bal; Prakash Poudyal; | arxiv-cs.SD | 2024-02-05 |
615 | Digits Micro-model for Accurate and Secure Transactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. |
Chirag Chhablani; Nikhita Sharma; Jordan Hosier; Vijay K. Gurbani; | arxiv-cs.LG | 2024-02-02 |
616 | Streaming Sequence Transduction Through Dynamic Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. |
WEITING TAN et al. | arxiv-cs.CL | 2024-02-02 |
617 | AccentFold: A Journey Through African Accents for Zero-Shot ASR Adaptation to Target Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). |
Abraham Toluwase Owodunni; Aditya Yadavalli; Chris Chinenye Emezue; Tobi Olatunji; Clinton C Mbataku; | arxiv-cs.CL | 2024-02-02 |
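The core idea of exploiting spatial relationships between learned accent embeddings can be illustrated by ranking known accents by similarity to an unseen target accent, then borrowing adaptation data from the nearest neighbours. A toy sketch with made-up 2-D embeddings (the accent names, vectors, and cosine-similarity choice are hypothetical illustrations, not details from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_accents(target, accent_embeddings, k=2):
    """Rank known accents by similarity to a target accent embedding."""
    ranked = sorted(accent_embeddings,
                    key=lambda a: cosine(target, accent_embeddings[a]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embeddings: the target is acoustically close to the first two.
emb = {"yoruba": [1.0, 0.0], "hausa": [0.9, 0.1], "swahili": [0.0, 1.0]}
assert nearest_accents([1.0, 0.05], emb, k=2) == ["yoruba", "hausa"]
```

The zero-shot angle is that the target accent itself needs no transcribed data; only its embedding and its neighbours' data are used.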
618 | Exploring The Limits of Decoder-only Models Trained on Public Speech Recognition Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. |
Ankit Gupta; George Saon; Brian Kingsbury; | arxiv-cs.CL | 2024-01-31 |
619 | Improving ASR Performance with OCR Through Using Word Frequency Difference IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, there has been a growing interest in conversational artificial intelligence (AI). As a result, research is actively being conducted on automatic speech recognition (ASR) … |
Kyudan Jung; Seungmin Bae; N. Kim; Hyun Gon Ryu; Hyuk-Jae Lee; | 2024 International Conference on Electronics, Information, … | 2024-01-28 |
620 | Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research highlights the dependency of BPE subword tokenization’s efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. |
Ahnaf Mozib Samin; | arxiv-cs.CL | 2024-01-27 |
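Byte-pair encoding greedily merges the most frequent adjacent symbol pair, and the number of merges fixes the subword vocabulary size that the study tunes for Bengali. A minimal sketch of the merge-learning loop over a toy word-frequency table (the training words are illustrative, not from the paper's data):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a word-frequency table.

    words: {('l','o','w'): count, ...} mapping symbol tuples to frequencies.
    Returns the ordered merge list and the final segmented vocabulary.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            w, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    w.append(merged)
                    i += 2
                else:
                    w.append(word[i])
                    i += 1
            new_vocab[tuple(w)] = new_vocab.get(tuple(w), 0) + freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}, 2)
assert merges == [("l", "o"), ("lo", "w")]
```

With few merges the tokens stay close to characters; with many merges they approach whole words, which is exactly the trade-off the highlight ties to morphological richness.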
621 | Toward Practical Automatic Speech Recognition and Post-Processing: A Call for Explainable Error Benchmark Guideline Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. |
SEONMIN KOO et al. | arxiv-cs.CL | 2024-01-25 |
622 | SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
CHYI-JIUNN LIN et al. | arxiv-cs.CL | 2024-01-24 |
623 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
Jiajun He; Xiaohan Shi; Xingfeng Li; Tomoki Toda; | arxiv-cs.CL | 2024-01-24 |
624 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. RONNY HUANG et al. | arxiv-cs.CL | 2024-01-23 |
625 | Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
Michael Hentschel; Yuta Nishikawa; Tatsuya Komatsu; Yusuke Fujita; | arxiv-cs.CL | 2024-01-22 |
626 | Using Large Language Model for End-to-End Chinese ASR and NER Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. |
YUANG LI et al. | arxiv-cs.CL | 2024-01-20 |
627 | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. |
Hao Wang; Shuhei Kurita; Shuichiro Shimizu; Daisuke Kawahara; | arxiv-cs.CV | 2024-01-18 |
628 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel bilevel optimization-based training approach to training acoustic models for automatic speech recognition (ASR) tasks that we term {bi-level joint unsupervised and supervised training (BL-JUST)}. |
A F M SAIF et. al. | arxiv-cs.CL | 2024-01-13 |
629 | LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
Fan Yu; Haoxu Wang; Xian Shi; Shiliang Zhang; | arxiv-cs.SD | 2024-01-12 |
630 | XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This research paper focuses on the development and evaluation of Automatic Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims to improve ASR … |
Panji Arisaputra; Alif Tri Handoyo; Amalia Zahra; | ArXiv | 2024-01-12 |
631 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. |
JIAXIN GUO et. al. | arxiv-cs.CL | 2024-01-11 |
632 | End to End Hindi to English Speech Conversion Using Bark, MBART and A Finetuned XLSR Wav2Vec2 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a … |
Aniket Tathe; Anand Kamble; Suyash Kumbharkar; Atharva Bhandare; Anirban C. Mitra; | ArXiv | 2024-01-11 |
633 | Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: \textbf{Objectives}: We aimed to investigate how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy, specifically in the “Cookie Theft” picture description task. |
Changye Li; Weizhe Xu; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-01-10 |
634 | A New MmWave-Speech Multimodal Speech System for Voice User Interface Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Voice user interface (VUI) plays an essential role in intelligent scenes, e.g., smart homes. It provides a hands- and eyes-free human-machine interaction between humans and … |
Tiantian Liu; Feng Lin; | GetMobile: Mobile Computing and Communications | 2024-01-08 |
635 | High-precision Voice Search Query Correction Via Retrievable Speech-text Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, ASR-hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the transcript truth. In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of the audio of an utterance and the embedding of its corresponding textual transcript close together. |
CHRISTOPHER LI et. al. | arxiv-cs.CL | 2024-01-08 |
636 | An Audio-quality-based Multi-strategy Approach for Target Speaker Extraction in The MISP 2023 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2024-01-08 |
637 | Cross-Speaker Encoding Network for Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. |
JIAWEN KANG et. al. | arxiv-cs.SD | 2024-01-08 |
638 | ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. |
HE WANG et. al. | arxiv-cs.SD | 2024-01-07 |
639 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
He Wang; Pengcheng Guo; Pan Zhou; Lei Xie; | arxiv-cs.SD | 2024-01-07 |
640 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
KEVIN EVERSON et. al. | arxiv-cs.CL | 2024-01-05 |
641 | Research on The Application of Speech Database Based on Emotional Feature Extraction in International Chinese Education and Teaching Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advanced analysis of the relationship between acoustic and emotional characteristics of speech signals can effectively improve the interactivity and intelligence of computers. … |
Xiangli Zhang; | Scalable Comput. Pract. Exp. | 2024-01-04 |
642 | An Approach for Speech Enhancement in Low SNR Environments Using Granular Speaker Embedding Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The proliferation of speech technology applications has led to an unprecedented demand for effective speech enhancement techniques, particularly in low Signal-to-Noise Ratio (SNR) … |
Jayasree Saha; Rudrabha Mukhopadhyay; A. Agrawal; Surabhi Jain; C. V. Jawahar; | Proceedings of the 7th Joint International Conference on … | 2024-01-04 |
643 | Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. |
Rita Frieske; Bertram E. Shi; | arxiv-cs.CL | 2024-01-03 |
644 | Fine-Tuning ASR Models for Very Low-Resource Languages: A Study on Mvskoke Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent advancements in multilingual models for automatic speech recognition (ASR) have been able to achieve a high accuracy for languages with extremely limited resources. This … |
Julia Mainzinger; Gina-Anne Levow; | Annual Meeting of the Association for Computational … | 2024-01-01 |
645 | Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched … |
Sei Ueno; Akinobu Lee; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
646 | BanSpeech: A Multi-Domain Bangla Speech Recognition Benchmark Toward Robust Performance in Challenging Conditions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Despite huge improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still suffer from a lack of robustness and generalizability issues due to … |
AHNAF MOZIB SAMIN et. al. | IEEE Access | 2024-01-01 |
647 | Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper proposes a novel meta-transfer learning method to improve automatic speech recognition (ASR) performance in low-resource languages. Nowadays, we are witnessing high … |
Rui Zhou; Takaki Koshikawa; Akinori Ito; Takashi Nose; Chia-Ping Chen; | IEEE Access | 2024-01-01 |
648 | ASQ: An Ultra-Low Bit Rate ASR-Oriented Speech Quantization Method Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: For efficient transmission of speech signals, speech compression methodologies have attracted significant research attention for decades and are widely used in automatic speech … |
Lingxuan Ye; Changfeng Gao; Gaofeng Cheng; Liuping Luo; Qingwei Zhao; | IEEE Signal Processing Letters | 2024-01-01 |
649 | Keyword Guided Target Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This letter presents a new target speech recognition problem, where the target speech is defined by a keyword. For instance, when a person speaks “Hey Google” or “Help Me”, we … |
Yinghao Shi; Lantian Li; Dong Wang; Jiqing Han; | IEEE Signal Processing Letters | 2024-01-01 |
650 | Enhancing Automatic Speech Recognition With Personalized Models: Improving Accuracy Through Individualized Fine-Tuning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) systems have become increasingly popular in recent years due to their ability to convert spoken language into text. Nonetheless, despite their … |
V. BRYDINSKYI et. al. | IEEE Access | 2024-01-01 |
651 | Chinese Spoken Named Entity Recognition in Real-world Scenarios: Dataset and Approaches Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Spoken Named Entity Recognition (NER) aims to extract entities from speech. The extracted entities can help voice assistants better understand user’s questions and … |
SHILIN ZHOU et. al. | Annual Meeting of the Association for Computational … | 2024-01-01 |
652 | Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the accuracy and robustness of speech recognition systems with the assistance of visual cues in … |
Jiahong Li; Chenda Li; Yifei Wu; Yanmin Qian; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
653 | Arabic Speech Recognition: Advancement and Challenges Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recognition is a captivating process that revolutionizes human-computer interactions, allowing us to interact and control machines through spoken commands. The foundation … |
ASHIFUR RAHMAN et. al. | IEEE Access | 2024-01-01 |
654 | Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We investigate state-of-the-art automatic speech recognition (ASR) systems and provide thorough investigations on training methods to adapt them to low-resourced electrolaryngeal … |
Lester Phillip Violeta; D. Ma; Wen-Chin Huang; T. Toda; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
655 | Exploring Native and Non-Native English Child Speech Recognition With Whisper Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Modern end-to-end Automatic Speech Recognition (ASR) systems struggle to recognise children’s speech. This challenge is due to the high acoustic variability in children’s voices … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | IEEE Access | 2024-01-01 |
656 | ESAformer: Enhanced Self-Attention for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this letter, an Enhanced Self-Attention (ESA) module has been put forward for feature extraction. The proposed ESA is integrated with the recursive gated convolution and … |
Junhua Li; Zhikui Duan; Shiren Li; Xinmei Yu; Guangguang Yang; | IEEE Signal Processing Letters | 2024-01-01 |
657 | Explainability of Speech Recognition Transformers Via Gradient-Based Attention Visualization Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: In vision Transformers, attention visualization methods are used to generate heatmaps highlighting the class-corresponding areas in input images, which offers explanations on how … |
Tianli Sun; Haonan Chen; Guosheng Hu; Lianghua He; Cairong Zhao; | IEEE Transactions on Multimedia | 2024-01-01 |
658 | Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance in many datasets, spectrogram-based SE … |
Hao Shi; M. Mimura; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
659 | Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. … |
Yukun Ma; Chong Zhang; Qian Chen; Wen Wang; Bin Ma; | IEEE Signal Processing Letters | 2024-01-01 |
660 | Semantic Role Labeling from Chinese Speech Via End-to-End Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Semantic Role Labeling (SRL), crucial for understanding semantic relationships in sentences, has traditionally focused on text-based input. However, the increasing use of voice … |
Huiyao Chen; Xinxin Li; Meishan Zhang; Min Zhang; | Annual Meeting of the Association for Computational … | 2024-01-01 |
661 | Mel-Scale Frequency Extraction and Classification of Dialect-Speech Signals With 1D CNN Based Classifier for Gender and Region Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Humans communicate and interact through natural languages, such as American English (AE), Taiwanese, Italian, and numerous variants of Spanish. Through automatic speech analysis … |
HSIANG-YUEH LAI et. al. | IEEE Access | 2024-01-01 |
662 | Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training … |
Geon Woo Lee; Hong Kook Kim; Duk-Jo Kong; | IEEE Access | 2024-01-01 |
663 | Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
HUIMENG WANG et. al. | arxiv-cs.SD | 2023-12-31 |
664 | KEBAP: Korean Error Explainable Benchmark Dataset for ASR and Post-processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional evaluation metrics for ASR systems produce a singular aggregate score, which is insufficient for understanding specific system vulnerabilities. Therefore, we aim to address the limitations of the previous ASR evaluation methods by introducing the Korean Error Explainable Benchmark Dataset for ASR and Post-processing (KEBAP). |
SEONMIN KOO et. al. | emnlp | 2023-12-22 |
665 | Accented Speech Recognition With Accent-specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. |
Darshan Prabhu; Preethi Jyothi; Sriram Ganapathy; Vinit Unni; | emnlp | 2023-12-22 |
666 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | emnlp | 2023-12-22 |
667 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | emnlp | 2023-12-22 |
668 | CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, the availability of datasets for this is limited. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. |
Zishan Guo; Linhao Yu; Minghui Xu; Renren Jin; Deyi Xiong; | emnlp | 2023-12-22 |
669 | Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we attempt to resolve structurally ambiguous utterances into unambiguous texts in Indonesian using prosodic information. |
RUHIYAH WIDIAPUTRI et. al. | emnlp | 2023-12-22 |
670 | CLAD-ST: Contrastive Learning with Adversarial Data for Robust Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address this robustness problem in downstream MT models by forcing the MT encoder to bring the representations of a noisy input closer to its clean version in the semantic space. This is achieved by introducing a contrastive learning method that leverages adversarial examples in the form of ASR outputs paired with their corresponding human transcripts to optimize the network parameters. |
Sathish Indurthi; Shamil Chollampatt; Ravi Agrawal; Marco Turchi; | emnlp | 2023-12-22 |
671 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an approach, that builds on a pre-trained ASR model and extends it with an adaptive upstream module, that fuses audio and visual information. |
Christopher Simic; Tobias Bocklet; | arxiv-cs.SD | 2023-12-21 |
672 | Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning. |
ANIRUDH S. SUNDAR et. al. | arxiv-cs.LG | 2023-12-21 |
673 | KNN-CTC: Enhancing ASR Via Retrieval of CTC Pseudo Labels Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The success of retrieval-augmented language models in various natural language processing (NLP) tasks has been constrained in automatic speech recognition (ASR) applications due to challenges in constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes these challenges by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground truth alignments. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2023-12-20 |
674 | SpokesBiz — An Open Corpus of Conversational Polish Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We outline the general structure and content of the corpus, showcasing selected applications in linguistic research, evaluation and improvement of automatic speech recognition (ASR) systems |
PIOTR PĘZIK et. al. | arxiv-cs.CL | 2023-12-19 |
675 | SpokesBiz – An Open Corpus of Conversational Polish Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper announces the early release of SpokesBiz, a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and comprising over 650 hours of … |
PIOTR PEZIK et. al. | ArXiv | 2023-12-19 |
676 | Arabic Speech Recognition Based on Self Supervised Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Arabic Speech Recognition (AASR) has gained significant attention in recent years due to its potential applications in various fields such as transcription, voice … |
Hiba Adreese Younis; Yusra Faisal Mohammad; | 2023 16th International Conference on Developments in … | 2023-12-18 |
677 | Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this … |
Peng Shen; Xugang Lu; Hisashi Kawai; | ArXiv | 2023-12-18 |
678 | Seq2seq for Automatic Paraphasia Detection in Aphasic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks. |
MATTHEW PEREZ et. al. | arxiv-cs.SD | 2023-12-16 |
679 | Towards Robust Packet Loss Concealment System With ASR-Guided Representations Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Despite the significant advancements and promising performance of deep learning-based packet loss concealment (PLC) systems in transmission systems, their focus on modeling … |
Dali Yang; Joon-Hyuk Chang; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
680 | Ending The Blind Flight: Analyzing The Impact of Acoustic and Lexical Factors on WAV2VEC 2.0 in Air-Traffic Control Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Transformer neural networks have shown remarkable success on standard automatic speech recognition (ASR) benchmarks. However, they are known to be less robust against domain … |
Alexander Blatt; Badr M. Abdullah; D. Klakow; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
681 | Hierarchical Attention-Based Contextual Biasing For Personalized Speech Recognition Using Neural Transducers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although end-to-end (E2E) automatic speech recognition (ASR) systems excel in general tasks, they frequently struggle with accurately recognizing personal rare words. Leveraging … |
Sibo Tong; Philip Harding; Simon Wiesler; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
682 | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Transfer learning from large multilingual pretrained models, like XLSR, has become the new paradigm for Automatic Speech Recognition (ASR). Considering their ever-increasing size, … |
GEOFFROY VANDERREYDT et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
683 | Parameter-Efficient Cross-Language Transfer Learning for A Language-Modular Audiovisual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In audiovisual speech recognition (AV-ASR), for many languages only few audiovisual data is available. Building upon an English model, in this work, we first apply and analyze … |
ZHENGYANG LI et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
684 | Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Much of the recent progress in automatic speech recognition (ASR) lies in developing an acoustic encoder, such as enlarging its capacity and designing a refined architecture for … |
Yosuke Higuchi; Andrew Rosenberg; Yuan Wang; M. Baskar; B. Ramabhadran; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
685 | Conformer-Based Speech Recognition On Extreme Edge-Computing Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. |
MINGBIN XU et. al. | arxiv-cs.LG | 2023-12-16 |
686 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset In Self-Supervised Models for Low-Resource ASR Task Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Indonesia is home to roughly 700 languages, which amounts to about ten percent of the global total, positioning it as the second-most linguistically diverse country after Papua … |
S. Sakti; Benita Angela Titalim; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
687 | IFF-WAV2VEC: Noise Robust Low-Resource Speech Recognition Based on Self-supervised Learning and Interactive Feature Fusion Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In recent years, self-supervised learning representation (SSLR) has shown remarkable performance in low-resource speech recognition. However, it lacks consideration for the … |
Jing Cao; Zhaopeng Qian; Chongchong Yu; Tao Xie; | Proceedings of the 2023 6th Artificial Intelligence and … | 2023-12-16 |
688 | LiteVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
HENDRIK LAUX et. al. | arxiv-cs.CV | 2023-12-15 |
689 | Automatic Channel Selection and Spatial Feature Integration for Multi-channel Speech Recognition Across Various Array Topologies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
BINGSHEN MU et. al. | arxiv-cs.SD | 2023-12-15 |
690 | Knowledge Prompt for Whisper: An ASR Entity Correction Approach with Knowledge Base Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Entity correction is crucial in Automatic Speech Recognition (ASR), since erroneous entities seriously affect our understanding of ASR results. In this paper, in order to … |
MIN ZHANG et. al. | 2023 IEEE International Conference on Big Data (BigData) | 2023-12-15 |
691 | Improvement of Automatic Speech Recognition Systems Utilizing 2D Adaptive Wavelet Transformation Applied to Recurrence Plot of Speech Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View |
S. Firooz; F. Almasganj; Yasser Shekofteh; | Signal, Image and Video Processing | 2023-12-15 |
692 | On The Compression of Shallow Non-causal ASR Models Using Knowledge Distillation and Tied-and-reduced Decoder for Low-latency On-device Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. |
NAGARAJ ADIGA et. al. | arxiv-cs.SD | 2023-12-15 |
693 | Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most past studies have simplified the learning complexity of the model by splitting the code-switching task into multiple tasks dealing with a single language and then learning the domain-specific knowledge of each language separately. Therefore, in this paper, we attempt to introduce language identification information into the middle layer of the ASR model’s encoder. |
Tzu-Ting Yang; Hsin-Wei Wang; Berlin Chen; | arxiv-cs.CL | 2023-12-15 |
694 | Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, audio-visual speech recognition (AVSR), which leverages the video modality as additional information to extend automatic speech recognition (ASR), has shown promising … |
Fan Yu; Haoxu Wang; Ziyang Ma; Shiliang Zhang; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
695 | FastInject: Injecting Unpaired Text Data Into CTC-Based ASR Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the … |
Keqi Deng; Phil Woodland; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
696 | Towards Automatic Data Augmentation for Disordered Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data … |
ZENGRUI JIN et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
697 | Extending Whisper with Prompt Tuning to Target-speaker ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
Hao Ma; Zhiyuan Peng; Mingjie Shao; Jing Li; Ju Liu; | arxiv-cs.CL | 2023-12-13 |
698 | ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, a time-domain recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and advance ASR accuracy, based on a convolutional encoder-decoder U-Net framework; it serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model. |
Xincheng Yu; Dongyue Guo; Jianwei Zhang; Yi Lin; | arxiv-cs.SD | 2023-12-10 |
699 | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. |
Wonjun Lee; Gary Geunbae Lee; Yunsu Kim; | arxiv-cs.CL | 2023-12-06 |
700 | Taiwanese Hakka Across Taiwan Corpus and Formosa Speech Recognition Challenge 2023 – Hakka ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: To revive the endangered Taiwanese Hakka language, the first large-scale Taiwanese Hakka speech corpus across Taiwan (HAT) was developed, representing modern Taiwanese Hakka … |
YUAN-FU LIAO et. al. | 2023 26th Conference of the Oriental COCOSDA International … | 2023-12-04 |
701 | End-to-End Speech-to-Text Translation: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, researchers have been exploring end-to-end (E2E) models for ST. |
Nivedita Sethiya; Chandresh Kumar Maurya; | arxiv-cs.CL | 2023-12-02 |
702 | FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. |
Dongning Yang; Wei Wang; Yanmin Qian; | arxiv-cs.SD | 2023-11-29 |
703 | Research Applications of Hidden Markov Models in Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Reinforcement Learning, a vital branch of Machine Learning, has gained significant attention due to its interactive and goal-oriented learning approach. Its primary objective is … |
Zeng Li; Zhenzhen Wang; Xiaofei Sun; | Proceedings of the 2023 International Conference on … | 2023-11-18 |
704 | On The Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but struggle with the non-stationary noises found in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. |
Xiaohan Shi; Jiajun He; Xingfeng Li; Tomoki Toda; | arxiv-cs.SD | 2023-11-13 |
705 | Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. |
Qijie Shao; Pengcheng Guo; Jinghao Yan; Pengfei Hu; Lei Xie; | arxiv-cs.SD | 2023-11-12 |
706 | Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to model speech tokens in an autoregressive way, similar to text. |
QIAN CHEN et. al. | arxiv-cs.CL | 2023-11-08 |
707 | Improved Child Text-to-Speech Synthesis Through Fastpitch-based Transfer Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech. |
Rishabh Jain; Peter Corcoran; | arxiv-cs.SD | 2023-11-07 |
708 | Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. |
RABINDRA NATH NANDI et. al. | arxiv-cs.CL | 2023-11-06 |
709 | COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. |
JING PAN et. al. | arxiv-cs.CL | 2023-11-03 |
710 | Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models Via Language-Specific Experts IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Whisper yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. |
Thomas Palmeira Ferraz; Marcely Zanon Boito; Caroline Brun; Vassilina Nikoulina; | arxiv-cs.CL | 2023-11-02 |
711 | Multi-Self-Supervised Learning Model-Based Throat Microphone Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Throat microphones (TMs) can record sounds and mitigate the effects of external noise. Ongoing works seek to apply TMs to speech recognition in high-noise environments. However, … |
Kohta Masuda; Jun Ogata; Masafumi Nishida; Masafumi Nishimura; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
712 | Learning Adapters for Code-Switching Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multilingual code-switching speech recognition has been an emerging research direction in real-world applications since most speakers are bilingual or multilingual. A … |
Chun-Yi He; Jen-Tzung Chien; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
713 | Speech Emotion Recognition By Late Fusion of Linguistic and Acoustic Features Using Deep Learning Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this study, we investigated speech emotion recognition using both linguistic and acoustic features contained in emotional speech. Speech recognition is necessary to extract … |
Kiyohide Sato; Keita Kishi; Tetsuo Kosaka; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
714 | Incorporating Pinyin Into Pipeline Named Entity Recognition from Chinese Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Named Entity Recognition (NER) from speech is usually implemented through a two-step pipeline that consists of (1) processing the audio using an Automatic Speech Recognition (ASR) … |
MIN ZHANG et. al. | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
715 | ASR Model Adaptation for Rare Words Using Synthetic Data Generated By Multiple Text-To-Speech Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) for rare words is difficult as there are little relevant text-audio data pairs to train an ASR model. To obtain more text-audio pairs, text-only … |
Kwok Chin Yuen; Haoyang Li; Chng Eng Siong; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
716 | Synthetic Data Augmentation for ASR with Domain Filtering Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent studies have shown that synthetic speech can effectively serve as training data for automatic speech recognition models. Text data for synthetic speech is mostly obtained … |
Tuan Vu Ho; Shota Horiguchi; Shinji Watanabe; Paola Garcia; Takashi Sumiyoshi; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
717 | Transformer-based Automatic Speech Recognition of Simultaneous Interpretation with Auxiliary Input of Source Language Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) of simultaneous interpretation is challenging due to disfluencies such as hesitations, filled pauses, interruptions, and self-repairs. … |
Shuta Taniguchi; Tsuneo Kato; Akihiro Tamura; Keiji Yasuda; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
718 | Language Modeling for Spontaneous Speech Recognition Based on Disfluency Labeling and Generation of Disfluent Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Disfluencies in spontaneous speech such as fillers and hesitations are major causes of automatic speech recognition (ASR) errors. In our previous work, we proposed a disfluency … |
Koharu Horii; Kengo Ohta; Ryota Nishimura; Atsunori Ogawa; N. Kitaoka; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
719 | Effective Fine-tuning Method for Tibetan Low-resource Dialect Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Tibetan is a distinctive and culturally rich language spoken by millions of people across the Tibetan Plateau and surrounding regions. Exploring the application of speech … |
JIAHAO YANG et. al. | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
720 | MUST: A Multilingual Student-Teacher Learning Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, the aforementioned limitation is addressed by proposing MUltilingual Student-Teacher (MUST) learning, which exploits a posteriors mapping approach. |
Muhammad Umar Farooq; Rehan Ahmad; Thomas Hain; | arxiv-cs.CL | 2023-10-28 |
721 | A Review on Speech Recognition for Under-Resourced Languages: A Case Study of Vietnamese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Fundamental speech recognition technologies for high-resourced languages are currently successful in building high-quality applications using deep learning models. However, … |
Trung-Nghia Phung; Duc-Binh Nguyen; Ngoc-Phuong Pham; | Int. J. Knowl. Syst. Sci. | 2023-10-27 |
722 | Evaluating A Fine-Tuned Whisper Model on Underrepresented Romanian Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech datasets available for training Romanian automatic speech recognition (ASR) systems are constructed around similar demographics (male voices, age between 19-29 years). In … |
V. Pais; V. Mititelu; Radu Ion; Elena Irimia; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
723 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | arxiv-cs.CL | 2023-10-25 |
724 | A Comparative Analysis Between Conformer-Transducer, Whisper, and Wav2vec2 for Improving The Child Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Automatic Speech Recognition (ASR) systems have progressed significantly in their performance on adult speech data; however, transcribing child speech remains challenging due to … |
Andrei Barcovschi; Rishabh Jain; Peter Corcoran; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
725 | Dysarthric Speech Recognition Using Depthwise Separable Convolutions: Preliminary Study Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: As a neurological disability that affects muscles involved in articulation, dysarthria is a speech impairment that leads to reduced speech intelligibility. In severe cases, these … |
Seyed Reza Shahamiri; Krishnendu Mandal; Sudeshna Sarkar; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
726 | Uncovering Bias in ASR Systems: Evaluating Wav2vec2 and Whisper for Dutch Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: It is crucial that ASR systems can handle the wide range of variations in speech of speakers from different demographic groups, with different speaking styles, and of speakers … |
Márcio Fuckner; Sophie Horsman; Pascal Wiggers; Iskaj Janssen; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
727 | ArTST: Arabic Text and Speech Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. |
Hawau Olamide Toyin; Amirbek Djanibekov; Ajinkya Kulkarni; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-25 |
728 | Hypotheses Paradise: An Open and Strong Baseline for Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | nips | 2023-10-24 |
729 | CDSD: Chinese Dysarthria Speech Database Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the widespread use of Automatic Speech Recognition (ASR), accurately recognizing dysarthric speech remains a formidable task, largely due to the limited availability of dysarthric speech data. To address this gap, we developed the Chinese Dysarthria Speech Database (CDSD), the most extensive collection of Chinese dysarthria data to date, featuring 133 hours of recordings from 44 speakers. |
YAN WANG et. al. | arxiv-cs.SD | 2023-10-24 |
730 | How Much Context Does My Attention-Based ASR System Need? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. |
Robert Flynn; Anton Ragni; | arxiv-cs.CL | 2023-10-24 |
731 | Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. |
SARA PAPI et. al. | arxiv-cs.CL | 2023-10-23 |
732 | Intuitive Multilingual Audio-Visual Speech Recognition with A Single-Trained Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. |
Joanna Hong; Se Jin Park; Yong Man Ro; | arxiv-cs.MM | 2023-10-23 |
733 | Conversational Speech Recognition By Learning Audio-textual Cross-modal Contextual Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. |
KUN WEI et. al. | arxiv-cs.SD | 2023-10-22 |
734 | Intelligibility Prediction with A Pretrained Noise-robust Automatic Speech Recognition Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction … |
Zehai Tu; Ning Ma; Jon Barker; | ArXiv | 2023-10-20 |
735 | BUT CHiME-7 System Description Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow and University of Buenos Aires on the development of Automatic Speech Recognition systems for the CHiME-7 Challenge. |
MARTIN KARAFIÁT et. al. | arxiv-cs.SD | 2023-10-18 |
736 | VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. |
Abdul Waheed; Bashar Talafha; Peter Sullivan; AbdelRahim Elmadany; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2023-10-17 |
737 | Generative Error Correction for Code-switching Speech Recognition Using Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), … |
CHEN CHEN et. al. | ArXiv | 2023-10-17 |
738 | Correction Focused Language Model Training for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. |
Yingyi Ma; Zhe Liu; Ozlem Kalinli; | arxiv-cs.CL | 2023-10-17 |
739 | Multi-stage Large Language Model Correction for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. |
Jie Pu; Thai-Son Nguyen; Sebastian Stüker; | arxiv-cs.CL | 2023-10-17 |
740 | Noise-Robust Automatic Speech Recognition for Industrial and Urban Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) models can achieve human parity, but their performance degrades significantly when used in noisy industrial and urban environments. In this … |
Daniil Orel; H. A. Varol; | IECON 2023- 49th Annual Conference of the IEEE Industrial … | 2023-10-16 |
741 | Detecting Speech Abnormalities With A Perceiver-Based Sequence Classifier That Leverages A Universal Speech Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech … |
H. SOLTAU et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-10-16 |
742 | End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2023-10-16 |
743 | Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. |
ZHIHONG LEI et. al. | arxiv-cs.LG | 2023-10-15 |
744 | Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to deliver accurate transcriptions across a wide variety of vocabularies and speaking styles. |
Ankitha Sudarshan; Vinay Samuel; Parth Patwa; Ibtihel Amara; Aman Chadha; | arxiv-cs.CL | 2023-10-14 |
745 | SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. |
ZHEHUAI CHEN et. al. | arxiv-cs.CL | 2023-10-13 |
746 | On The Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. |
Nick Rossenbach; Benedikt Hilmes; Ralf Schlüter; | arxiv-cs.CL | 2023-10-12 |
747 | Adapting The Adapters for Code-switching in Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. |
Atharva Kulkarni; Ajinkya Kulkarni; Miguel Couceiro; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-11 |
748 | A Study of Speech Recognition, Speech Translation, and Speech Summarization of TED English Lectures Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Our research focuses on developing an automatic speech recognition system for English lectures, which involves summarizing the content and providing Japanese subtitles. Subtitling … |
Kazumasa Yamamoto; Haruhiko Banno; Haruki Sakurai; Toichiro Adachi; Seiichi Nakagawa; | 2023 IEEE 12th Global Conference on Consumer Electronics … | 2023-10-10 |
749 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | arxiv-cs.CL | 2023-10-10 |
750 | No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While in the context of hybrid ASR models several solutions have been proposed, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. |
Dennis Fucci; Marco Gaido; Matteo Negri; Mauro Cettolo; Luisa Bentivogli; | arxiv-cs.CL | 2023-10-10 |
751 | Acoustic Model Fusion for End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. |
ZHIHONG LEI et. al. | arxiv-cs.SD | 2023-10-10 |
752 | ToozKit: System for Experimenting with Captions on A Head-worn Display Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advent of Automatic Speech Recognition (ASR) has made real-time captioning for the Deaf and Hard-of-Hearing (DHH) community possible, and integration of ASR into Head-worn … |
Peter Feng; David Martin; Thad Starner; | Adjunct Proceedings of the 2023 ACM International Joint … | 2023-10-08 |
753 | Ed-cec: Improving Rare Word Recognition Using Asr Postprocessing Based on Error Detection and Context-aware Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. |
Jiajun He; Zekun Yang; Tomoki Toda; | arxiv-cs.AI | 2023-10-08 |
754 | Improving End-to-End Speech Processing By Efficient Text Data Utilization with Latent Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. |
JIANQIAO LU et. al. | arxiv-cs.CL | 2023-10-08 |
755 | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. |
ZHIHAO DU et. al. | arxiv-cs.SD | 2023-10-06 |
756 | Dementia Assessment Using Mandarin Speech with An Attention-based Speech Recognition Encoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. |
ZIH-JYUN LIN et. al. | arxiv-cs.CL | 2023-10-05 |
757 | EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. |
Tejes Srivastava; Jiatong Shi; William Chen; Shinji Watanabe; | arxiv-cs.SD | 2023-10-05 |
758 | LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of End-to-end ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models. |
ALEKSANDR MEISTER et. al. | arxiv-cs.CL | 2023-10-04 |
759 | Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. |
Liming Wang; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2023-10-03 |
760 | Evaluating Speech Synthesis By Training Recognizers on Synthetic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models; however, this can be limiting, since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. |
DAREEN ALHARTHI et. al. | arxiv-cs.CL | 2023-10-01 |
761 | AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Several publications have highlighted racial bias in speech-to-text algorithms, and performance on minority accents lags significantly. |
TOBI OLATUNJI et. al. | arxiv-cs.CL | 2023-09-30 |
762 | AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. |
Andrew Rouditchenko; Ronan Collobert; Tatiana Likhomanenko; | arxiv-cs.LG | 2023-09-29 |
763 | SLM: Bridge The Thin Gap Between Speech and Text Foundation Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. |
MINGQIU WANG et. al. | arxiv-cs.CL | 2023-09-29 |
764 | The Gift of Feedback: Improving ASR Model Quality By Learning from User Corrections Through Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. |
LILLIAN ZHOU et. al. | arxiv-cs.CL | 2023-09-29 |
765 | LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. |
GUODONG MA et. al. | arxiv-cs.SD | 2023-09-28 |
766 | HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | arxiv-cs.CL | 2023-09-27 |
767 | Lip2Vec: Efficient and Robust Visual Speech Recognition Via Latent-to-Latent Visual to Audio Representation Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. |
Yasser Abdelaziz Dahou Djilali; Sanath Narayan; Haithem Boussaid; Ebtessam Almazrouei; Merouane Debbah; | iccv | 2023-09-27 |
768 | Speech Collage: Code-switched Audio Generation By Collaging Monolingual Corpora Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. |
AMIR HUSSEIN et. al. | arxiv-cs.SD | 2023-09-27 |
769 | Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in The HYKIST Project Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In today’s interconnected globe, moving abroad is more and more prevalent, whether it’s for employment, refugee resettlement, or other causes. Language difficulties between … |
Khai Le-Duc; | ArXiv | 2023-09-26 |
770 | Updated Corpora and Benchmarks for Long-Form Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we re-release three standard ASR corpora – TED-LIUM 3, GigaSpeech, and VoxPopuli-en – with updated transcription and alignments to enable their use for long-form ASR research. |
JENNIFER DREXLER FOX et. al. | arxiv-cs.CL | 2023-09-26 |
771 | Learning From Flawed Data: Weakly Supervised Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform “non-verbatim” transcription, … |
DONGJI GAO et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-09-26 |
772 | Speech Dereverberation With Frequency Domain Autoregressive Modeling Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to … |
Anurenjan Purushothaman; Debottam Dutta; Rohit Kumar; Sriram Ganapathy; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2023-09-24 |
773 | A Survey of Automatic Speech Recognition Deep Models Performance for Polish Medical Terms Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Among the numerous applications of speech-to-text technology is the support of documentation created by medical personnel. There are many available speech recognition systems for … |
MARTA ZIELONKA et. al. | 2023 Signal Processing: Algorithms, Architectures, … | 2023-09-20 |
774 | AudioFool: Fast, Universal and Synchronization-free Cross-Domain Attack on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research has focused on exploring methods to create such attacks; however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the needed properties of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely the invariance to synchronization, and the robustness to filtering: this allows a Denial-of-Service (DoS) attack against ASR systems. |
Mohamad Fakih; Rouwaida Kanj; Fadi Kurdahi; Mohammed E. Fouda; | arxiv-cs.CR | 2023-09-20 |
775 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
TIANTIAN FENG et. al. | arxiv-cs.SD | 2023-09-19 |
776 | HypR: A Comprehensive Study for ASR Hypothesis Revising with A Reference Corpus Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Accordingly, we first concentrate on providing an ASR hypothesis revising (HypR) dataset in this study. |
Yi-Wei Wang; Ke-Han Lu; Kuan-Yu Chen; | arxiv-cs.CL | 2023-09-18 |
777 | Instruction-Following Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the mechanisms behind these models’ speech understanding and reasoning capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. |
Cheng-I Jeff Lai; Zhiyun Lu; Liangliang Cao; Ruoming Pang; | arxiv-cs.CL | 2023-09-18 |
778 | BIGOS – Benchmark Intended Grouping of Open Speech Corpora for Polish Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a Benchmark Intended Grouping of Open Speech (BIGOS), a new corpus designed for Polish Automatic Speech Recognition (ASR) systems. This initial version of the … |
Michał Junczyk; | 2023 18th Conference on Computer Science and Intelligence … | 2023-09-17 |
779 | Are Soft Prompts Good Zero-shot Learners for Speech Recognition? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). |
DIANWEN NG et. al. | arxiv-cs.SD | 2023-09-17 |
780 | Open Vocabulary Keyword Spotting with Small-Footprint ASR-based Architecture and Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present the results of experiments on minimizing the model size for the text-based Open Vocabulary Keyword Spotting task. The main goal is to perform inference on devices with … |
Mikołaj Pudo; Mateusz Wosik; Artur Janicki; | 2023 18th Conference on Computer Science and Intelligence … | 2023-09-17 |
781 | Augmenting Conformers with Structured State-space Sequence Models for Online Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. |
HAOZHE SHAN et. al. | arxiv-cs.CL | 2023-09-15 |
782 | Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. |
YANG LI et. al. | arxiv-cs.LG | 2023-09-14 |
783 | Enabling Speech Recognition for Lesser-Known Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: An overview of Automatic Speech Recognition (ASR) technology, including its history and challenges in implementing it for regional languages. The challenges include lack of … |
AMIT JHA et. al. | 2023 6th International Conference on Contemporary Computing … | 2023-09-14 |
784 | Echotune: A Modular Extractor Leveraging The Variable-Length Nature of Speech in ASR Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Historically, many approaches have leaned on fixed-length attention windows, which becomes problematic for varied speech samples in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. |
Sizhou Chen; Songyang Gao; Sen Fang; | arxiv-cs.SD | 2023-09-14 |
785 | CPPF: A Contextual and Post-processing-free Model for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we focus on ASR-related processing tasks, including Contextual ASR and multiple ASR post processing tasks. |
LEI ZHANG et. al. | arxiv-cs.CL | 2023-09-13 |
786 | SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. |
HAOXU WANG et. al. | arxiv-cs.SD | 2023-09-11 |
787 | SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches … |
HAOXU WANG et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-09-11 |
788 | Leveraging Large Language Models for Exploiting ASR Uncertainty IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. |
PRANAY DIGHE et. al. | arxiv-cs.CL | 2023-09-09 |
789 | Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. |
Huaibo Zhao; Yosuke Higuchi; Yusuke Kida; Tetsuji Ogawa; Tetsunori Kobayashi; | arxiv-cs.SD | 2023-09-08 |
790 | LanSER: Language-Model Supported Speech Emotion Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. |
TAESIK GONG et. al. | arxiv-cs.CL | 2023-09-07 |
791 | Bring The Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture. |
Patrick Eickhoff; Matthias Möller; Theresa Pekarek Rosin; Johannes Twiefel; Stefan Wermter; | arxiv-cs.CL | 2023-09-05 |
792 | Text-Only Domain Adaptation for End-to-End Speech Recognition Through Down-Sampling Acoustic Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel representation-matching strategy that down-samples acoustic representations to align with the text modality. |
JIAXU ZHU et. al. | arxiv-cs.SD | 2023-09-04 |
793 | Cochlear Filter-Based Cepstral Features for Dysarthric Severity-Level Classification Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Severity-level classification of dysarthria helps in diagnosing a patient and choosing an appropriate course of treatment. This would also aid in redirecting the speech to an … |
Siddharth Rathod; Priyanka Gupta; Aastha Kachhi; Hemant A. Patil; | 2023 31st European Signal Processing Conference (EUSIPCO) | 2023-09-04 |
794 | SememeASR: Boosting Performance of End-to-End Speech Recognition Against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). |
Jiaxu Zhu; Changhe Song; Zhiyong Wu; Helen Meng; | arxiv-cs.SD | 2023-09-04 |
795 | Room Adaptation of Training Data for Distant Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present a novel signal processing-based approach for estimating room impulse responses for augmentation of ASR training data that is best suited to the reverberation … |
James Fosburgh; D. Sharma; P. Naylor; | 2023 31st European Signal Processing Conference (EUSIPCO) | 2023-09-04 |
796 | Boosting Low-Resource Speech Recognition in Air Traffic Communication Via Pretrained Feature Aggregation and Multi-Task Learning IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Developing a robust Automatic Speech Recognition (ASR) system usually requires a large amount of well-annotated samples which is extremely hard to build in the Air Traffic Control … |
Dongyue Guo; Zichen Zhang; Bo Yang; Jianwei Zhang; Yi Lin; | IEEE Transactions on Circuits and Systems II: Express Briefs | 2023-09-01 |
797 | Utilizing Automatic Speech Recognition for English Pronunciation Practice and Analyzing Its Impact Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advancement of AI in recent years has been remarkable, along with the widespread use of speech recognition functions. In addition, an increasing number of people are … |
K. Umezawa; M. Nakazawa; Michiko Nakano; S. Hirasawa; | 2023 IEEE 12th International Conference on Engineering … | 2023-08-29 |
798 | ASTER: Automatic Speech Recognition System Accessibility Testing for Stutterers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenge, we propose ASTER, a technique for automatically testing the accessibility of ASR systems. |
YI LIU et. al. | arxiv-cs.SD | 2023-08-29 |
799 | Naaloss: Rethinking The Objective of Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. |
Kuan-Hsun Ho; En-Lun Yu; Jeih-weih Hung; Berlin Chen; | arxiv-cs.SD | 2023-08-24 |
800 | Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. |
YUCHEN HU et. al. | ijcai | 2023-08-23 |
801 | Convoifilter: A Case Study of Doing Cocktail Party Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. |
Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.SD | 2023-08-22 |
802 | SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. |
SEAMLESS COMMUNICATION et. al. | arxiv-cs.CL | 2023-08-22 |
803 | An Enhanced Method for Dialect Transcription Via Error-correcting Thesaurus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) has been widely used in the field of customer service, but the performance of general ASR in dialect transcription is not satisfactory, … |
Xiaoliang Ma; Congjian Deng; Dequan Du; Qingqi Pei; | IET Commun. | 2023-08-21 |
804 | Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing Based Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View |
ZHENG LIANG et. al. | Interspeech | 2023-08-20 |
805 | Using Commercial ASR Solutions to Assess Reading Skills in Children: A Case Report Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Reading is an acquired skill that is essential for integrating and participating in today’s society. Yet, becoming literate can be particularly laborious for some children. … |
TIMOTHY PITON et. al. | Interspeech | 2023-08-20 |
806 | Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children’s Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children’s speech. One cause is the high … |
Christopher Gebauer; Lars Rumberg; Hanna Ehlert; Ulrike Lüdtke; Jörn Ostermann; | Interspeech | 2023-08-20 |
807 | LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With advances in deep learning methodologies, Automatic Speech Recognition (ASR) systems have seen impressive results. However, ASR in Low-Resource Environments (LREs) are …
Kavan Fatehi; Ayse Kucukyilmaz; | Interspeech | 2023-08-20 |
808 | Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper considers applying speaker diarization (SD) to the output tokens of automatic speech recognition (ASR). We formulate the task to be solved as a sequence classification … |
MIDIA YOUSEFI et. al. | Interspeech | 2023-08-20 |
809 | Improving The Response Timing Estimation for Spoken Dialogue Systems By Reducing The Effect of Speech Recognition Delay Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In conversational systems, the proper timing of the system’s response is critical to maintaining a comfortable conversation. To achieve appropriate timing estimation, it is … |
Jin Sakuma; S. Fujie; Huaibo Zhao; Tetsunori Kobayashi; | Interspeech | 2023-08-20 |
810 | A Conformer-based Classifier for Variable-length Utterance Processing in Anti-spoofing IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The success achieved by conformers in Automatic Speech Recognition (ASR) leads us to their application in other domains, such as spoofing detection for automatic speaker … |
Eros Rosello; Alejandro Gomez-Alanis; A. Gómez; A. Peinado; | Interspeech | 2023-08-20 |
811 | Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses effective pretraining of automatic speech recognition (ASR) and gender recognition to improve wav2vec 2.0 embedding for speech emotion recognition (SER). … |
Yuan Gao; Chenhui Chu; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
812 | On Training A Neural Residual Acoustic Echo Suppressor for Improved ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Acoustic Echo Cancellation (AEC) is critical for accurate recognition of speech directed at a smart device playing audio. Previous work has shown that neural AEC models can …
S. Panchapagesan; T. Shabestary; A. Narayanan; | Interspeech | 2023-08-20 |
813 | Noise-Robust Bandwidth Expansion for 8K Speech Recordings Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recordings in call centers are narrowband and mixed with various noises. Developing a bandwidth expansion (BWE) model is important to mitigate the automated speech … |
YIN-TSE LIN et. al. | Interspeech | 2023-08-20 |
814 | A Neural Time Alignment Module for End-to-End Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end trainable (E2E) automatic speech recognition (ASR) systems have low word error rates, but they do not model timings or silence by default unlike hidden Markov model … |
Dongcheng Jiang; C. Zhang; P. Woodland; | Interspeech | 2023-08-20 |
815 | Joint Blind Source Separation and Dereverberation for Automatic Speech Recognition Using Delayed-Subsource MNMF with Localization Prior Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Overlapping speech and high room reverberation deteriorate the accuracy of automatic speech recognition (ASR). This paper proposes a method for jointly optimum source separation … |
Mieszko Fraś; Marcin Witkowski; K. Kowalczyk; | Interspeech | 2023-08-20 |
816 | Improving Joint Speech and Emotion Recognition Using Global Style Tokens Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) and speech emotion recognition (SER) are closely related in that the acoustic features of speech, such as pitch, tone, and intensity, can vary … |
Jehyun Kyung; Ju-Seok Seong; Jeonghwan Choi; Ye-Rin Jeoung; Joon‐Hyuk Chang; | Interspeech | 2023-08-20 |
817 | Adapter-tuning with Effective Token-dependent Representation Shift for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The use of self-supervised pre-trained speech models has greatly improved speech tasks in low-resource settings. However, fine-tuning the entire model can be computationally … |
DIANWEN NG et. al. | Interspeech | 2023-08-20 |
818 | Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper proposes autoregressive modeling of the joint multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a … |
Naoki Makishima; Keita Suzuki; Satoshi Suzuki; Atsushi Ando; Ryo Masumura; | Interspeech | 2023-08-20 |
819 | Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Knowledge about phonemes and their articulatory attributes can help improve automatic speech recognition (ASR) of low-resource languages. In this study, we propose a simple and … |
Jaeyoung Lee; M. Mimura; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
820 | Unsupervised Code-switched Text Generation from Parallel Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual … |
JI-EUN CHI et. al. | Interspeech | 2023-08-20 |
821 | Automatic Speaker Recognition with Variation Across Vocal Conditions: A Controlled Experiment with Implications for Forensics Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speaker Recognition (ASR) involves a complex range of processes to extract, model, and compare speaker-specific information from a pair of voice samples. Using heavily … |
VINCENT HUGHES et. al. | Interspeech | 2023-08-20 |
822 | Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We release 840 hours of read speech multi-dialect ASR corpora consisting of 700 hours of main Thai dialect, named Thai-central, and 40 hours for each local dialect, named …
Artit Suwanbandit; Burin Naowarat; Orathai Sangpetch; E. Chuangsuwanich; | Interspeech | 2023-08-20 |
823 | ASR for Low Resource and Multilingual Noisy Code-Mixed Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Developing reliable Automatic Speech Recognition (ASR) system for Indian Languages has been challenging due to the limited availability of large-scale, high-quality speech …
Tushar Verma; Atul Shree; Ashutosh Modi; | Interspeech | 2023-08-20 |
824 | Effective Training of Attention-based Contextual Biasing Adapters with Synthetic Audio for Personalised ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Contextual biasing (CB) is an effective approach for contextualising hidden features of neural transducer ASR models to improve rare word recognition. CB relies on relatively …
Burin Naowarat; Philip Harding; Pasquale D’Alterio; Sibo Tong; Bashar Awwad Shiekh Hasan; | Interspeech | 2023-08-20 |
825 | I Learned Error, I Can Fix It! : A Detector-Corrector Structure for ASR Error Calibration Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recognition technology has improved recently. However, in the context of spoken language understanding (SLU), containing automatic speech recognition (ASR) errors causes … |
Heuiyeen Yeen; Minju Kim; M. Koo; | Interspeech | 2023-08-20 |
826 | Wav2vec 2.0 ASR for Cantonese-Speaking Older Adults in A Clinical Setting Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The lack of large-scale speech corpora for Cantonese and older adults has impeded the academia’s research of automatic speech recognition (ASR) systems for the two. On the other … |
Ranzo Huang; B. Mak; | Interspeech | 2023-08-20 |
827 | Speech-in-Speech Recognition Is Modulated By Familiarity to Dialect Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Listening to speech in competing background speech can be difficult due to elements such as the linguistic content of the signal. Linguistic release from masking occurs when … |
Jessica L. L. Chin; Elena Talevska; M. Antoniou; | Interspeech | 2023-08-20 |
828 | UniSplice: Universal Cross-Lingual Data Splicing for Low-Resource ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) automatic speech recognition (ASR) has made remarkable progress thanks to the abundant annotated data for a few rich-resource languages. However, data scarcity … |
Wei Wang; Y. Qian; | Interspeech | 2023-08-20 |
829 | Human Transcription Quality Improvement Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to … |
Jian Gao; Hanbo Sun; Cheng Cao; Zheng Du; | Interspeech | 2023-08-20 |
830 | Whisper Features for Dysarthric Severity-Level Classification Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Dysarthria is a speech disorder caused by improper coordination between the brain and the muscles that produce intelligible speech. Accurately diagnosing the severity of … |
Siddharth Rathod; Monil Charola; Akshat Vora; Yash Jogi; H. Patil; | Interspeech | 2023-08-20 |
831 | Silent Speech Recognition with Articulator Positions Estimated from Tongue Ultrasound and Lip Video Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present a multi-speaker silent speech recognition system trained on articulator features derived from the Tongue and Lips corpus, a multi-speaker corpus of ultrasound tongue … |
Rachel Beeson; Korin Richmond; | Interspeech | 2023-08-20 |
832 | An Improved End-to-End Audio-Visual Speech Recognition Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: By incorporating lip language, audio-visual speech recognition can effectively improve the recognition effect in noisy environments, and will slightly improve the recognition … |
Sheng Yang; Zheng Gong; Jiacang Kang; | Interspeech | 2023-08-20 |
833 | Automatic Speaker Recognition Performance with Matched and Mismatched Female Bilingual Speech Data Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Validation of forensic voice comparison methods requires testing using speech samples that are representative of forensic casework conditions. Increasingly, around the world, … |
Bryony Nuttall; Philip Harrison; Vincent Hughes; | Interspeech | 2023-08-20 |
834 | MiniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming ASR Applications on The Edge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Real-time applications of Automatic Speech Recognition (ASR) on user devices on the edge require streaming processing. Conformer model has achieved state-of-the-art performance in … |
Haris Gulzar; Monikka Roslianna Busto; Takeharu Eda; Katsutoshi Itoyama; K. Nakadai; | Interspeech | 2023-08-20 |
835 | Information Magnitude Based Dynamic Sub-sampling for Speech-to-text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Attention-based models have achieved new state-of-the-art in many tasks while the computational cost of these models increases drastically compared with previous methods. For most … |
YUHAO ZHANG et. al. | Interspeech | 2023-08-20 |
836 | Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) Automatic Speech Recognition (ASR) has gained popularity in recent years, with most research focusing on designing novel neural network architectures, speech … |
Zeyu Zhao; P. Bell; | Interspeech | 2023-08-20 |
837 | Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speakers with dysarthria could particularly benefit from assistive speech technology, but are underserved by current automatic speech recognition (ASR) systems. The differences of … |
Enno Hermann; Mathew Magimai; | Interspeech | 2023-08-20 |
838 | Dialect Speech Recognition Modeling Using Corpus of Japanese Dialects and Self-Supervised Learning-based Model XLSR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In order to utilize the large amount of historical speech resources for applications such as linguistic analysis and retrieval, automatic speech recognition technology that can … |
Shogo Miwa; A. Kai; | Interspeech | 2023-08-20 |
839 | TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TokenSplit, a speech separation model that acts on discrete token sequences. |
HAKAN ERDOGAN et. al. | arxiv-cs.SD | 2023-08-20 |
840 | Speech Emotion Recognition Using Decomposed Speech Via Multi-task Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In speech emotion recognition, most recent studies used powerful models to obtain robust features without considering the disentangled components, which contain diverse … |
Jia-Hao Hsu; C. Wu; Yunchao Wei; | Interspeech | 2023-08-20 |
841 | Exploring Sources of Racial Bias in Automatic Speech Recognition Through The Lens of Rhythmic Variation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although studies have shown that one issue of bias in modern automatic speech recognition (ASR) technologies is degraded performance for African American English (AAE) speakers, … |
Li-Fang Lai; N. Holliday; | Interspeech | 2023-08-20 |
842 | Domain Adaptive Self-supervised Training of Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper explores domain adaptive self-supervised training of automatic speech recognition (ASR). Unlabeled data from the target domain can either be used in training the … |
Cong-Thanh Do; R. Doddipatla; Mohan Li; Thomas Hain; | Interspeech | 2023-08-20 |
843 | Data Augmentation for Children ASR and Child-adult Speaker Classification Using Voice Conversion Methods Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Many young children prefer speech based interfaces over text, as they are relatively slow and error-prone with text input. However, children ASR can be challenging due to the lack … |
Shuyang Zhao; Mittul Singh; Abraham Woubie; Reima Karhila; | Interspeech | 2023-08-20 |
844 | Bayes Risk Transducer: Transducer with Controllable Alignment Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. |
JINCHUAN TIAN et. al. | arxiv-cs.CL | 2023-08-19 |
845 | Assessment of L2 Oral Proficiency Using Self-Supervised Speech Representation Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: A standard pipeline for automated spoken language assessment is to start with an automatic speech recognition (ASR) system and derive features that exploit transcriptions and … |
Stefano Bannò; K. Knill; M. Matassoni; Vyas Raina; M. Gales; | Slate | 2023-08-18 |
846 | Accurate Synthesis of Dysarthric Speech for ASR Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. |
Mohammad Soleymanpour; Michael T. Johnson; Rahim Soleymanpour; Jeffrey Berry; | arxiv-cs.SD | 2023-08-16 |
847 | Radio2Text: Streaming Speech Recognition Using MmWave Radio Signals Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, … |
Running Zhao; Luca Jiang-Tao Yu; H. Zhao; Edith C. H. Ngai; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2023-08-16 |
848 | An Ambient Intelligence-based Approach For Longitudinal Monitoring of Verbal and Vocal Depression Symptoms Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Another major challenge in depression relapse research is the scarcity of publicly available datasets. To overcome these issues, we propose a one-shot learning framework for detecting depression relapse from speech. |
Alice Othmani; Muhammad Muzammel; | arxiv-cs.HC | 2023-08-16 |
849 | Radio2Text: Streaming Speech Recognition Using MmWave Radio Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. |
Running Zhao; Jiangtao Yu; Hang Zhao; Edith C. H. Ngai; | arxiv-cs.SD | 2023-08-15 |
850 | Using Text Injection to Improve Recognition of Personal Identifiers in Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. |
YOCHAI BLAU et. al. | arxiv-cs.CL | 2023-08-14 |
851 | Text Injection for Capitalization and Turn-Taking Prediction in Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. |
SHAAN BIJWADIA et. al. | arxiv-cs.CL | 2023-08-14 |
852 | A Novel Self-training Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. |
Satwinder Singh; Feng Hou; Ruili Wang; | arxiv-cs.CL | 2023-08-09 |
853 | Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). |
Yang Zhang; Krishna C. Puvvada; Vitaly Lavrukhin; Boris Ginsburg; | arxiv-cs.SD | 2023-08-09 |
854 | The Role of Audio Features in Accent Recognition: A Comparative Analysis Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study focuses on enhancing Automatic Speech Recognition (ASR) systems, crucial in Science and Technology, by addressing challenges tied to diverse speaker accents. The … |
Anik Biswas; | 2023 International Workshop on Intelligent Systems (IWIS) | 2023-08-09 |
855 | Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. |
JIAXIN FAN et. al. | arxiv-cs.CL | 2023-08-07 |
856 | Federated Representation Learning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. |
GURUPRASAD V RAMESH et. al. | arxiv-cs.SD | 2023-08-03 |
857 | Inaudible Adversarial Perturbation: Manipulating The Recognition of User Speech in Real Time Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we seek to bridge the gap in existing research and extend the attack to user-present scenarios. |
XINFENG LI et. al. | arxiv-cs.CR | 2023-08-02 |
858 | ÌròyìnSpeech: A Multi-purpose Yorùbá Speech Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. |
Tolulope Ogunremi; Kola Tubosun; Anuoluwapo Aremu; Iroro Orife; David Ifeoluwa Adelani; | arxiv-cs.CL | 2023-07-29 |
859 | The Timing Bottleneck: Why Timing and Overlap Are Mission-critical for Conversational User Interfaces, Speech Recognition and Dialogue Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). |
Andreas Liesenfeld; Alianda Lopez; Mark Dingemanse; | arxiv-cs.CL | 2023-07-28 |
860 | Cascaded Cross-Modal Transformer for Request and Complaint Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our … |
Nicolae-Cătălin Ristea; Radu Tudor Ionescu; | Proceedings of the 31st ACM International Conference on … | 2023-07-27 |
861 | Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We discuss open problems and challenges with respect to modeling spoken information queries for virtual assistants, and list opportunities where Information Retrieval methods and research can be applied to improve the quality of virtual assistant speech recognition. |
Christophe Van Gysel; | sigir | 2023-07-25 |
862 | Boosting Punctuation Restoration with Data Generation and Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While punctuated texts are abundant from written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. |
VIET DAC LAI et. al. | arxiv-cs.CL | 2023-07-24 |
863 | Adaptation of Whisper Models to Child Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Automatic Speech Recognition (ASR) systems often struggle with transcribing child speech due to the lack of large child speech datasets required to accurately train child-friendly … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | ArXiv | 2023-07-24 |
864 | Code-Switched Urdu ASR for Noisy Telephonic Environment Using Data Centric Approach with Hybrid HMM and CNN-TDNN Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this paper describes an implementation framework of a resource-efficient Automatic Speech Recognition / Speech-to-Text system in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. |
Muhammad Danyal Khan; Raheem Ali; Arshad Aziz; | arxiv-cs.CL | 2023-07-24 |
865 | Robust Automatic Speech Recognition Via WavAugment Guided Phoneme Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Developing a practically-robust automatic speech recognition (ASR) is challenging since the model should not only maintain the original performance on clean samples, but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). |
GEGE QI et. al. | arxiv-cs.SD | 2023-07-23 |
866 | Exploring The Integration of Speech Separation and Recognition with Self-Supervised Learning Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. |
YOSHIKI MASUYAMA et. al. | arxiv-cs.SD | 2023-07-23 |
867 | A Meta Learning Scheme for Fast Accent Domain Expansion in Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce meta-learning techniques for fast accent domain expansion in Mandarin speech recognition, which expands the field of accents without deteriorating the performance of Mandarin ASR. |
Ziwei Zhu; Changhao Shan; Bihong Zhang; Jian Yu; | arxiv-cs.SD | 2023-07-23 |
868 | Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. |
SUYOUN KIM et. al. | arxiv-cs.CL | 2023-07-22 |
869 | A Change of Heart: Improving Speech Emotion Recognition Through Speech-to-Text Modality Conversion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset. |
Zeinab Sadat Taghavi; Ali Satvaty; Hossein Sameti; | arxiv-cs.SD | 2023-07-21 |
870 | A Deep Dive Into The Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of ~9.8K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. |
Anand Kumar Rai; Siddharth D Jaiswal; Animesh Mukherjee; | arxiv-cs.CL | 2023-07-20 |
871 | Room Acoustic Characterization with Smartphone-Based Automated Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Characterizing and monitoring the acoustic quality of a room is important for maintaining effective speech communication. Noise and echoes make speech harder to perceive, … |
Brady Laska; Bruce Wallace; Abagael Hudak; Rafik Goubran; | 2023 IEEE Sensors Applications Symposium (SAS) | 2023-07-18 |
872 | Ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: We introduce ivrit.ai, a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) … |
Yanir Marmor; Kinneret Misgav; Y. Lifshitz; | ArXiv | 2023-07-17 |
873 | Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay for continual learning. |
Theresa Pekarek Rosin; Stefan Wermter; | arxiv-cs.CL | 2023-07-14 |
874 | SGGNet²: Speech-Scene Graph Grounding Network for Speech-guided Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel speech-scene graph grounding network (SGGNet²) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. |
DOHYUN KIM et. al. | arxiv-cs.RO | 2023-07-14 |
875 | SGGNet2: Speech-Scene Graph Grounding Network for Speech-guided Navigation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately … |
DOHYUN KIM et. al. | 2023 32nd IEEE International Conference on Robot and Human … | 2023-07-14 |
876 | Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. |
He Huang; Jagadeesh Balam; Boris Ginsburg; | arxiv-cs.CL | 2023-07-13 |
877 | Exploring The Integration of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. |
Zeping Min; Jinbo Wang; | arxiv-cs.CL | 2023-07-12 |
878 | Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a computation-efficient network named Language-Routing Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. |
Wenxuan Wang; Guodong Ma; Yuke Li; Binbin Du; | arxiv-cs.SD | 2023-07-12 |
879 | SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya; | arxiv-cs.CL | 2023-07-12 |
880 | The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. |
KUN SONG et. al. | arxiv-cs.SD | 2023-07-10 |
881 | A Theory of Unsupervised Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a general theoretical framework to study the properties of unsupervised ASR (ASR-U) systems based on random matrix theory and the theory of neural tangent kernels. |
Liming Wang; Mark Hasegawa-Johnson; Chang Yoo; | acl | 2023-07-08 |
882 | Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we investigate whether data augmentation techniques could help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). |
Martijn Bartelds; Nay San; Bradley McDonnell; Dan Jurafsky; Martijn Wieling; | acl | 2023-07-08 |
883 | Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. |
YUN TANG et. al. | acl | 2023-07-08 |
884 | DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. |
SURAJ KOTHAWADE et. al. | acl | 2023-07-08 |
885 | BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. |
MINGDA CHEN et. al. | acl | 2023-07-08 |
886 | Introducing Semantics Into Speech Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. |
DEREK XU et. al. | acl | 2023-07-08 |
887 | STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present STT4SG-350, a corpus of Swiss German speech, annotated with Standard German text at the sentence level. |
MICHEL PLÜSS et. al. | acl | 2023-07-08 |
888 | Why Aren't We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we examine in detail the complex relationship between ASR and NER errors which limit the ability of NER models to recover entity mentions from spontaneous speech transcripts. |
PIOTR SZYMANSKI et. al. | acl | 2023-07-08 |
889 | Back Translation for Speech-to-text Translation Without Transcripts IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. |
Qingkai Fang; Yang Feng; | acl | 2023-07-08 |
890 | Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. |
SARA PAPI et. al. | arxiv-cs.CL | 2023-07-06 |
891 | NLP Based Model to Convert English Speech to Gujarati Text for Deaf & Dumb People Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: NLP (Natural Language Processing) is used to process human language. It is a combination of linguistics and Artificial Intelligence. This paper helps in understanding … |
Nasrin Aasofwala; Shanti Verma; Kalyani Patel; | 2023 14th International Conference on Computing … | 2023-07-06 |
892 | Transcribing Educational Videos Using Whisper: A Preliminary Study on Using AI for Transcribing Educational Videos Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Videos are increasingly being used for e-learning, and transcripts are vital to enhance the learning experience. The costs and delays of generating transcripts can be alleviated … |
Ashwin Rao; | ArXiv | 2023-07-04 |
893 | Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input. |
Guangzhi Sun; Chao Zhang; Ivan Vulić; Paweł Budzianowski; Philip C. Woodland; | arxiv-cs.CL | 2023-07-04 |
894 | Boosting Norwegian Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk. |
Javier de la Rosa; Rolv-Arild Braaten; Per Egil Kummervold; Freddy Wetjen; Svein Arne Brygfjeld; | arxiv-cs.CL | 2023-07-04 |
895 | Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. |
Tanvina Patel; Odette Scharenborg; | arxiv-cs.CL | 2023-07-04 |
896 | Using Open-Source Automatic Speech Recognition Tools for The Annotation of Dutch Infant-Directed Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: There is a large interest in the annotation of speech addressed to infants. Infant-directed speech (IDS) has acoustic properties that might pose a challenge to automatic speech … |
Anika van der Klis; Frans Adriaans; Mengru Han; R. Kager; | Multimodal Technol. Interact. | 2023-07-03 |
897 | Speech Topic Classification Based on Pre-trained and Graph Networks Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech Topic Classification (STC) automatically classifies audio clips into predefined categories, which is widely used in short video, personalized recommendation and other … |
Fangjing Niu; Tengfei Cao; Ying Hu; Hao Huang; Liang He; | 2023 IEEE International Conference on Multimedia and Expo … | 2023-07-01 |
898 | Don’t Stop Self-Supervision: Accent Adaptation of Speech Representations Via Residual Adapters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. |
ANSHU BHATIA et. al. | arxiv-cs.CL | 2023-07-01 |
899 | Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. |
Simone Wills; Yu Bai; Cristian Tejedor-Garcia; Catia Cucchiarini; Helmer Strik; | arxiv-cs.CL | 2023-06-29 |
900 | Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). |
Yuang Li; Yu Wu; Jinyu Li; Shujie Liu; | arxiv-cs.CL | 2023-06-28 |
901 | Accelerating Transducers Through Adjacent Token Merging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this design is inefficient, particularly for long speech signals due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. |
Yuang Li; Yu Wu; Jinyu Li; Shujie Liu; | arxiv-cs.CL | 2023-06-28 |
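The merging idea in A-ToMe can be illustrated with a toy sketch. Assumptions: cosine similarity over raw frame embeddings stands in for the paper's key-value similarity scores, merged tokens are simple averages, and the threshold is illustrative.

```python
import math

def merge_adjacent_tokens(frames, threshold=0.9):
    """Greedily merge adjacent frame embeddings whose cosine similarity
    exceeds `threshold`, averaging each merged pair (one left-to-right pass)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    merged = [list(frames[0])]
    for frame in frames[1:]:
        if cosine(merged[-1], frame) > threshold:
            # Highly similar neighbor: collapse it into the running token.
            merged[-1] = [(x + y) / 2 for x, y in zip(merged[-1], frame)]
        else:
            merged.append(list(frame))
    return merged
```

Shortening the token sequence this way reduces the quadratic self-attention cost on long speech inputs, at the price of coarser temporal resolution where frames are near-duplicates.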
902 | Cascaded Encoders for Fine-tuning ASR Models on Overlapped Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. |
Richard Rose; Oscar Chang; Olivier Siohan; | arxiv-cs.SD | 2023-06-28 |
903 | Don’t Be So Sure! Boosting ASR Decoding Via Confidence Relaxation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We perform a layer analysis to reveal and visualize how predictions evolve, and propose a decoding procedure that improves the performance of fine-tuned ASR models. |
Tomer Wullach; Shlomo E. Chazan; | aaai | 2023-06-26 |
904 | Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has a much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in developing visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). |
Minsu Kim; Chae Won Kim; Yong Man Ro; | aaai | 2023-06-26 |
905 | Performance Disparities Between Accents in Automatic Speech Recognition (Student Abstract) Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this work, we expand the discussion of bias in Automatic Speech Recognition (ASR) through a large-scale audit. Using a large and global data set of speech, we perform an audit … |
Alex DiChristofano; Henry Shuster; Shefali Chandra; Neal Patwari; | AAAI Conference on Artificial Intelligence | 2023-06-26 |
906 | Complex Dynamic Neurons Improved Spiking Transformer Network for Efficient Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we introduce four types of neuronal dynamics to post-process the sequential patterns generated from the spiking transformer to get the complex dynamic neuron improved spiking transformer neural network (DyTr-SNN). |
QINGYU WANG et. al. | aaai | 2023-06-26 |
907 | An Analysis of Personalized Speech Recognition System Development for The Deaf and Hard-of-Hearing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To do so, we analyze the use of openly-available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. As these out-of-the-box ASR models typically do not perform well on DHH speech, we provide a thorough analysis of creating personalized ASR systems. |
Lester Phillip Violeta; Tomoki Toda; | arxiv-cs.SD | 2023-06-24 |
908 | Mixture Encoder for Joint Speech Separation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes a middle-ground approach that leverages explicit speech separation similarly to the modular approach but also incorporates mixture speech information directly into the ASR module in order to mitigate the propagation of errors made by the speech separator. |
Simon Berger; Peter Vieting; Christoph Boeddeker; Ralf Schlüter; Reinhold Haeb-Umbach; | arxiv-cs.CL | 2023-06-21 |
909 | NoRefER: A Referenceless Quality Metric for Automatic Speech Recognition Via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces NoRefER, a novel referenceless quality metric for automatic speech recognition (ASR) systems. |
Kamer Ali Yuksel; Thiago Ferreira; Golara Javadi; Mohamed El-Badrashiny; Ahmet Gunduz; | arxiv-cs.CL | 2023-06-21 |
910 | Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a Conformer-based architecture, called Aformer, to leverage acoustic information from both large non-accented and limited accented training data. |
Xuefei Wang; Yanhua Long; Yijie Li; Haoran Wei; | arxiv-cs.SD | 2023-06-20 |
911 | Improved Keyword Recognition Based on Aho-Corasick Automaton Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The recognition of out-of-vocabulary (OOV) words in many state-of-the-art automatic speech recognition (ASR) systems, which need to recognize a word that has never been seen … |
Yachao Guo; Zhibin Qiu; Hao Huang; Chng Eng Siong; | 2023 International Joint Conference on Neural Networks … | 2023-06-18 |
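The Aho-Corasick automaton this title builds on is a classic multi-pattern string matcher. A minimal sketch of the standard algorithm (not the paper's improvement) that spots every keyword occurrence in a transcript in a single pass:

```python
from collections import deque

def build_ac(patterns):
    """Build the Aho-Corasick automaton: trie transitions, failure links,
    and the set of patterns recognized at each node."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # BFS to set failure links (longest proper suffix also in the trie).
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            q.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    """Return (start_index, pattern) for every keyword occurrence in text."""
    goto, fail, out = build_ac(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Matching runs in time linear in the transcript length plus the number of matches, which is why the automaton is attractive for spotting OOV keywords in ASR output.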
912 | A Comparative Analysis of Automatic Speech Recognition Errors in Small Group Classroom Discourse IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In collaborative learning environments, effective intelligent learning systems need to accurately analyze and understand the collaborative discourse between learners (i.e., group … |
JIE CAO et. al. | Proceedings of the 31st ACM Conference on User Modeling, … | 2023-06-18 |
913 | Research on An Improved Conformer End-to-end Speech Recognition Model with R-Drop Structure Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the issue of poor generalization ability in end-to-end speech recognition models within deep learning, this study proposes a new Conformer-based speech recognition model called Conformer-R that incorporates the R-drop structure. |
Weidong Ji; Shijie Zan; Guohui Zhou; Xu Wang; | arxiv-cs.SD | 2023-06-14 |
914 | Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing Based Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the current data augmentation methods mainly rely on audio splicing and text-to-speech (TTS) models, which might result in discontinuous, unrealistic, and less diversified speech. To mitigate these potential issues, we propose a novel data augmentation method by applying the text-based speech editing model. |
ZHENG LIANG et. al. | arxiv-cs.CL | 2023-06-14 |
915 | Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, a novel multilingual model fusion technique has been proposed where a model is trained to learn cross-lingual acoustic-phonetic similarities as a mapping function. |
Muhammad Umar Farooq; Thomas Hain; | arxiv-cs.CL | 2023-06-14 |
916 | Performance of Speech Recognition Algorithms in Musical Speech Used for Speech-Language Pathology Rehabilitation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Musical speech in speech-language pathology rehabilitation is the production of speech following simple musical (rhythmic or melodic) patterns. This type of speech is used to … |
Pedram Aliniaye Asli; A. Zumbansen; | 2023 IEEE International Symposium on Medical Measurements … | 2023-06-14 |
917 | IIITH-CSTD Corpus: Crowdsourced Strategies for The Collection of A Large-scale Telugu Speech Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Due to the lack of a large annotated speech corpus, many low-resource Indian languages struggle to utilize recent advancements in deep neural network architectures for Automatic … |
MIRISHKAR SAI GANESH et. al. | ACM Transactions on Asian and Low-Resource Language … | 2023-06-12 |
918 | Multimodal Audio-textual Architecture for Robust Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Because such an approach relies on the ASR output, it often suffers from the so-called ASR error propagation. In this work, we investigate the impact of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLM), such as BERT and RoBERTa. |
Anderson R. Avila; Mehdi Rezagholizadeh; Chao Xing; | arxiv-cs.CL | 2023-06-11 |
919 | Adversarial Training For Low-Resource Disfluency Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) that utilizes a small amount of labeled real disfluent data in conjunction with a large amount of unlabeled data. |
Vineet Bhat; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2023-06-10 |
920 | Developing Speech Processing Pipelines for Police Accountability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate the potential of large pre-trained speech models for facilitating reviews, focusing on ASR and officer speech detection in footage from traffic stops. |
Anjalie Field; Prateek Verma; Nay San; Jennifer L. Eberhardt; Dan Jurafsky; | arxiv-cs.CL | 2023-06-09 |
921 | Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many … |
Xianzhao Chen; Yist Y. Lin; Kang Wang; Yi He; Zejun Ma; | ArXiv | 2023-06-09 |
922 | Latent Phrase Matching for Dysarthric Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Many consumer speech recognition systems are not tuned for people with speech disabilities, resulting in poor recognition and user experience, especially for severe speech … |
COLIN S. LEA et. al. | ArXiv | 2023-06-08 |
923 | An ASR-Based Tutor for Learning to Read: How to Optimize Feedback to First Graders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In a previous study, we presented an ASR-based Dutch reading tutor application that was developed to provide instantaneous feedback to first-graders learning to read. |
Yu Bai; Cristian Tejedor-Garcia; Ferdy Hubers; Catia Cucchiarini; Helmer Strik; | arxiv-cs.CL | 2023-06-07 |
924 | Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new lenient evaluation metric as a more defensible CER measure for Japanese ASR. |
Shigeki Karita; Richard Sproat; Haruko Ishikawa; | arxiv-cs.CL | 2023-06-07 |
925 | Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to improve the performance of Arabic dysarthric automatic speech recognition through a multi-stage augmentation approach. |
Massa Baali; Ibrahim Almakky; Shady Shehata; Fakhri Karray; | arxiv-cs.SD | 2023-06-07 |
926 | Label Aware Speech Representation Learning For Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. |
SHIKHAR VASHISHTH et. al. | arxiv-cs.CL | 2023-06-07 |
927 | A Study on The Impact of Self-Supervised Learning on Automatic Dysarthric Speech Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, achieving +24.7%, +61%, and +7.2% accuracy, respectively, compared to classical acoustic features. |
Xavier F. Cadet; Ranya Aloufi; Sara Ahmadi-Abhari; Hamed Haddadi; | arxiv-cs.CL | 2023-06-07 |
928 | Alzheimer Disease Classification Through ASR-based Transcriptions: Exploring The Impact of Punctuation and Pauses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we used the new state-of-the-art Automatic Speech Recognition (ASR) model Whisper to obtain the transcriptions, which also include automatic punctuation. |
LUCÍA GÓMEZ-ZARAGOZÁ et. al. | arxiv-cs.CL | 2023-06-06 |
929 | N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it is not clear how Whisper would fare under diverse conditions even on languages it was evaluated on, such as Arabic. In this work, we address this gap by comprehensively evaluating Whisper on several varieties of Arabic speech for the ASR task. |
Bashar Talafha; Abdul Waheed; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2023-06-05 |
930 | SpellMapper: A Non-autoregressive Neural Spellchecker for ASR Customization with Candidate Retrieval Based on N-gram Mappings Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose: 1) a novel algorithm for candidate retrieval, based on misspelled n-gram mappings, which gives up to 90% recall with just the top 10 candidates on Spoken Wikipedia; 2) a non-autoregressive neural model based on BERT architecture, where the initial transcript and ten candidates are combined into one input. |
Alexandra Antonova; Evelina Bakhturina; Boris Ginsburg; | arxiv-cs.CL | 2023-06-04 |
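The candidate-retrieval step can be illustrated with a simplified sketch: index a custom vocabulary by character n-grams and rank candidates by n-gram overlap with a (possibly misrecognized) fragment. Note this uses exact shared n-grams, whereas SpellMapper learns mappings between misspelled n-grams; the function names and vocabulary below are illustrative.

```python
from collections import defaultdict

def char_ngrams(word, n=3):
    """Character n-grams of a word, with '#' boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(vocabulary, n=3):
    """Inverted index: character n-gram -> set of vocabulary words."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index

def retrieve_candidates(fragment, index, n=3, top_k=10):
    """Rank vocabulary words by how many character n-grams they share
    with the fragment; ties broken alphabetically."""
    scores = defaultdict(int)
    for gram in char_ngrams(fragment, n):
        for word in index.get(gram, ()):
            scores[word] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

The retrieved shortlist would then be handed to a rescoring model (in the paper, the BERT-based non-autoregressive corrector) rather than used directly.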
931 | End-to-End Joint Target and Non-Target Speakers ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker’s speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. |
RYO MASUMURA et. al. | arxiv-cs.CL | 2023-06-04 |
932 | A Reference-Less Quality Metric for Automatic Speech Recognition Via Contrastive-Learning of A Multi-Language Model with Self-Supervision Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual … |
K. Yuksel; Thiago Castro Ferreira; Ahmet Gunduz; Mohamed Al-Badrashiny; Golara Javadi; | 2023 IEEE International Conference on Acoustics, Speech, … | 2023-06-04 |
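The reference-based WER that this referenceless metric is contrasted with is a standard Levenshtein alignment over word sequences; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / max(len(ref), 1)
```

Because this requires a manual reference transcript for every utterance, referenceless metrics such as the one above aim to estimate quality without it.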
933 | Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limited availability of non-native speech datasets presents a major challenge in automatic speech recognition (ASR) to narrow the performance gap between native and non-native speakers. To address this, the focus of this study is on the efficient incorporation of the L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis. |
Jisung Wang; Haram Lee; Myungwoo Oh; | arxiv-cs.CL | 2023-06-04 |
934 | Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Combining several active learning paradigms and the core-set approach, we propose a new multi-round adaptation process that uses epistemic uncertainty to automate the annotation process, significantly reducing the associated costs and human labor. |
Bonaventure F. P. Dossou; | arxiv-cs.CL | 2023-06-03 |
935 | Adaptation and Optimization of Automatic Speech Recognition (ASR) for The Maritime Domain in The Field of VHF Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communication that automatically converts received VHF radio signals into text. |
Emin Cagatay Nakilcioglu; Maximilian Reimann; Ole John; | arxiv-cs.SD | 2023-06-01 |
936 | Inspecting Spoken Language Understanding from Kids for Basic Math Learning at Home Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for Kid Space, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. |
Eda Okur; Roddy Fuentes Alba; Saurav Sahay; Lama Nachman; | arxiv-cs.CY | 2023-06-01 |
937 | SlothSpeech: Denial-of-service Attack Against Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose SlothSpeech, a denial-of-service attack against ASR models, which exploits the dynamic behaviour of the model. |
MIRAZUL HAQUE et. al. | arxiv-cs.SD | 2023-06-01 |
938 | Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in … |
DONGJI GAO et. al. | ArXiv | 2023-06-01 |
939 | Towards Hate Speech Detection in Low-resource Languages: Comparing ASR to Acoustic Word Embeddings on Wolof and Swahili Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We specifically use a multilingual AWE model trained on labelled data from well-resourced languages to spot keywords in data in the unseen target language. |
Christiaan Jacobs; Nathanaël Carraz Rakotonirina; Everlyn Asiko Chimoto; Bruce A. Bassett; Herman Kamper; | arxiv-cs.CL | 2023-06-01 |
940 | Strategies for Improving Low Resource Speech to Text Translation Relying on Pre-trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST). |
Santosh Kesiraju; Marek Sarvas; Tomas Pavlicek; Cecile Macaire; Alejandro Ciuba; | arxiv-cs.CL | 2023-05-31 |
941 | The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, new approaches are explored and compared to improve the performance of the CLS-based multilingual ASR model. |
Kaousheik Jayakumar; Vrunda N. Sukhadia; A Arunkumar; S. Umesh; | arxiv-cs.CL | 2023-05-31 |
942 | Zero-Shot Automatic Pronunciation Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT. |
Hongfu Liu; Mingqian Shi; Ye Wang; | arxiv-cs.SD | 2023-05-31 |
943 | Accurate and Structured Pruning for Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance. |
HUIQIANG JIANG et. al. | arxiv-cs.CL | 2023-05-31 |
944 | Towards Selection of Text-to-speech Data to Augment ASR Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic … |
SHUO LIU et. al. | ArXiv | 2023-05-30 |
945 | STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present STT4SG-350 (Speech-to-Text for Swiss German), a corpus of Swiss German speech, annotated with Standard German text at the sentence level. |
MICHEL PLÜSS et. al. | arxiv-cs.CL | 2023-05-30 |
946 | Exploration of Efficient End-to-End ASR Using Discretized Input from Self-Supervised Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new protocol that utilizes discretized token sequences in ASR tasks, which includes de-duplication and sub-word modeling to enhance the input sequence. |
Xuankai Chang; Brian Yan; Yuya Fujita; Takashi Maekaku; Shinji Watanabe; | arxiv-cs.SD | 2023-05-29 |
947 | HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. |
Florian Mai; Juan Zuluaga-Gomez; Titouan Parcollet; Petr Motlicek; | arxiv-cs.CL | 2023-05-29 |
948 | CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a simple-to-follow recipe aligned to the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish). |
Juan Zuluaga-Gomez; Sara Ahmed; Danielius Visockas; Cem Subakan; | arxiv-cs.CL | 2023-05-29 |
949 | Improving Textless Spoken Language Understanding with Discrete Units As Intermediate Target Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, inspired by the content-disentangled discrete units from self-supervised speech models, we proposed to use discrete units as intermediate guidance to improve textless SLU performance. |
Guan-Wei Wu; Guan-Ting Lin; Shang-Wen Li; Hung-yi Lee; | arxiv-cs.CL | 2023-05-29 |
950 | Speech and Noise Dual-stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. |
HAOYU LU et. al. | arxiv-cs.SD | 2023-05-28 |
951 | Retraining-free Customized ASR for Enharmonic Words Based on A Named-Entity-Aware Model and Phoneme Similarity Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since such NE words tend to be important keywords, ASR easily loses user trust if it misrecognizes them. To solve these problems, this paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation. |
Yui Sudo; Kazuya Hata; Kazuhiro Nakadai; | arxiv-cs.SD | 2023-05-28 |
952 | Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on The False Alarms in Automated Speech Recognition Testing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we investigate false alarm occurrences in five popular ASR systems using synthetic audio generated from four TTS systems and human audio obtained from two commonly used datasets. |
JULIA KAIWEN LAU et. al. | arxiv-cs.SE | 2023-05-27 |
953 | DisfluencyFixer: A Tool to Enhance Language Learning Through Speech To Speech Disfluency Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents DisfluencyFixer, a tool that performs speech-to-speech disfluency correction in English and Hindi using a pipeline of Automatic Speech Recognition (ASR), Disfluency Correction (DC) and Text-To-Speech (TTS) models. |
Vineet Bhat; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2023-05-26 |
954 | INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) systems have attained unprecedented performance with large speech models pre-trained based on self-supervised speech representation learning. … |
Eunseop Yoon; Hee Suk Yoon; John Harvill; M. Hasegawa-Johnson; C. Yoo; | Annual Meeting of the Association for Computational … | 2023-05-25 |
955 | Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with A Sidecar Separator IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one, by inserting a Sidecar separator into the frozen well-trained ASR model. Extending on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2023-05-25 |
956 | Svarah: Evaluating English ASR Systems on Indian Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, Indian speakers find a very poor representation in existing English ASR benchmarks such as LibriSpeech, Switchboard, Speech Accent Archive, etc. In this work, we address this gap by creating Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 geographic locations throughout India, resulting in a diverse range of accents. |
TAHIR JAVED et. al. | arxiv-cs.CL | 2023-05-25 |
957 | Iteratively Improving Speech Recognition and Voice Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel iterative way of improving both the ASR and VC models. |
Mayank Kumar Singh; Naoya Takahashi; Onoe Naoyuki; | arxiv-cs.SD | 2023-05-24 |
958 | InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods pay less attention to the interaction of local and global features, and their serial architectures are too rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. |
ZHI-HAO LAI et. al. | arxiv-cs.CL | 2023-05-24 |
959 | Evaluating OpenAI’s Whisper ASR for Punctuation Prediction and Topic Modeling of Life Histories of The Museum of The Person Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This chapter presents the first study on the performance of Whisper for punctuation prediction in the Portuguese language. |
LUCAS RAFAEL STEFANEL GRIS et. al. | arxiv-cs.CL | 2023-05-23 |
960 | SE-Bridge: Speech Enhancement with Consistent Brownian Bridge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SE-Bridge, a novel method for speech enhancement (SE). |
Zhibin Qiu; Mengfan Fu; Fuchun Sun; Gulila Altenbek; Hao Huang; | arxiv-cs.SD | 2023-05-23 |
961 | Personalized Predictive ASR for Latency Reduction in Voice Assistants Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. |
Andreas Schwarz; Di He; Maarten Van Segbroeck; Mohammed Hethnawi; Ariya Rastrow; | arxiv-cs.CL | 2023-05-23 |
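The prefetching flow described in the highlight can be sketched as follows. The predictor and response functions are hypothetical stand-ins (in the paper, the completion is predicted by a personalized model); the class and method names are illustrative.

```python
class PrefetchingAssistant:
    """While the user is still speaking, predict the full utterance from the
    partial ASR hypothesis and precompute the response; if the final hypothesis
    matches the prediction, serve the cached response with no extra latency."""

    def __init__(self, predict_full_utterance, compute_response):
        self.predict = predict_full_utterance   # partial text -> predicted full text
        self.compute = compute_response         # full text -> response
        self.cache = {}

    def on_partial_hypothesis(self, partial):
        predicted = self.predict(partial)
        self.cache[predicted] = self.compute(predicted)  # prefetch

    def on_final_hypothesis(self, final):
        if final in self.cache:
            return self.cache[final], True      # prediction was right: cache hit
        return self.compute(final), False       # fall back to the normal path
```

A mispredicted completion only wastes the prefetched computation; correctness is preserved because the cached response is served solely on an exact match with the final hypothesis.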
962 | Text Generation with Speech Synthesis for ASR Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work … |
ZHUANGQUN HUANG et. al. | ArXiv | 2023-05-22 |
963 | GNCformer Enhanced Self-attention for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, an Enhanced Self-Attention (ESA) mechanism is put forward for robust feature extraction. The proposed ESA integrates recursive gated convolution with self-attention: the former captures multi-order feature interactions, while the latter performs global feature extraction. The location best suited for inserting the ESA is also explored. The ESA is embedded into the encoder layers of the Transformer network for automatic speech recognition (ASR) tasks, and the resulting model is named GNCformer. |
J. Li; Z. Duan; S. Li; X. Yu; G. Yang; | arxiv-cs.SD | 2023-05-22 |
964 | Self-supervised Representations in Speech-based Depression Detection IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). |
Wen Wu; Chao Zhang; Philip C. Woodland; | arxiv-cs.CL | 2023-05-20 |
965 | A Comparative Study on E-Branchformer Vs Conformer in Speech Recognition, Translation, and Understanding Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models. |
YIFAN PENG et. al. | arxiv-cs.CL | 2023-05-18 |
966 | A Lexical-aware Non-autoregressive Transformer-based ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A series of experiments are conducted on the AISHELL-1, CSJ, and TEDLIUM 2 datasets. |
Chong-En Lin; Kuan-Yu Chen; | arxiv-cs.CL | 2023-05-18 |
967 | FunASR: A Fundamental End-to-End Speech Recognition Toolkit IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. |
ZHIFU GAO et. al. | arxiv-cs.SD | 2023-05-18 |
968 | AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present AVFormer, a simple method for augmenting audio-only models with visual information while performing lightweight domain adaptation. |
Paul Hongsuck Seo; Arsha Nagrani; Cordelia Schmid; | cvpr | 2023-05-17 |
969 | mmMIC: Multi-modal Speech Recognition Based on mmWave Radar IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With the proliferation of voice assistants, microphone-based speech recognition technology usually cannot achieve good performance in the situation of multiple sound sources and … |
LONG FAN et. al. | IEEE INFOCOM 2023 – IEEE Conference on Computer … | 2023-05-17 |
970 | OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present OOD-Speech, the first out-of-distribution (OOD) benchmarking dataset for Bengali automatic speech recognition (ASR). Being one of the most spoken languages globally, … |
FAZLE RAKIB et. al. | ArXiv | 2023-05-15 |
971 | Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. |
Weiwei Lin; Chenhang He; Man-Wai Mak; Youzhi Tu; | arxiv-cs.SD | 2023-05-14 |
972 | Investigating The Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes. |
Emma O’Neill; Julie Carson-Berndsen; | arxiv-cs.CL | 2023-05-12 |
973 | Development of Low-Latency and Real-Time Filipino Children Automatic Speech Recognition System Using Deep Neural Network Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: As no studies have yet used real-time speakers in a Filipino children's automatic speech recognition system, this study aims to use the available children … |
Bonry Dorado; Alonica R. Villanueva; | 2023 11th International Symposium on Digital Forensics and … | 2023-05-11 |
974 | Multi-Temporal Lip-Audio Memory for Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement insufficient information of lip movements. |
Jeong Hun Yeo; Minsu Kim; Yong Man Ro; | arxiv-cs.CV | 2023-05-08 |
975 | Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. |
YUN TANG et. al. | arxiv-cs.CL | 2023-05-04 |
976 | Edge Computing Solutions Supporting Voice Recognition Services for Speakers with Dysarthria Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In the framework of Automatic Speech Recognition (ASR), the synergism between edge computing and artificial intelligence has led to the development of intelligent objects that … |
Davide Mulfari; Lorenzo Carnevale; A. Galletta; M. Villari; | 2023 IEEE/ACM 23rd International Symposium on Cluster, … | 2023-05-01 |
977 | TrojanModel: A Practical Trojan Attack Against Automatic Speech Recognition Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While deep learning techniques have achieved great success in modern digital products, researchers have shown that deep learning models are susceptible to Trojan attacks. In a … |
W. Zong; Yang-Wai Chow; Willy Susilo; Kien Do; S. Venkatesh; | 2023 IEEE Symposium on Security and Privacy (SP) | 2023-05-01 |
978 | Building A Non-native Speech Corpus Featuring Chinese-English Bilingual Children: Compilation and Rationale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a non-native speech corpus consisting of narratives from fifty 5- to 6-year-old Chinese-English children. |
Hiuchung Hung; Andreas Maier; Thorsten Piske; | arxiv-cs.CL | 2023-04-30 |
979 | Enhancing Multilingual Speech Recognition in Air Traffic Control By Sentence-level Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a two-stage multilingual ASR framework. |
Peng Fan; Dongyue Guo; JianWei Zhang; Bo Yang; Yi Lin; | arxiv-cs.SD | 2023-04-29 |
980 | HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit Bert for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose HuBERT-AGG, a novel method that learns noise-invariant SSL representations for robust speech recognition by distilling aggregated layer-wise representations. |
W. Wang; Y. Qian; | icassp | 2023-04-27 |
981 | DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations did not show substantial improvements. To deal with this problem, in this paper, we propose data2vec-SG (Speech Generation), which is a teacher-student learning framework that addresses speech generation tasks. |
H. WANG et. al. | icassp | 2023-04-27 |
982 | Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel modeling framework for effective training of end-to-end automatic speech recognition (ASR) models on various sources of data from diverse domains: speech paired with clean ground truth transcripts, speech with noisy pseudo transcripts from semi-supervised decodes and unpaired text-only data. |
T. Fukuda; S. Thomas; | icassp | 2023-04-27 |
983 | Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores a series of approaches to integrate domain adapted Self-Supervised Learning (SSL) pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain adapted wav2vec2.0 models. |
S. HU et. al. | icassp | 2023-04-27 |
984 | Conversation-Oriented ASR with Multi-Look-Ahead CBS Architecture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In streaming ASR, high accuracy is assured by attending to look-ahead frames, which increases latency. To tackle this trade-off, we propose a multi-latency streaming ASR that achieves high accuracy with zero look-ahead. |
H. ZHAO et. al. | icassp | 2023-04-27 |
985 | Continual Learning for On-Device Speech Recognition Using Disentangled Conformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel compute-efficient continual learning algorithm called DisentangledCL. This algorithm produces ASR models consisting of a frozen ‘core’ network for general-purpose use and several tunable ‘augment’ networks for speaker-specific tuning. |
A. DIWAN et. al. | icassp | 2023-04-27 |
986 | Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a speech summarization system that extends E2E summarization from the 100-second limit of the conventional method to up to 10 minutes (i.e., the duration of typical instructional videos on YouTube). |
T. KANO et. al. | icassp | 2023-04-27 |
987 | The Edinburgh International Accents of English Corpus: Towards The Democratization of English ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). |
R. SANABRIA et. al. | icassp | 2023-04-27 |
988 | Stabilising and Accelerating Light Gated Recurrent Units for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the unbounded nature of the rectified linear unit on its candidate recurrent gate induces gradient explosion, disrupting the training process and preventing its application to medium-to-large ASR datasets. In this paper, we theoretically and empirically derive the necessary conditions for its stability, as well as engineering mechanisms that speed up its training time by a factor of five, hence introducing a novel version of this architecture named SLi-GRU. |
A. Moumen; T. Parcollet; | icassp | 2023-04-27 |
989 | Joint Unsupervised and Supervised Learning for Context-Aware Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, additional text labels are needed to train the model to recognize speech, and acquiring them is costly. To overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. |
J. PARK et. al. | icassp | 2023-04-27 |
990 | Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The research community has produced many successful self-supervised speech representation learning methods over the past few years. |
A. ELKAHKY et. al. | icassp | 2023-04-27 |
991 | Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. |
S. Sadhu; H. Hermansky; | icassp | 2023-04-27 |
992 | Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes Virtuoso, a massively multilingual speech–text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. |
T. SAEKI et. al. | icassp | 2023-04-27 |
993 | A Sidecar Separator Can Convert A Single-Talker Speech Recognition System to A Multi-Talker One IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains … |
L. MENG et. al. | icassp | 2023-04-27 |
994 | VarArray Meets T-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. |
N. KANDA et. al. | icassp | 2023-04-27 |
995 | Enhancing Unsupervised Speech Recognition with Diffusion GANs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We enhance the vanilla adversarial training method for unsupervised Automatic Speech Recognition (ASR) with a diffusion GAN. |
X. Wu; | icassp | 2023-04-27 |
996 | Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. |
F. WU et. al. | icassp | 2023-04-27 |
997 | Structured State Space Decoder for Speech Recognition and Synthesis IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we apply S4 as a decoder for ASR and text-to-speech (TTS) tasks, comparing it with the Transformer decoder. |
K. Miyazaki; M. Murata; T. Koriyama; | icassp | 2023-04-27 |
998 | Domain Adaptation with External Off-Policy Acoustic Catalogs for Scalable Contextual End-to-End Automated Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy generated text-to-speech key-value stores, to allow for flexible post-training adaptation to new data distributions. |
D. M. Chan; S. Ghosh; A. Rastrow; B. Hoffmeister; | icassp | 2023-04-27 |
999 | Improving Speech-to-Speech Translation Through Unlabeled Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an effective way to utilize the massive existing unlabeled text from different languages to create a large amount of S2ST data to improve S2ST performance by applying various acoustic effects to the generated synthetic data. |
X. -P. NGUYEN et. al. | icassp | 2023-04-27 |
1000 | Adaptive Multi-Corpora Language Model Training for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel adaptive multi-corpora training algorithm that dynamically learns and adjusts the sampling probability of each corpus along the training process. |
Y. Ma; Z. Liu; X. Zhang; | icassp | 2023-04-27 |
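The last entry describes dynamically adjusting the sampling probability of each training corpus. As a simplified, hypothetical sketch of that general idea (not the paper's algorithm — here we assume probabilities are set proportional to each corpus's current dev loss, so under-fit corpora get sampled more often):

```python
import random

def update_sampling_probs(dev_losses, temperature=1.0):
    """Turn per-corpus dev losses into sampling probabilities:
    corpora with higher loss (less well learned) are sampled more often."""
    weights = {c: loss / temperature for c, loss in dev_losses.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def sample_corpus(probs, rng=random):
    """Draw one corpus name according to the current probabilities."""
    corpora, p = zip(*probs.items())
    return rng.choices(corpora, weights=p, k=1)[0]

# Example: three hypothetical LM corpora with different current dev losses.
probs = update_sampling_probs({"news": 2.1, "social": 3.4, "voice_cmds": 1.5})
chosen = sample_corpus(probs)
```

In a training loop, `update_sampling_probs` would be called periodically from fresh dev-set evaluations, and each minibatch source would be drawn via `sample_corpus`.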