Paper Digest: Recent Papers on Speech Recognition
The Paper Digest Team extracted all recent speech recognition papers on our radar and generated highlight sentences for them. The results are sorted by relevance and date. In addition to this ‘static’ page, we also provide a real-time version of this article, which has broader coverage and is continuously updated to include the most recent work on this topic.
Since 2018, Paper Digest has built a foundation of data spanning decades of conferences, journals, and research topics. The platform features a daily digest service that sifts through tens of thousands of new papers, clinical trials, news articles, and community posts, filtering the noise to highlight what matters most to specific interests. Beyond daily updates, dozens of built-in research tools streamline the academic workflow, supporting efficient reading and writing, comprehensive literature reviews, and automated research report generation.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: Recent Papers on Speech Recognition
| # | Paper | Author(s) | Source | Date |
|---|---|---|---|---|
| 1 | Improving Low-resource ASR Using Bilingual Fine-tuning with Language Identification: A Cross-linguistic Evaluation Highlight: We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. | REIHANEH AMOOIE et al. | arxiv-cs.CL | 2026-06-16 |
| 2 | When Multiple Scripts Matter: Evaluating ASR in Clinical Settings Highlight: Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. | JEAN SEO et al. | arxiv-cs.CL | 2026-06-16 |
| 3 | A Hybrid RWHISYMP Speech-to-Text Noise Suppression Model: Integration of The Whisper Base Model, RNNoise, and SympSpell Algorithms Highlight: This study introduces the RWhiSymp, a hybrid speech-to-text noise suppression model that integrates three components: RNNoise for noise suppression, Whisper Base Model for ASR, and SymSpell for spelling correction. | Mariann F. Bragas; Laurence D. Ganda; Leonila R. Juanatas MIT; Charisse S. Ronquillo MIT; | American Journal of Smart Technology and Solutions | 2026-06-16 |
| 4 | Are You Speaking My Languages? On Spoken Language Adherence in Multimodal LLMs Highlight: To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. | Hyungwon Kim; Kandarp Joshi; Lillian Zhou; Pavel Golik; Petar Aleksic; | arxiv-cs.CL | 2026-06-15 |
| 5 | ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition Highlight: To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. | ANDREI-MARIUS AVRAM et al. | arxiv-cs.CL | 2026-06-14 |
| 6 | Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR Highlight: Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods. | Henri-Leon Kordt; Theresa Pekarek Rosin; Jae Hee Lee; Stefan Wermter; | arxiv-cs.CL | 2026-06-12 |
| 7 | A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation Highlight: We present a practical evaluation method for long-form SimulS2ST. | Yulin Xue; Siqi Ouyang; Lei Li; | arxiv-cs.CL | 2026-06-12 |
| 8 | MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition Highlight: In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. | Theresa Pekarek Rosin; Matthias Kerzel; Stefan Wermter; | arxiv-cs.CL | 2026-06-12 |
| 9 | Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models Highlight: We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. | Ravi Ranjan; Utkarsh Grover; Xiaomin Lin; Agoritsa Polyzou; | arxiv-cs.SD | 2026-06-12 |
| 10 | Towards Personalized Federated Learning for Dysarthric Speech Recognition Highlight: To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. | Tao Zhong; Mengzhe Geng; Jiajun Deng; Shujie Hu; Xunying Liu; | arxiv-cs.SD | 2026-06-11 |
| 11 | Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations Highlight: In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. | XINXIN LI et al. | arxiv-cs.CL | 2026-06-11 |
| 12 | PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation Highlight: We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. | GIANG SON NGUYEN et al. | arxiv-cs.CL | 2026-06-11 |
| 13 | Rethinking Text-based Extractive Speech Summarization in Noisy ASR Settings for Low-resource Language | Priyanjana Chowdhury; Sanghamitra Nath; Utpal Sharma; | Multimedia Tools and Applications | 2026-06-10 |
| 14 | Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models Highlight: In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. | Catherine Bao; Maneesha Rani Saha; Neal Patwari; | arxiv-cs.CL | 2026-06-10 |
| 15 | Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling Highlight: To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. | Guodong Lin; Ziqi Chen; Yuxiang Fu; Ke Li; Wei-Qiang Zhang; | arxiv-cs.SD | 2026-06-09 |
| 16 | AuRA: Internalizing Audio Understanding Into LLMs As LoRA Highlight: While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. | BO CHENG et al. | arxiv-cs.LG | 2026-06-09 |
| 17 | Speaker Group Encoding in Self-supervised Speech Recognition Models Highlight: We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). | Felix Herron; Solange Rossato; Alexandre Allauzen; Benoit Favre; François Portet; | arxiv-cs.CL | 2026-06-09 |
| 18 | Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation Highlight: In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. | XUANCHEN LI et al. | arxiv-cs.SD | 2026-06-08 |
| 19 | Is Text All You Need? Text As A Universal Information Bottleneck for Speech LLMs Highlight: In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM’s input embedding manifold with an architectural convex-hull constraint. | MING-HAO HSU et al. | arxiv-cs.CL | 2026-06-08 |
| 20 | TRADE: Transducer-Augmented Decoder for Speech LLM Highlight: We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM’s hidden states directly as the prediction network, coupling frame-synchronous acoustic alignment with the LLM’s linguistic reasoning. | Yun Tang; Shanil Puri; Shinji Watanabe; Subhabrata Mukherjee; | arxiv-cs.CL | 2026-06-07 |
| 21 | Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition Highlight: Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. | TUNG X. NGUYEN et al. | arxiv-cs.CL | 2026-06-05 |
| 22 | Enhancing Multilingual Automatic Speech Recognition for Low-resource Code-switched Languages: A Scalable Data Augmentation Strategy Highlight: The proposed system aims to generate natural-sounding Egyptian Colloquial Arabic (ECA) speech and improve Automatic Speech Recognition (ASR) performance, particularly for code-switched utterances. To achieve this, we collected data from transcribed online videos to fine-tune a text-to-speech (TTS) model (XTTSv2) for ECA. | Mohab Mostafa Morsi; Radwa Fathalla; Sherif Abdou; Mohamed Waleed Fakhr; | PeerJ Computer Science | 2026-06-04 |
| 23 | Hearing The Unspoken: Language Model Priors for Acoustic Adversarial Attacks Highlight: Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines. | Jiani Xie; Andrew C. Cullen; Paul Montague; Benjamin I. P. Rubinstein; | arxiv-cs.LG | 2026-06-04 |
| 24 | FiLM-Based Speaker Conditioning of A SpeechLLM for Pathological Speech Recognition Highlight: We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. | Fernando López; Santosh Kesiraju; Jordi Luque; | arxiv-cs.CL | 2026-06-04 |
| 25 | Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition Highlight: In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. | YIFAN LIAO et al. | arxiv-cs.SD | 2026-06-04 |
| 26 | Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference Highlight: In this work, we propose to apply soft token assignment only during downstream inference. | Kentaro Onda; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2026-06-04 |
| 27 | Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs Highlight: In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. | Gio Paik; Hyunseo Shin; Soungmin Lee; | arxiv-cs.CL | 2026-06-04 |
| 28 | Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers Highlight: We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. | YACOUBA KALOGA et al. | arxiv-cs.LG | 2026-06-03 |
| 29 | Efficient ASR Training with Conversations That Never Happened Highlight: We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. | Máté Gedeon; Péter Mihajlik; | arxiv-cs.CL | 2026-06-02 |
| 30 | BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for The Balti Language Highlight: We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. | Muhammad Ali; | arxiv-cs.CL | 2026-06-02 |
| 31 | WAXAL-NET: Finetuned Edge ASR Across 19 African Languages Highlight: We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. | VICTOR TOLULOPE OLUFEMI et al. | arxiv-cs.CL | 2026-06-01 |
| 32 | SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation Highlight: We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. | Priyaranjan Pattnayak; | arxiv-cs.CL | 2026-06-01 |
| 33 | MURMUR: An Efficient Inference System for Long-Form ASR Highlight: Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. | Wei-Tzu Lee; Keisuke Kamahori; Baris Kasikci; | arxiv-cs.LG | 2026-05-31 |
| 34 | Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition Highlight: We introduce an event-driven SpeechMamba with FATReLU activation, achieving over 60% activation sparsity with less than 1% accuracy degradation on LibriSpeech. | Tauseef Ahmed; Tao Sun; Jeronimo Castrillon; Kanishkan Vadivel; Guangzhi Tang; | arxiv-cs.NE | 2026-05-31 |
| 35 | PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects Highlight: Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess ‘native-level’ speech comprehension across 110 linguistic variants. | SICHENG YANG et al. | arxiv-cs.CL | 2026-05-31 |
| 36 | A Deep Learning Framework for Speech Emotion Recognition: A Gender-Aware Hierarchical Pipeline with Optimized 18-Layer Convolutional Neural Network Highlight: Real-world acoustic signals are heavily influenced by environmental noise, speaker idiosyncrasies, and physical variability across genders. This paper introduces a high-performance, structurally optimized hierarchical framework that addresses these limitations through three primary contributions: (1) a dense 182-feature extraction pipeline unifying spectral, linear predictive, dynamic energy, prosodic, and statistical shape profiles; (2) an early-stage, gender-aware hierarchical pipeline driven by a Gender Recognition (GR) circuit that splits the processing stream based on fundamental frequency distribution to eliminate cross-gender acoustic overlaps; and (3) a customized 18-layer Deep Convolutional Neural Network (CNN) integrated with meta-heuristic hyper-parameter optimization. | Savita Jain; | International Journal For Multidisciplinary Research | 2026-05-30 |
| 37 | Speech Recognition Systems Using Deep Learning Accuracy Vs. Efficiency Trade-offs Highlight: This paper explores the balance between recognition accuracy and computational efficiency in deep learning–based speech recognition systems, analyzing major architectures, optimization techniques, performance challenges, and emerging approaches for scalable, efficient, and intelligent ASR deployment across modern digital ecosystems. | | Dandao Xuebao/Journal of Ballistics | 2026-05-29 |
| 38 | SALSA: Speech Aware LLM Adaptation Via Learned Steering Activation Vectors Highlight: We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. | YEKATERINA YEGOROVA et al. | arxiv-cs.CL | 2026-05-29 |
| 39 | Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation Highlight: Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. | ZIXUAN JIANG et al. | arxiv-cs.AI | 2026-05-28 |
| 40 | Data-Efficient On-Policy Distillation for Automatic Speech Recognition Highlight: We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. | Yu Lin; Yiming Wang; Runyuan Cai; Xiaodong Zeng; | arxiv-cs.AI | 2026-05-27 |
| 41 | Breaking The Script Barrier: Enabling Automatic Alignment for PoS-based ASR Error Analysis in Non-Latin Scripts Highlight: However, existing alignment tools are often unreliable for languages written in non-Latin scripts. In this work, we address this gap by proposing a robust, automated, language-agnostic alignment mechanism applicable across ASR architectures and across languages written in both Latin and non-Latin scripts. | Prasenjit K Mudi; Dahlia Devapriya; Sheetal Kalyani; | arxiv-cs.CL | 2026-05-27 |
| 42 | Building Community-Centred NLP Resources for Puno Quechua Highlight: We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings for scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models. | Elwin Huaman; Adrian Gamarra Lafuente; Johanna Cordova; Anna Korhonen; | arxiv-cs.CL | 2026-05-27 |
| 43 | KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs Highlight: Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. | Haechan Kim; Seungjun Chung; Inkyu Park; Jihoo Lee; Jonghyun Lee; | arxiv-cs.CL | 2026-05-27 |
| 44 | Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese Highlight: In this work, motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. | Nghia Hieu Nguyen; Quan Ngoc Hoang; Long Hoang Huu Nguyen; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen; | arxiv-cs.CL | 2026-05-26 |
| 45 | PashtoTTS-Bench: Automated Screening for Low-resource Non-Latin-script Text-to-speech Highlight: A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. | Hanif Rahman; | arxiv-cs.CL | 2026-05-26 |
| 46 | How to Evaluate Speech Translation with Source-Aware Neural MT Metrics Highlight: In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. | Mauro Cettolo; Marco Gaido; Matteo Negri; Sara Papi; Luisa Bentivogli; | Computational Linguistics | 2026-05-26 |
| 47 | TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition Highlight: We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via \rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. | XINYU WANG et al. | arxiv-cs.CL | 2026-05-26 |
| 48 | FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions Highlight: In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. | FRANCISCO TEIXEIRA et al. | arxiv-cs.CL | 2026-05-26 |
| 49 | Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems Highlight: In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. | YIZHOU PENG et al. | arxiv-cs.CL | 2026-05-24 |
| 50 | Enhancing Arabic Dialectical Speech Recognition with Large Language Models Highlight: In this study, we present a comparative evaluation of state-of-the-art ASR models, Whisper (medium and large) and Wav2Vec2, applied to five Arabic varieties: Egyptian, Hijazi, Khaliji, Najdi, and MSA. | Rehab Albeladi; | Islamic University Journal of Applied Sciences | 2026-05-23 |
| 51 | How Can Deep Learning Technique Be Leveraged to Enhance ASR Systems for Dysarthric Speech Recognition? Highlight: This research aims to expand the available data and explore novel approaches in developing robust ASR systems tailored for dysarthric speech. | Leon Starr; | New Vistas | 2026-05-21 |
| 52 | Mega-ASR: Towards In-the-wild^2 Speech Recognition Via Scaling Up Real-world Acoustic Simulation Highlight: We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. | ZHIFEI XIE et al. | arxiv-cs.SD | 2026-05-19 |
| 53 | Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian Highlight: Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. | Yun Hao; Reihaneh Amooie; Wietse de Vries; Rik van Noord; Martijn Wieling; | arxiv-cs.CL | 2026-05-19 |
| 54 | FormalASR: End-to-End Spoken Chinese to Formal Text Highlight: We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. | WANYI NING et al. | arxiv-cs.CL | 2026-05-18 |
| 55 | Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German Highlight: We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. | Sajjad Abdoli; Ghassan Al-Sumaidaee; Clayton W. Taylor; Ahmed Rashad; | arxiv-cs.CL | 2026-05-18 |
| 56 | PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions Highlight: In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. | Sicheng Jin; Dipankar Srirag; Aditya Joshi; | arxiv-cs.CL | 2026-05-18 |
| 57 | A Context Aware Igbo Language Voice Assistant Using Natural Language Processing Tools Highlight: The aim of this research is to develop a prototype context-aware voice assistant system for the Igbo language using natural language processing tools. | Emmanuel Nwabueze Ekwonwune; Leticia E. Elebiri; Abraham Oghenemega Ovwonuri; Chibundu D. Asiegbu; Netochukwu Nwokafor; | Scholarly Journal of Science and Technology Research and … | 2026-05-17 |
| 58 | Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages Via Knowledge Distillation Highlight: SBPN is released in two sizes: SBPN-Base (120 M parameters) and SBPN-Large (600 M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region. | Sewade Ogun; | arxiv-cs.CL | 2026-05-17 |
| 59 | Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades Highlight: We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. | Donghyuk Jung; Youngwon Choi; | arxiv-cs.CL | 2026-05-17 |
| 60 | JSPG: Dynamic Dictionary Filtering Via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR Highlight: Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). | Shilin Zhou; Zhenghua Li; | arxiv-cs.CL | 2026-05-16 |
| 61 | A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR Highlight: The main contribution of this paper is in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system. | Sunil Kumar Kopparapu; | arxiv-cs.CL | 2026-05-14 |
| 62 | JODAL: Joint Domain Adversarial Learning for TTS-Augmented Automatic Speech Recognition Highlight: Although we expect that training with both real and synthetic datasets together can mitigate the data scarcity issue while maintaining ASR performance, we observe that the distributional discrepancy between real and synthetic data introduces a domain inconsistency problem, which negatively impacts ASR performance. To mitigate this, we propose a Joint Domain Adversarial Learning (JODAL) method using real and TTS-generated samples that mitigates domain mismatches through domain adversarial training. | June-Woo Kim; Ho-Young Jung; | Mathematics | 2026-05-14 |
| 63 | Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR Highlight: The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions, or yields inexpressive prompts by leveraging only textual features, ignoring audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. |
Ryo Magoshi; Takashi Maekaku; Yusuke Shinohara; | arxiv-cs.SD | 2026-05-14 |
| 64 | Automatic Speech Recognition in Healthcare in The Post-LLM Era: A Scoping Review Highlight: While traditional ASR focused on transcription fidelity, LLM-based systems extend this capability to intelligently reason, summarize, and structure clinical data. This scoping review maps the emerging landscape of LLM-based ASR in healthcare, examining its applications, technical foundations, evaluation practices, and reported challenges. |
Maram Alabbad; Waad Alhoshan; | Healthcare | 2026-05-13 |
| 65 | Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition Highlight: Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. |
Kush Juvekar; Kavya Manohar; Aditya Srinivas Menon; Arghya Bhattacharya; Kumarmanas Nethil; | arxiv-cs.CL | 2026-05-13 |
| 66 | Automatic Speech Recognition and Large Language Models for Multilingual Pathology Report Generation: Proof-of-Concept Study Highlight: Methods: We conducted a controlled proof-of-concept study using 125 simulated mixed Chinese-English pathology gross examination audio recordings created by physicians or pathologists. |
KUAN-HSUN LIN et al. | JMIR Formative Research | 2026-05-13 |
| 67 | Responsible Benchmarking of Fairness for Automatic Speech Recognition Highlight: We find that evaluating fairness based on single heterogeneous SG’s, such as they are defined in fairness benchmarks, can lead to misidentifying which SG’s are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possible of the intersectionality of as many demographic variables as are available in the metadata of fairness corpora in order to tease out such spurious correlations. |
Felix Herron; Ange Richard; François Portet; Alexandre Allauzen; Solange Rossato; | arxiv-cs.CL | 2026-05-11 |
| 68 | Meri Awaaz Hi Pehchan Hai: A Survey on Multilingual Podcast Processing System Highlight: The survey reviews multilingual podcast processing and summarization techniques, including Automatic Speech Recognition (ASR), Natural Language Processing (NLP), extractive and abstractive summarization, machine translation, and Text-to-Speech systems. |
Gayana V; Gagana Poojari; Bhagyashree Bhagyashree; Arpita Rathod; Dr. Vijayalaxmi Mekali; | International Scientific Journal of Engineering and … | 2026-05-10 |
| 69 | Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness Highlight: We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. |
Erfan Loweimi; Sofia de la Fuente Garcia; Samira Loveymi; Hadi Daneshvar; Saturnino Luz; | arxiv-cs.CL | 2026-05-10 |
| 70 | AI-Based Real-Time Speech-to-Sign Language Translation System for Assistive Communication Highlight: This paper presents the design and development of an Artificial Intelligence (AI)-based, real-time Speech-to-Indian Sign Language (ISL) Translation System that bridges the communication gap between the hearing population and the deaf and hard-of-hearing community. |
Y. Dayanand Kumar; Bhukya Siddhu; | International Journal of Creative and Open Research in … | 2026-05-09 |
| 71 | WorldSpeech: A Multilingual Speech Corpus from Around The World Highlight: To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. |
Antonis Asonitis; Luca A. Lanzendörfer; Frédéric Berdoz; Roger Wattenhofer; | arxiv-cs.CL | 2026-05-09 |
| 72 | WASIL: In-the-Wild Arabic Spoken Interactions with LLMs Highlight: We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. |
ZIEN SHEIKH ALI et al. | arxiv-cs.SD | 2026-05-09 |
| 73 | Real Time Translation and Emotional Intelligent Voice Model Highlight: Current speech-to-speech translation systems often struggle to capture the original speaker’s vocal identity and emotional tone, resulting in robotic and unnatural conversations. To address this, we introduce the Real Time Translation and Emotional Intelligent Voice Model. |
Nooh C. H.; | International Journal for Research in Applied Science and … | 2026-05-08 |
| 74 | Beyond Single Ground Truth: Reference Monism As Epistemic Injustice in ASR Evaluation Highlight: We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer. |
ANNA SEO GYEONG CHOI et al. | arxiv-cs.CL | 2026-05-07 |
| 75 | Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization Highlight: We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks. |
MOHAMMED AMAN BHUIYAN et al. | arxiv-cs.SD | 2026-05-06 |
| 76 | Hearing Without Noticing? Attention-Aware Stealthy Black-box Adversarial Audio Attacks Highlight: In this paper, we introduce the first music carrier selection algorithm and an attention-aware stealthiness loss function to generate stealthy AEs. |
Cheng'an Wei; Yue Zhao; Kai Chen; | icml | 2026-05-05 |
| 77 | A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition Applied on French Language Highlight: In this paper, we conduct a qualitative study on the French language that investigates the impact of subword tokenization algorithms and self-supervised learning models from different linguistic and acoustic perspectives, using a comprehensive set of evaluation metrics. |
Thibault Bañeras-Roux; Mickael Rouvier; Jane Wottawa; Richard Dufour; | arxiv-cs.CL | 2026-05-05 |
| 78 | MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts Highlight: MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. |
Yuxuan Lou; Kai Yang; Yang You; | icml | 2026-05-05 |
| 79 | Reducing Prompt Sensitivity in LLM-Based Speech Recognition Through Learnable Projection Highlight: This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. |
S. Burdisso; | icassp | 2026-05-04 |
| 80 | Synthesized Data Selection Via Score Distribution Matching for Te Reo Māori Automatic Speech Recognition Highlight: Fine-tuning pre-trained Automatic Speech Recognition (ASR) models with synthesized datasets leads to suboptimal performance in low-resource ASR, due to domain mismatches between true and synthesized speech data. To address this issue, we propose a novel score-distribution-matching method for data selection, which filters synthesized data by explicitly aligning its score distribution with the prior distribution of the true data score. |
Z. Wang; F. Hou; R. Wang; | icassp | 2026-05-04 |
| 81 | Knowledge Distillation Via Generative Reconstruction Pathways for End-to-End Automatic Speech Recognition Highlight: In this work, we propose a generative reconstruction pathway knowledge distillation (GRP-KD) framework that leverages flow matching and diffusion processes to achieve richer knowledge transfer. |
S.-H. Jeong; D.-H. Kim; J.-H. Chang; | icassp | 2026-05-04 |
| 82 | Advancing LLM-Based Multi-Channel Multi-Speaker Speech Recognition with Global Cross-Channel Attention and Sentence-Ordered First-In First-Out Serialized Output Training Highlight: While multi-microphone input outperforms single-microphone setups in multi-speaker scenarios, multi-channel multi-speaker ASR remains challenging due to data scarcity, complex acoustic conditions, and cross-channel dependency modeling limitations. To address these challenges, we propose a novel framework integrating a Large Language Model (LLM) into multi-channel multi-speaker ASR for the first time. |
G. Wan; | icassp | 2026-05-04 |
| 83 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using In-the-wild Data Highlight: To address the challenges, we introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that uses Large Language Models (LLMs) to correct pseudo-labels generated on in-the-wild data. |
W. Ding; F. Qian; | icassp | 2026-05-04 |
| 84 | Chunkwise Aligners for Streaming Speech Recognition Highlight: We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). |
W. S. Teo; T. Moriya; M. Mimura; | icassp | 2026-05-04 |
| 85 | SED: Structural Entropy Based Speech Discretization for Discrete Token-Based ASR Highlight: In this paper, we propose SED, a new Structural Entropy-based Speech Discretization method that models speech features as graph nodes and performs adaptive clustering by minimizing 2D Structural Entropy. |
L. Dong; W. Wang; Y. Xiang; Y. Xian; S. Gao; | icassp | 2026-05-04 |
| 86 | Tokenchain: A Discrete Speech Chain Via Semantic Token Modeling Highlight: We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. |
M. Wang; S. Nakamura; | icassp | 2026-05-04 |
| 87 | Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition Highlight: We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). |
M. KOCOUR et al. | icassp | 2026-05-04 |
| 88 | MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech Highlight: Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. |
J. Mai; | icassp | 2026-05-04 |
| 89 | BEST-RQ-based Self-Supervised Learning for Whisper Domain Adaptation Highlight: We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder with unlabeled data. |
R. Bagat; I. Illina; E. Vincent; | icassp | 2026-05-04 |
| 90 | Lattice-Guided Consistency Regularization of Dual-Mode Transducers for Automatic Speech Recognition Highlight: We introduce a novel approach to train a dual-mode transducer model that supports both autoregressive (AR) and non-autoregressive (NAR) inference using lattice-based consistency regularization for Automatic Speech Recognition (ASR). |
W. Ding; H. Xu; J. Balam; J. Lai; | icassp | 2026-05-04 |
| 91 | Leveraging Segment-Level Speech Representations for LLM-Based Speech Recognition Highlight: This paper proposes an LLM-ASR framework based on segment-level speech representations. |
S. Jiang; L. Dong; W. Wang; S. Gao; | icassp | 2026-05-04 |
| 92 | Medical ASR Enhancement By Domain-Specific Reinforcement Fine-Tuning Highlight: State-of-the-art systems optimize all tokens equally, under-weighting rare, domain-critical terms. We propose a medical-aware reinforcement fine-tuning (RFT) framework that generates multiple hypotheses, extracts medical terms with LLMs, validates them via UMLS, and optimizes a composite reward combining WER, medical WER (MWER), and length regularization. |
C. Wang; | icassp | 2026-05-04 |
| 93 | GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR Highlight: We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. |
P. Mehralian; M. Farasyn; A. Breitbarth; A.-S. Ghyselen; H. Van Hamme; | icassp | 2026-05-04 |
| 94 | Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization Highlight: In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. |
G. WAN et al. | icassp | 2026-05-04 |
| 95 | Synthetic Data Domain Adaptation for ASR Via LLM-Based Text and Phonetic Respelling Augmentation Highlight: We propose a synthetic-data-based domain adaptation framework with two contributions: (1) a large language model (LLM)-based text augmentation pipeline with a filtering strategy that balances lexical diversity, perplexity, and domain-term coverage, and (2) phonetic respelling augmentation (PRA), a novel method that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings. |
N. Yamashita; K. Nagatsuka; H. Kokubo; K. Dohi; T. V. Ho; | icassp | 2026-05-04 |
| 96 | CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR Highlight: We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). |
M. Shakeel; Y. Fukumoto; C. Maeda; C.-J. Lin; S. Watanabe; | icassp | 2026-05-04 |
| 97 | BBPE16: UTF-16-Based Byte-Level Byte-Pair Encoding for Improved Multilingual Speech Recognition Highlight: We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. |
H. Kim; H. Kim; M. Lee; K. Lee; | icassp | 2026-05-04 |
| 98 | LOTUSDIS: A Thai Far-Field Meeting Corpus for Robust Conversational ASR Highlight: We present LOTUSDIS, a publicly available Thai meeting corpus designed to advance far-field conversational ASR. |
P. Tipaksorn; S. Thatphithakkul; V. Chunwijitra; K. Thangthai; | icassp | 2026-05-04 |
| 99 | Phoneme-Level Visual Speech Recognition Via Point-Visual Fusion and Language Model Reconstruction Highlight: To address this, our work focuses on phoneme-level modeling, which provides a finer linguistic granularity than visemes while reducing the reliance on large-scale pretraining. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by a Large Language Model (LLM) for sentence reconstruction to address these challenges. |
M. K. K. Teng; H. Zhang; T. Saitoh; | icassp | 2026-05-04 |
| 100 | Whisper-QF: Leveraging Dual Cross-Attention Q-Former for Speech Emotion Recognition With Multi-Task Learning Highlight: In this paper, we propose Whisper-QF, a multi-task framework that integrates Whisper with a lightweight query-based transformer module (Q-Former) to enhance emotional understanding. |
Z. Zhuang; T. Wei; Y. Shi; S. Wang; J. Xiao; | icassp | 2026-05-04 |
| 101 | Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio Highlight: In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). |
M. Shi; X. Xiao; R. Fan; S. Ling; J. Li; | icassp | 2026-05-04 |
| 102 | Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens Highlight: This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. |
P.-J. KU et al. | icassp | 2026-05-04 |
| 103 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Highlight: Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances the semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. |
X. Zhang; L. Li; X. Lu; J. Liu; K. A. Lee; | icassp | 2026-05-04 |
| 104 | Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition Highlight: In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. |
A. S. Menon; K. Tripathi; R. Gohil; P. Wasnik; | icassp | 2026-05-04 |
| 105 | WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error Highlight: We propose WAV2LEV, a novel paradigm for WER estimation which predicts the underlying sequences of Levenshtein edit operations (substitutions, deletions, insertions and matches) from which the WER can be computed. |
H. Donnelly; K. Shi; G. Penn; | icassp | 2026-05-04 |
| 106 | OMNI-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models Highlight: To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. |
U. Cappellazzo; X. Liu; P. Ma; S. Petridis; M. Pantic; | icassp | 2026-05-04 |
| 107 | Confidence-Guided Error Correction for Disordered Speech Recognition Highlight: In particular, we propose confidence-informed prompting, where word-level uncertainty estimates are embedded directly into LLM training to improve robustness and generalization across speakers and datasets. |
A. Hernandez; T. Arias-Vergara; A. Maier; P. A. Pérez-Toro; | icassp | 2026-05-04 |
| 108 | Exploring SSL Discrete Tokens for Multilingual Automatic Speech Recognition Abstract: With the advancement of self-supervised learning (SSL) in speech-related tasks, there has been growing interest in using discrete tokens generated by SSL models for automatic … |
M. Cui; | icassp | 2026-05-04 |
| 109 | Adversarial Fine-Tuning on Speech Foundation Model with Vulnerable Attention Consistency Regularization for Robust Speech Recognition Highlight: In this work, for the first time, we study how to fine-tune Transformer-based SFMs for adversarially robust ASR, especially preserving utility while improving robustness. |
Y. Wang; B. Wu; L. Liu; | icassp | 2026-05-04 |
| 110 | Decoder-Only Conformer with Modality-Aware Sparse Mixtures of Experts for ASR Highlight: We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). |
J. Lee; M. Mimura; | icassp | 2026-05-04 |
| 111 | Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams Highlight: In this work, we propose a method that decouples the inference cost of activity-conditioned ASR systems from the number of speakers by converting speaker-specific activity outputs into two speaker-agnostic streams. |
X. He; A. Polok; J. Villalba; T. Thebaud; M. Maciejewski; | icassp | 2026-05-04 |
| 112 | Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction Highlight: In this paper, we explore contextual biasing in SLLM based on acoustic cues associated with a set of common words whose pronunciations are partially similar to those of the target bias words. |
S. Novitasari; T. Fukuda; G. Kurata; G. Saon; | icassp | 2026-05-04 |
| 113 | DOMA: Leveraging Diffusion Language Models with Adaptive Prior for Intent Classification and Slot Filling Highlight: We propose DOMA, a model-agnostic framework that refines ASR transcripts for ICSF with diffusion language models (DLMs) and adaptive prior. |
S. Yang; | icassp | 2026-05-04 |
| 114 | Listen, But Don’t Leak: Sensitive Data Protection for Privacy Aware Automatic Speech Recognition with Acoustic Triggers Highlight: In this study, we introduce Protective Acoustic Triggering (PAT), a novel method for embedding privacy redaction directly into ASR models. |
T. Roy; N. T. Vu; | icassp | 2026-05-04 |
| 115 | Speaker Attributed Automatic Speech Recognition Using Speech Aware Llms Highlight: To address limitations in training data, we propose a data augmentation method that uses artificially concatenated multi-speaker conversations. |
H. Aronowitz; Z. Kons; A. Dekel; G. Saon; R. Hoory; | icassp | 2026-05-04 |
| 116 | Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER Highlight: This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge–Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. |
X. Zheng; S. Dong; B. Phukon; M. Hasegawa-Johnson; C. D. Yoo; | icassp | 2026-05-04 |
| 117 | Joint Autoregressive Modeling of Multi-Talker Overlapped Speech Recognition and Translation Highlight: We propose a novel end-to-end model for multi-talker speech recognition and translation that simultaneously generates multiple transcriptions and their corresponding translations from overlapped speech. |
T. Tanaka; | icassp | 2026-05-04 |
| 118 | Toward Conversational User Interface Via Voice Command Correction Highlight: We propose a speech-command-based correction system that enables users to issue natural-language instructions to refine recognition outputs with minimal effort. |
S.-J. Ding; C.-H. Chang; | icassp | 2026-05-04 |
| 119 | TAGARELA – A Portuguese Speech Dataset from Podcasts Highlight: Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. |
F. S. De Oliveira; | icassp | 2026-05-04 |
| 120 | Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs Highlight: In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. |
A. Anand; U. Cappellazzo; S. Petridis; M. Pantic; | icassp | 2026-05-04 |
| 121 | Do We Really Need Self-attention for Streaming Automatic Speech Recognition? Highlight: This work questions the suitability of transformers for specific domains. |
Y. Dkhissi; V. Vielzeuf; E. Allesiardo; A. Larcher; | icassp | 2026-05-04 |
| 122 | Align2speak: Improving TTS for Low Resource Languages Via ASR-Guided Online Preference Optimization Highlight: We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. |
S. Hussain; | icassp | 2026-05-04 |
| 123 | Advancing Semi-Supervised Child Speech Recognition with Omni-Temporal Classification Under Label Noise Highlight: In this paper, we investigate alternative alignment modeling for training child speech ASR systems, relaxing the strict alignment between speech and transcripts. |
J. Xie; J. H. L. Hansen; | icassp | 2026-05-04 |
| 124 | Bridging The Front-End and Back-End for Robust ASR Via Cross-Attention-Based U-Net Highlight: However, the simple linear fusion mechanism in OA only yields ASR improvements under mild conditions, while its performance degrades substantially in complex acoustic environments. To overcome this limitation, we propose a cross-attention-based U-Net module designed to effectively achieve interactive feature fusion. |
T. NING et al. | icassp | 2026-05-04 |
| 125 | Reference Microphone Selection for Guided Source Separation Based on The Normalized L-P Norm Highlight: In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement that are based on the normalized ℓp-norm, either using only the normalized ℓp-norm or combining the normalized ℓp-norm and the SNR to account for both differences in SNR and ELR across microphones. |
A. LOHMANN et al. | icassp | 2026-05-04 |
| 126 | Noise-Robust AV-ASR Using Visual Features Both in The Whisper Encoder and Decoder Highlight: Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0dB. |
Z. LI et al. | icassp | 2026-05-04 |
| 127 | Mixture To Beamformed Mixture: Leveraging Beamformed Mixture As Weak-Supervision for Speech Enhancement and Noise-Robust ASR Highlight: In multi-channel speech enhancement and robust automatic speech recognition (ASR), beamforming can typically improve the signal-to-noise ratio (SNR) of the target speaker and produce reliable enhancement with little distortion to target speech. With this observation, we propose to leverage beamformed mixture, which has a higher SNR of the target speaker than the input mixture, as a weak supervision to train deep neural networks (DNNs) to enhance the input mixture. |
Z. -Q. Wang; R. Pang; | icassp | 2026-05-04 |
| 128 | Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing Highlight: In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). |
M. WANG et. al. | icassp | 2026-05-04 |
| 129 | LLM-Based Post-ASR Error Correction for Disordered Speech Highlight: We present the first systematic study of large language model (LLM)-based post-ASR error correction for disordered speech. |
H. Wen; M. Assefa; A. Semsayan; E. Feo-Flushing; | icassp | 2026-05-04 |
| 130 | Multi-Task Learning For Speech Quality Assessment Using ASR-Derived Entropy Features Highlight: This work introduces a novel approach that exploits entropy computed from automatic speech recognition (ASR) model predictions as a quality indicator for non-reference speech quality assessment. |
T. D. Do; B. Thang Ta; V. H. Do; | icassp | 2026-05-04 |
| 131 | AccLID: Accent-aware Language Identification for Robust Multilingual Speech Recognition Highlight: We propose AccLID, a multi-modal ranking framework that improves accent robustness by integrating acoustic, temporal, linguistic, and phonetic evidence. |
R. Singh; | icassp | 2026-05-04 |
| 132 | SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition Highlight: Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. |
P. Wang; S. Watanabe; H. Van Hamme; | icassp | 2026-05-04 |
| 133 | A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection Highlight: Detecting toxic spans in spoken content is crucial for content moderation, but conventional two-stage pipelines that run Automatic Speech Recognition (ASR) followed by a Toxic Span Detection (TSD) model suffer from high latency and system complexity. To address these limitations, we propose a multi-task framework for Vietnamese audio-based toxic span detection that unifies ASR and TSD modules within a single model. |
V. L. -P. Huynh; H. B. Do; L. T. Nguyen; | icassp | 2026-05-04 |
| 134 | Group Relative Policy Optimization for Text-to-Speech with Large Language Models Highlight: This paper proposes a GRPO-based approach to enhance the performance of large language model (LLM)-based text-to-speech (TTS) models by deriving rewards from an off-the-shelf automatic speech recognition (ASR) model. |
C. Liu; Y. -J. Hu; Y. -Y. Gao; S. -L. Zhang; Z. -H. Ling; | icassp | 2026-05-04 |
| 135 | Enhancing Multilingual LLM-Based ASR with Mixture of Experts and Dynamic Downsampling Highlight: To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. |
G. Lin; Z. Chen; Y. Fu; K. Li; W. -Q. Zhang; | icassp | 2026-05-04 |
| 136 | TextlessRAG: End-to-End Visual Document RAG By Speech Without Text Highlight: We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. |
P. XIE et. al. | icassp | 2026-05-04 |
| 137 | TTA: Transcribe, Translate and Alignment for Cross-Lingual Speech Representation Highlight: To this end, we propose a lightweight TTA model specialized in speech semantics for more effective LLM integration. |
W. Liu; J. Li; Y. Shao; D. Yu; | icassp | 2026-05-04 |
| 138 | NGPT As A Scalable Architecture for Speech Recognition and Translation Highlight: The model introduces a hyperspherical normalization method that enables stable large-scale training without convergence difficulties. |
N. Tadevosyan; | icassp | 2026-05-04 |
| 139 | HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-Based TTS Highlight: Although instruction-based Text-to-Speech (Instruct-TTS) models are proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. |
S. Nie; X. Xing; J. Xing; B. Liu; X. Xu; | icassp | 2026-05-04 |
| 140 | Target-Speaker LLM-ASR with Speaker-Aware Speech Encoder Highlight: In this work, we propose a TS-ASR system based on large language models (LLMs), featuring a speaker-aware speech encoder (SASE) that extends conventional LLM-based ASR to the target-speaker scenario while preserving the original structure. |
M. Kim; S. Kim; | icassp | 2026-05-04 |
| 141 | Improving Automatic Speech Recognition By Mitigating Distortions Introduced By Speech Enhancement Under Drone Noise Highlight: This paper proposes a method to improve automatic speech recognition (ASR) performance on audio recordings contaminated by drone noise by mitigating distortions resulting from speech enhancement (SE). |
R. Miura; T. Osaki; B. Yen; T. Ashizawa; K. Nakadai; | icassp | 2026-05-04 |
| 142 | Probing The Hidden Talent of ASR Foundation Models for L2 English Oral Assessment Highlight: In this paper, we explore the untapped potential of Whisper [1], a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). |
F. -A. Chao; B. -C. Yan; B. Chen; | icassp | 2026-05-04 |
| 143 | Mind The Shift: Using Delta SSL Embeddings to Enhance Child ASR Highlight: Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. |
Z. Wang; N. B. Shankar; K. Zhang; Z. Wang; A. Alwan; | icassp | 2026-05-04 |
| 144 | Production-Scale Dynamic Vocabulary ASR Biasing with Word-Level FST and Robust Training Highlight: Third, the general ASR accuracy degrades relative to a vanilla ASR system and usually deletes entities that are hard to recognize, which poses an issue for the no-bias condition, a scenario that can be quite common in production deployment. In this work, we present several enhancements to the original DynVoc model that mitigate these problems, scale the solution to production-level ASR and achieve substantial gains over the vanilla DynVoc method. |
J. E. García Lainez; T. Sun; S. Ling; Y. Gong; H. Wang; | icassp | 2026-05-04 |
| 145 | Lamer-SSL: Layer-Aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-Supervised Models Without Forgetting Highlight: Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. |
J. Xu; M. Wu; X. Chen; X. Wu; H. Meng; | icassp | 2026-05-04 |
| 146 | PhoenixDSR: Phoneme-Guided and LLM-Enhanced Dysarthric Speech Recognition Highlight: We present PhoenixDSR, a phoneme-mediated framework that decouples acoustic variability from linguistic decoding. |
Y. WU et. al. | icassp | 2026-05-04 |
| 147 | PAC: Pronunciation-Aware Contextualized Large Language Model-Based Automatic Speech Recognition Highlight: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. |
L. FU et. al. | icassp | 2026-05-04 |
| 148 | Understanding Fréchet Speech Distance for Synthetic Speech Quality Evaluation Highlight: Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. |
J. -W. Kim; D. Agarwal; F. Cerina; | icassp | 2026-05-04 |
| 149 | Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling Than CoT Prompting? Highlight: In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. |
O. Pareras; G. I. Gállego; F. Costa; C. España-Bonet; J. Hernando; | icassp | 2026-05-04 |
| 150 | Bridging The Gap: A Comparative Exploration of Speech-LLM and End-to-End Architecture for Multilingual Conversational ASR Highlight: In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. |
Y. Mei; D. Xu; J. Liang; Y. Long; | icassp | 2026-05-04 |
| 151 | SCALE: Semantic Chunking and Label-Delay Engine For Streaming Speech-LLM Highlight: Existing approaches typically employ fixed-size chunking to maintain low latency, which often compromises recognition accuracy. We propose SCALE, a streaming ASR framework that addresses this challenge through three key techniques: (a) dynamic chunk boundary prediction leveraging semantic information, replacing rigid fixed-size chunking, (b) intra-chunk bidirectional attention mechanism for efficient acoustic context modeling, and (c) label delay training strategy enabling stable word predictions with smaller chunks. |
A. Jaiswal; | icassp | 2026-05-04 |
| 152 | In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions Highlight: In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. |
X. FAN et. al. | icassp | 2026-05-04 |
| 153 | Advanced Modeling of Interlanguage Speech Intelligibility Benefit with L1-L2 Multi-task Learning Using Differentiable K-means for Accent-robust Discrete Token-based ASR Highlight: In this study, we propose a more advanced modeling of ISIB. |
K. Onda; S. Fukayama; D. Saito; N. Minematsu; | icassp | 2026-05-04 |
| 154 | Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating for Robust Code-Switching ASR Highlight: We propose Language-Infused Retrieval-Augmented CTC with Adaptive Soft-Hard Gating (LIRA-CTC), which explicitly incorporates language posterior information into the retrieval process. |
Z. Liang; S. Nakamura; | icassp | 2026-05-04 |
| 155 | MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA Highlight: To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. |
Y. Song; | icassp | 2026-05-04 |
| 156 | Evaluating ASR-LLM Setups for Japanese Speech Recognition with Multi-Pass Augmented Generative Error Correction Highlight: This work explores how LLM-based GER can enhance and expand the capabilities of Japanese ASR, providing a systematic evaluation on Japanese data with 0.9-2.6k text utterances. |
Y. Ko; S. Li; C. -H. H. Yang; T. Kawahara; | icassp | 2026-05-04 |
| 157 | From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding Highlight: It is the generative decoder’s attempt to synthesize plausible outputs from excessively compressed tokens missing some semantic information. In this work, we propose language model-driven losses (LM loss) and show they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. |
J. Yi; M. Kim; | icassp | 2026-05-04 |
| 158 | Dual-Grained Routing Guided Multi-LoRA Experts for Multilingual Low-Resource Speech Recognition Highlight: In this paper, we propose a novel dual-grained routing guided Multi-LoRA approach for multilingual low-resource ASR. |
Y. Wang; H. Zhang; H. Wang; L. Sun; M. Song; | icassp | 2026-05-04 |
| 159 | Impact of Phonetics on Speaker Identity in Adversarial Voice Attack Highlight: In this work, we analyze adversarial audio at the phonetic level and show that perturbations are associated with systematic phonetic tendencies, such as vowel centralization and consonant substitutions. |
D. K. Dar; Q. Yan; L. Xiao; A. Ross; | icassp | 2026-05-04 |
| 160 | OCR-Enhanced Multimodal ASR Can Read While Listening Highlight: In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese. |
J. Chen; C. Tang; Y. Li; G. Sun; C. Zhang; | icassp | 2026-05-04 |
| 161 | VOICE2MODE: Phonation Mode Classification in Singing Using Self-Supervised Speech Models Highlight: We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. |
A. A. Justus; R. Agrawal; S. R. Kadiri; S. Narayanan; | icassp | 2026-05-04 |
| 162 | Inverse-Hessian Regularization for Continual Learning in ASR Highlight: In this work, we propose Inverse Hessian Regularization (IHR), a memory-free approach for CL in ASR that incorporates curvature information into the merging step. |
S. Vander Eeckt; H. Van Hamme; | icassp | 2026-05-04 |
| 163 | Chain of Correction for Full-Text Speech Recognition with Large Language Models Highlight: However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. |
Z. TANG et. al. | icassp | 2026-05-04 |
| 164 | Whisper: Courtside Edition – Enhancing ASR Performance Through LLM-Driven Context Generation Highlight: We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline tailored for the domain of basketball commentary, that enhances Whisper’s transcription accuracy without any model retraining. |
Y. Ron; S. Gilboa; T. Dubnov; | icassp | 2026-05-04 |
| 165 | Frontend Token Enhancement for Token-Based Speech Recognition Highlight: In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. |
T. Ashihara; S. Horiguchi; K. Matsuura; T. Ochiai; M. Delcroix; | icassp | 2026-05-04 |
| 166 | HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling Highlight: Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. |
Y. SI et. al. | icassp | 2026-05-04 |
| 167 | Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition Highlight: This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. |
N. Pokel; P. Moure; R. Boehringer; S. -C. Liu; Y. Gao; | icassp | 2026-05-04 |
| 168 | CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications Highlight: This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). |
M. U. Farooq; O. Saz; | icassp | 2026-05-04 |
| 169 | Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition Highlight: While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. |
I. Pandey; A. Mittal; V. Bahuguna; G. Ramakrishnan; | icassp | 2026-05-04 |
| 170 | A Dual-Channel ASR-LLM Architecture with A Progressive Training Strategy for Low-Resource Speech Recognition Highlight: To overcome this limitation and enable a more comprehensive flow of information, we propose a novel dual-channel architecture, termed DCPT, that provides the LLM with both information streams simultaneously, integrating acoustic evidence from the encoder as a foundational anchor with evolving linguistic hypotheses from the decoder as dynamic guidance. To train this complex system effectively, we introduce a progressive two-stage training strategy that first establishes a robust foundational alignment before learning the dynamic fusion, a decoupled approach that ensures stability and efficiency. |
Q. Ma; H. Jia; S. Huang; L. He; | icassp | 2026-05-04 |
| 171 | Learning to Align with Unbalanced Optimal Transport in Linguistic Knowledge Transfer for ASR Highlight: In this paper, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. |
X. Lu; P. Shen; H. Kawai; | icassp | 2026-05-04 |
| 172 | When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition Highlight: We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. |
PEHUÉN MOURE et. al. | arxiv-cs.AI | 2026-05-04 |
| 173 | TMS: Text-Prompted Multi-Channel Speech Separation on Smart Glasses Highlight: When conversational speech overlaps, the performance of on-device audio assistants on smart glasses degrades markedly. We address this with a text-prompted multichannel speech separation framework. |
Y. Liu; | icassp | 2026-05-04 |
| 174 | Multi Stage Training with Dynamic Data Balancing for Multilingual Speech Recognition and Translation Highlight: Training large-scale multilingual speech models is often hindered by severe data imbalances across tasks, languages, and corpora. We introduce a systematic, multi-stage training framework to address this challenge. |
N. KOLUGURI et. al. | icassp | 2026-05-04 |
| 175 | Whisper with Benefits: A Unified Approach to Speech and Speaker Attribute Recognition Highlight: We present a unified approach that extends Whisper, a state-of-the-art end-to-end ASR model, to simultaneously recognize linguistic content and speaker attributes, including gender, accent, age range, and roles such as doctor and patient, without additional computational overhead. |
R. N. Q. K. Duong; | icassp | 2026-05-04 |
| 176 | Mixtures of Lightweight Articulatory Experts for Multilingual ASR Highlight: We propose mixtures of lightweight articulatory experts (MoLAE) trained with a novel multilabel articulatory CTC objective. |
M. Mimura; J. Lee; R. Magoshi; T. Kawahara; | icassp | 2026-05-04 |
| 177 | Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition Highlight: This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. |
M. Hong; | icassp | 2026-05-04 |
| 178 | Ara-BEST-RQ: Multi Dialectal Arabic SSL Highlight: We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. |
H. Elleuch; R. Whetten; S. Mdhaffar; Y. Estève; F. Bougares; | icassp | 2026-05-04 |
| 179 | A Study of Data Selection Strategies for Pre-Training Self-Supervised Speech Models Highlight: We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. |
R. Whetten; T. Parcollet; M. Dinarelli; Y. Estève; | icassp | 2026-05-04 |
| 180 | Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study Highlight: We introduce two conditions under which unsupervised speech recognition is possible. |
Z. Yang; J. Barkoczi; R. Schlüter; H. Ney; | icassp | 2026-05-04 |
| 181 | Accent-Invariant Automatic Speech Recognition Via Saliency-Driven Spectrogram Masking Highlight: Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. |
M. H. Sameti; S. H. Moridani; A. Zarean; H. Sameti; | icassp | 2026-05-04 |
| 182 | Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR Highlight: Using the Common European Framework of Reference for Languages (CEFR)-graded Speak & Improve corpus, we show that naïve fine-tuning of Whisper reduces the average word error rate (WER) but simultaneously widens performance disparities and disproportionately harms lower-proficiency learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. |
L. Sun; C. Zhu; S. Shi; | icassp | 2026-05-04 |
| 183 | SS-JDSC: Single-Speaker Japanese Dysarthric Speech Corpus Highlight: Existing dysarthric-speech corpora have facilitated research progress, but suffer from two major limitations: restricted language coverage and limited data per speaker. In this paper, the SS-JDSC, the first open-source corpus of Japanese dysarthric speech for automatic speech recognition (ASR), is presented to address these challenges. |
A. Ogasawara; S. Takamichi; J. Yang; G. Suenaga; Y. Tan; | icassp | 2026-05-04 |
| 184 | HATS: An Open Data Set Integrating Human Perception Applied to The Evaluation of Automatic Speech Recognition Metrics Highlight: We investigated the relationship between human preferences and various ASR evaluation metrics, including lexical and embedding-based ones, the latter being those that supposedly correlate the most with human perception. |
Thibault Bañeras Roux; Jane Wottawa; Mickael Rouvier; Teva Merlin; Richard Dufour; | arxiv-cs.CL | 2026-04-30 |
| 185 | AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR Highlight: This work presents the AppTek Call-Center Dialogues corpus, a collection of spontaneous, role-played agent-customer conversations spanning fourteen English accents covering sixteen service-oriented scenarios. |
EUGEN BECK et. al. | arxiv-cs.CL | 2026-04-30 |
| 186 | Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition Highlight: In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. |
Thibault Bañeras-Roux; Mickaël Rouvier; Jane Wottawa; Richard Dufour; | arxiv-cs.CL | 2026-04-30 |
| 187 | Multimodal LLMs Are Not All You Need for Pediatric Speech Language Pathology Highlight: We propose a cascading approach from binary classification to type and symptom classification. |
Darren Fürst; Sebastian Steindl; Ulrich Schäfer; | arxiv-cs.CL | 2026-04-29 |
| 188 | StarDrinks: An English and Korean Test Set for SLU Evaluation in A Drink Ordering Scenario Highlight: Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. |
MARCELY ZANON BOITO et. al. | arxiv-cs.CL | 2026-04-29 |
| 189 | Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing Highlight: We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. |
Yurii Halychanskyi; Nimet Beyza Bozdag; Mark Hasegawa-Johnson; Dilek Hakkani-Tür; Volodymyr Kindratenko; | arxiv-cs.SD | 2026-04-29 |
| 190 | Assessing The Reliability, Accuracy, and Relevance of Artificial Intelligence Speech Recognition for Clinical Documentation: A Scoping Review Highlight: The scoping review employed the methodology developed by Arksey and O’Malley in 2005 and further expanded by Levac and Colquhoun in 2010. |
Samuel Atiku; Kehinde Owolanke; Olufisayo Olakotan; | Journal of Evaluation in Clinical Practice | 2026-04-29 |
| 191 | WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition Highlight: We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. |
Erfan Ramezani; Mohammad Mahdi Giahi; Mohammad Erfan Zarabadipour; Amir Reza Yosefian; Hamid Ghadiri; | arxiv-cs.CL | 2026-04-28 |
| 192 | A Comprehensive Review of Speech Recognition Systems: Architectures, Techniques, and Future Directions Highlight: This paper presents a comprehensive review of speech recognition systems, focusing on architectural developments, modeling techniques, datasets, evaluation metrics, and applications. |
Geddamuri Purna Narayanamma; Nandamuri Navya Sri Madhu Viharika; Tetala Shaharsha Devi; Miriyala Kavya; | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING … | 2026-04-27 |
| 193 | AI-Driven Web-Based Video Conferencing System with Real-Time Meeting Intelligence and Analytics Highlight: This paper presents an AI-Driven Web-Based Video Conferencing System that integrates real-time communication with an embedded artificial intelligence analytics pipeline. |
Asina Begam A; Ramya K; Anushiya V; Dharshani A; Dharshini T; | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING … | 2026-04-27 |
| 194 | RAS: A Reliability Oriented Metric for Automatic Speech Recognition Highlight: We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. |
WENBIN HUANG et. al. | arxiv-cs.SD | 2026-04-27 |
| 195 | Applying Deep Learning Algorithms for Speech Recognition in Speech-Impaired Children Highlight: This paper investigates the application of deep learning algorithms for automatic speech recognition (ASR) specifically adapted to the speech patterns of children with speech-language disorders. |
Madagala Surya Satya Sai Anjali; Dangeti Praveena; Guthi Lalitha Veera Bharathi; Inti Visalakshi; | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING … | 2026-04-27 |
| 196 | 2nd of The 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA Highlight: Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. |
Zhiyu Wang; Xudong Kang; Shutao Li; | arxiv-cs.CV | 2026-04-26 |
| 197 | Identifying and Typifying Demographic Unfairness in Phoneme-level Embeddings of Self-supervised Speech Recognition Models Highlight: This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. |
Felix Herron; Solange Rossato; Alexandre Allauzen; François Portet; | arxiv-cs.CL | 2026-04-24 |
| 198 | Evaluation of Automatic Speech Recognition Using Generative Large Language Models Highlight: This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. |
THIBAULT BAÑERAS-ROUX et. al. | arxiv-cs.CL | 2026-04-23 |
| 199 | Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India Highlight: In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. |
KAUSHAL BHOGALE et. al. | arxiv-cs.CL | 2026-04-21 |
| 200 | UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction Highlight: In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. |
Yadong Li; Guoxin Wu; Haiping Hou; Biye Li; | arxiv-cs.AI | 2026-04-21 |
| 201 | Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages Highlight: We present a phoneme-level analysis of automatic speech recognition (ASR) for two low-resourced and phonologically complex East Caucasian languages, Archi and Rutul, based on curated and standardized speech-transcript resources totaling approximately 50 minutes and 1 hour 20 minutes of audio, respectively. |
V. S. D. S. Mahesh Akavarapu; Michael Daniel; Gerhard Jäger; | arxiv-cs.CL | 2026-04-20 |
| 202 | Where Do Self-Supervised Speech Models Become Unfair? Highlight: To our knowledge, we present the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing each embedding layer for speaker identification (SID) and automatic speech recognition (ASR). |
Felix Herron; Maja Hjuler; Solange Rossato; Alexandre Allauzen; François Portet; | arxiv-cs.CL | 2026-04-20 |
| 203 | Efficient Punctuation Restoration Via Weighted Lookahead Scoring Method for Streaming ASR Systems Highlight: This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. |
Sungmook Woo; Hyungu Kang; Chanwoo Kim; | arxiv-cs.CL | 2026-04-17 |
| 204 | MUSCAT: MUltilingual, SCientific ConversATion Benchmark Highlight: However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. |
SUPRITI SINHAMAHAPATRA et. al. | arxiv-cs.CL | 2026-04-17 |
| 205 | Elderly-Contextual Data Augmentation Via Speech Synthesis for Elderly ASR Highlight: In this work, we address data scarcity in EASR through a data augmentation pipeline that combines large language model (LLM)-based transcript paraphrasing with text-to-speech (TTS) synthesis. |
MINSIK LEE et. al. | arxiv-cs.CL | 2026-04-15 |
| 206 | Pushing The Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference Highlight: Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA’s Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. |
NENAD BANFIC et. al. | arxiv-cs.AI | 2026-04-15 |
| 207 | Diffusion Language Models for Speech Recognition Highlight: In this work, we explore variants for their use in speech recognition. |
Davyd Naveriani; Albert Zeyer; Ralf Schlüter; Hermann Ney; | arxiv-cs.CL | 2026-04-15 |
| 208 | Multilingual Conference Audio Transcription, Speaker Diarization And Translation System Highlight: Conventional approaches based on manual transcription, speaker identification, and translation are inefficient, costly, and susceptible to errors caused by overlapping speech, accents, background noise, and domain-specific terminology. To overcome these limitations, this work presents an AI-driven framework that integrates Automatic Speech Recognition (ASR), speaker diarization, and Neural Machine Translation (NMT). |
Ms. S. Aswini; | International Research Journal on Advanced Engineering Hub … | 2026-04-13 |
| 209 | BlasBench: An Open Benchmark for Irish Speech Recognition Highlight: The best open model (omniASR LLM 7B) achieves 30.65% WER on Common Voice and 39.09% on FLEURS. |
Jyoutir Raj; John Conway; | arxiv-cs.CL | 2026-04-12 |
| 210 | From Speech to Profile: A Protocol-Driven LLM Agent for Psychological Profile Generation Abstract: The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize the … |
XINGJIAN YANG et. al. | arxiv-cs.SD | 2026-04-11 |
| 211 | Transcribing Children’s Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions Highlight: Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. |
Gus Lathouwers; Lingyun Gao; Catia Cucchiarini; Helmer Strik; | arxiv-cs.CL | 2026-04-10 |
| 212 | Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition Highlight: Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. |
PENG WANG et. al. | arxiv-cs.CL | 2026-04-10 |
| 213 | Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages Highlight: We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. |
Aditya Narayan Sankaran; Reza Farahbakhsh; Noel Crespi; | arxiv-cs.SD | 2026-04-10 |
| 214 | Closing The Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR Highlight: However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. |
THIBAULT BAÑERAS-ROUX et. al. | arxiv-cs.CL | 2026-04-07 |
| 215 | Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation Highlight: This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. |
Hanif Rahman; | arxiv-cs.CL | 2026-04-06 |
| 216 | AI-Based Inclusive Assistive And Learning Support Platform For Visually Impaired And Autism Students Highlight: This paper discusses the design and development of an Artificial Intelligence (AI) enabled platform for assistive and learning support in an inclusive education context. |
Dr. A. Abdul Azeez Khan; Dr. K. Javubar Sathick; Mohamed Aaqil M.H; Pooja K.; | International Research Journal on Advanced Engineering Hub … | 2026-04-03 |
| 217 | Development and Multi-center Evaluation of Domain-adapted Speech Recognition for Human-AI Teaming in Real-world Gastrointestinal Endoscopy Highlight: Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. |
RUIJIE YANG et. al. | arxiv-cs.CL | 2026-04-02 |
| 218 | CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech Highlight: We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). |
Youssef Saidi; Haroun Elleuch; Fethi Bougares; | arxiv-cs.CL | 2026-04-02 |
| 219 | Adapting Text LLMs to Speech Via Multimodal Depth Up-Scaling Highlight: We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. |
Kazuki Yano; Jun Suzuki; Shinji Watanabe; | arxiv-cs.CL | 2026-04-01 |
| 220 | Speech LLMs Are Contextual Reasoning Transcribers Highlight: To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM’s textual latent space. |
Keqi Deng; Ruchao Fan; Bo Ren; Yiming Wang; Jinyu Li; | arxiv-cs.CL | 2026-04-01 |
| 221 | A Transformer-Based Method for Bidirectional French–Lingala Machine Translation in Speech and Text Highlight: In this paper, we propose a deep neural network pipeline for bidirectional French–Lingala automatic translation, covering both text-to-text and voice-to-text scenarios, by integrating Long Short-Term Memory (LSTM) and Transformer models on a specialized parallel corpus. |
REAGAN E. MANDIYA et. al. | Applied Sciences | 2026-03-31 |
| 222 | FLEURS-Kobani: Extending The FLEURS Dataset for Northern Kurdish Highlight: We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. |
Daban Q. Jaff; Mohammad Mohammadamini; | arxiv-cs.CL | 2026-03-31 |
| 223 | Investigation on The Robustness of Acoustic Foundation Models on Post Exercise Speech Highlight: In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. |
XIANGYUAN XUE et. al. | arxiv-cs.SD | 2026-03-29 |
| 224 | On The Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR Highlight: In this work, we analyze the effects of layer pruning in the Whisper encoder when used as the acoustic backbone of SLAM-ASR. |
Ganesh Pavan Kartikeya Bharadwaj Kolluri; Michael Kampouridis; Ravi Shekhar; | arxiv-cs.CL | 2026-03-29 |
| 225 | Two-Stage Acoustic Adaptation with Gated Cross-Attention Adapters for LLM-Based Multi-Talker Speech Recognition Highlight: While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). |
Hao Shi; Yuan Gao; Xugang Lu; Tatsuya Kawahara; | arxiv-cs.SD | 2026-03-28 |
| 226 | Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR Highlight: In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. |
SHASHI KUMAR et. al. | arxiv-cs.CL | 2026-03-27 |
| 227 | Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan Highlight: We present an ongoing effort to develop an ASR system for Ikema based on field recordings. |
Chihiro Taguchi; Yukinori Takubo; David Chiang; | arxiv-cs.CL | 2026-03-27 |
| 228 | Back to Basics: Revisiting ASR in The Age of Voice Agents Highlight: We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. |
GEEYANG TAY et. al. | arxiv-cs.AI | 2026-03-26 |
| 229 | CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR Highlight: We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. |
Shangkun Huang; Huan Shen; Wei Zou; Yunzhang Chen; | arxiv-cs.SD | 2026-03-26 |
| 230 | Goodness-of-pronunciation Without Phoneme Time Alignment Highlight: This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. |
Jeremy H. M. Wong; Nancy F. Chen; | arxiv-cs.CL | 2026-03-26 |
| 231 | A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English Highlight: This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. |
Dana Serditova; Kevin Tang; | arxiv-cs.CL | 2026-03-25 |
| 232 | Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages Highlight: We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. |
BADR M. ABDULLAH et. al. | arxiv-cs.CL | 2026-03-24 |
| 233 | Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics Highlight: To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. |
NAOHIRO TAWARA et. al. | arxiv-cs.CL | 2026-03-23 |
| 234 | SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models Highlight: This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. |
Tomer Atia; Yehudit Aperstein; Alexander Apartsin; | arxiv-cs.CL | 2026-03-23 |
| 235 | Precision-Varying Prediction (PVP): Robustifying ASR Systems Against Adversarial Attacks Highlight: We observe that changing the precision of an ASR model during inference reduces the likelihood of adversarial attacks succeeding. |
Matías Pizarro; Raghavan Narasimhan; Asja Fischer; | arxiv-cs.LG | 2026-03-23 |
| 236 | Ara-Best-RQ: Multi Dialectal Arabic SSL Highlight: We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. |
Haroun Elleuch; Ryan Whetten; Salima Mdhaffar; Yannick Estève; Fethi Bougares; | arxiv-cs.CL | 2026-03-23 |
| 237 | LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families Highlight: As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. |
Jianan Chen; Xiaoxue Gao; Tatsuya Kawahara; Nancy F. Chen; | arxiv-cs.CL | 2026-03-20 |
| 238 | Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech Highlight: Personalizing Automatic Speech Recognition (ASR) for non-normative speech remains challenging because data collection is labor-intensive and model training is technically complex. To address these limitations, we propose Adapt4Me, a web-based decentralized environment that operationalizes Bayesian active learning to enable end-to-end personalization without expert supervision. |
Niclas Pokel; Yiming Zhao; Pehuén Moure; Yingqiang Gao; Roman Böhringer; | arxiv-cs.HC | 2026-03-20 |
| 239 | Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM Based Multilingual Speech Recognition Highlight: In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. |
Yuxiang Mei; Delai Qiu; Shengping Liu; Jiaen Liang; Yanhua Long; | arxiv-cs.CL | 2026-03-18 |
| 240 | RECOVER: Robust Entity Correction Via Agentic Orchestration of Hypothesis Variants for Evidence-based Recovery Highlight: If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. |
Abhishek Kumar; Aashraya Sachdeva; | arxiv-cs.CL | 2026-03-17 |
| 241 | Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on The KIParla Corpus Highlight: Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. |
Martina Simonotti; Ludovica Pannitto; Eleonora Zucchini; Silvia Ballarè; Caterina Mauri; | arxiv-cs.CL | 2026-03-17 |
| 242 | Polyglot-Lion: Efficient Multilingual ASR for Singapore Via Balanced Fine-Tuning of Qwen3-ASR Highlight: We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. |
Quy-Anh Dang; Chris Ngo; | arxiv-cs.CL | 2026-03-17 |
| 243 | Tagarela – A Portuguese Speech Dataset from Podcasts Highlight: Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. |
FREDERICO SANTOS DE OLIVEIRA et. al. | arxiv-cs.CL | 2026-03-16 |
| 244 | Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization Highlight: In this work, we propose a two-stage adaptation framework consisting of speaker-independent fine-tuning (SI-FT) on multi-speaker non-normative data followed by SS-FT, and evaluate it through a controlled comparison with direct SS-FT under identical per-speaker conditions. |
Shan Jiang; Jiawen Qi; Chuanbing Huo; Yingqiang Gao; Qinyu Chen; | arxiv-cs.SD | 2026-03-16 |
| 245 | Lost in Transcription: Subtitle Errors in Automatic Speech Recognition Reduce Speaker and Content Evaluations Highlight: In this work, we examined how subtitle errors affect evaluations of speakers and their content using a preregistered online experiment (N=207, U.S.-based crowdworkers). |
Kowe Kadoma; Priyal Shrivastava; Mor Naaman; | arxiv-cs.HC | 2026-03-16 |
| 246 | Vietnamese Automatic Speech Recognition: A Revisit Highlight: For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. |
Thi Vu; Linh The Nguyen; Dat Quoc Nguyen; | arxiv-cs.CL | 2026-03-15 |
| 247 | TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling Highlight: However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. |
Liang-Hsuan Tseng; Hung-yi Lee; | arxiv-cs.CL | 2026-03-12 |
| 248 | Distilling LLM Semantic Priors Into Encoder-Only Multi-Talker ASR with Talker-Count Routing Highlight: Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. |
HAO SHI et. al. | arxiv-cs.SD | 2026-03-11 |
| 249 | Huntington Disease Automatic Speech Recognition with Biomarker Supervision Highlight: We present a systematic HD-ASR study using a high-fidelity clinical speech corpus not previously used for end-to-end ASR training. |
Charles L. Wang; Cady Chen; Ziwei Gong; Julia Hirschberg; | arxiv-cs.LG | 2026-03-11 |
| 250 | Continued Pretraining for Low-Resource Swahili ASR: Achieving State-of-the-Art Performance with Minimal Labeled Data Highlight: We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). |
Hillary Mutisya; John Mugane; | arxiv-cs.SD | 2026-03-11 |
| 251 | Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming Automatic Speech Recognition Highlight: In this paper, we propose Uni-ASR, a unified framework based on LLMs that integrates both non-streaming and streaming speech recognition capabilities. |
Yinfeng Xia; Jian Tang; Junfeng Hou; Gaopeng Xu; Haitao Yao; | arxiv-cs.SD | 2026-03-11 |
| 252 | Duration Aware Scheduling for ASR Serving Under Workload Drift Highlight: Scheduling policies in large-scale Automatic Speech Recognition (ASR) serving pipelines play a key role in determining end-to-end (E2E) latency. |
Darshan Makwana; Yash Jogi; Harsh Kotta; Aayush Kubba; | arxiv-cs.LG | 2026-03-11 |
| 253 | AlphaFlowTSE: One-Step Generative Target Speaker Extraction Via Conditional AlphaFlow Highlight: We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. |
DUOJIA LI et. al. | arxiv-cs.SD | 2026-03-11 |
| 254 | SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases Highlight: We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. |
Laya Iyer; Angelina Wang; Sanmi Koyejo; | arxiv-cs.SD | 2026-03-10 |
| 255 | Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS Highlight: The paper highlights the challenges encountered and provides directions for future work. |
Rania Al-Sabbagh; | arxiv-cs.CL | 2026-03-09 |
| 256 | Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR Highlight: In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. |
RISHIKESH KUMAR SHARMA et. al. | arxiv-cs.CL | 2026-03-08 |
| 257 | Seeing The Context: Rich Visual Context-Aware Speech Recognition Via Multimodal Reasoning Highlight: Current AVSR approaches primarily focus on lip motion, largely overlooking rich context present in the video such as speaking scene and on-screen text. To tackle such CAVSR (AVSR including rich visual Context), we propose VASR, designed to see and reason over the visual context to improve speech recognition. |
WENJIE TIAN et al. | arxiv-cs.SD | 2026-03-07 |
| 258 | The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication Highlight: We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. |
Hanlong Li; Karishma Kamalahasan; Jiahui Li; Kazuhiro Nakadai; Shreyas Kousik; | arxiv-cs.RO | 2026-03-07 |
| 259 | Speak in Context: Multilingual ASR with Speech Context Alignment Via Contrastive Learning Highlight: In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. |
Yuchen Zhang; Haralambos Mouratidis; Ravi Shekhar; | arxiv-cs.CL | 2026-03-06 |
| 260 | Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision Highlight: While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. |
SHREYAS GOPAL et al. | arxiv-cs.CL | 2026-03-06 |
| 261 | Recognition of Speech Using Deep Learning-based Optimization Technique Highlight: In this work, a new technique called Fractional Super Bilterling Optimization with Random Multimodal Deep Learning-Convolutional Neural Network (FrSBO_RMDL-CNN) is proposed for speech recognition from unclear speech signals. |
Raja Bhargava; N. Arivazhagan; Kunchala Suresh babu; | International Journal of Wavelets, Multiresolution and … | 2026-03-06 |
| 262 | Beyond Word Error Rate: Auditing The Diversity Tax in Speech Recognition Through Dataset Cartography Highlight: In this paper, we explore the limitations of relying solely on lexical counts by systematically evaluating a broader class of non-linear and semantic metrics. |
Ting-Hui Cheng; Line H. Clemmensen; Sneha Das; | arxiv-cs.LG | 2026-03-05 |
| 263 | Boosting ASR Robustness Via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards Highlight: However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. |
Linghan Fang; Tianxin Xie; Li Liu; | arxiv-cs.SD | 2026-03-05 |
| 264 | Measuring The Redundancy of Decoder Layers in SpeechLLMs Highlight: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. |
Adel Moumen; Guangzhi Sun; Philip C Woodland; | arxiv-cs.CL | 2026-03-05 |
| 265 | Which Data Matter? Embedding-Based Data Selection for Speech Recognition Highlight: While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to addressing the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. |
ZAKARIA ALDENEH et al. | arxiv-cs.SD | 2026-03-05 |
| 266 | Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition Highlight: This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. |
MENGZE HONG et al. | arxiv-cs.CL | 2026-03-05 |
| 267 | TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling Highlight: We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. |
HAO-HUI XIE et al. | arxiv-cs.SD | 2026-03-05 |
| 268 | FLAMA: Frame-Level Alignment Margin Attack for Scene Text and Automatic Speech Recognition Highlight: As a result, decisive frames are treated implicitly, and optimization can become unnecessarily diffuse over long input sequences, hindering convergence and perceptual quality. To address the above issues, we propose FLAMA, a unified Frame-Level Alignment Margin Attack, which could be used for both STR and ASR models. |
Yikun Xu; Zhiheng Xu; Pengwen Dai; | Electronics | 2026-03-04 |
| 269 | When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper Highlight: Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. |
Akif Islam; Raufun Nahar; Md. Ekramul Hamid; | arxiv-cs.SD | 2026-03-04 |
| 270 | WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech Highlight: This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). |
Aurchi Chowdhury; Rubaiyat-E-Zaman; Sk. Ashrafuzzaman Nafees; | arxiv-cs.SD | 2026-03-04 |
| 271 | Speech Recognition Assisted By Large Language Models to Command Software Orally — Application to An Augmented and Virtual Reality Web App for Immersive Molecular Graphics Highlight: This project successfully developed, evaluated and integrated a Voice User Interface (VUI) into a web application that we are developing for immersive molecular graphics. |
Fabio Cortes Rodriguez; Luciano Abriata; | arxiv-cs.HC | 2026-03-03 |
| 272 | An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization Highlight: This paper presents a multistage approach developed for the DL Sprint 4.0 – Bengali Long-Form Speech Recognition and DL Sprint 4.0 – Bengali Speaker Diarization competitions on Kaggle, addressing the challenge of who spoke when/what in hour-long recordings. |
Epshita Jahan; Khandoker Md Tanjinul Islam; Pritom Biswas; Tafsir Al Nafin; | arxiv-cs.SD | 2026-03-03 |
| 273 | Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study Highlight: We introduce two conditions under which unsupervised speech recognition is possible. |
Zijian Yang; Jörg Barkoczi; Ralf Schlüter; Hermann Ney; | arxiv-cs.SD | 2026-03-02 |
| 274 | From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation Highlight: We investigate whether cross-model disagreement among heterogeneous ASR systems can act as a reference-free uncertainty signal to prioritize human verification in medical transcription workflows. |
Abdolamir Karbalaie; Fernando Seoane; Farhad Abtahi; | arxiv-cs.SD | 2026-03-02 |
| 275 | GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR Highlight: We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. |
Pouya Mehralian; Melissa Farasyn; Anne Breitbarth; Anne-Sophie Ghyselen; Hugo Van hamme; | arxiv-cs.CL | 2026-03-02 |
| 276 | VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications Highlight: We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. |
LOAN DO et al. | arxiv-cs.SD | 2026-03-02 |
| 277 | RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks Highlight: We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. |
Alexandra Diaconu; Mădălina Vînaga; Bogdan Alexe; | arxiv-cs.CL | 2026-03-02 |
| 278 | DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement Highlight: While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. |
MINGHUI WU et al. | arxiv-cs.SD | 2026-03-01 |
| 279 | End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation Highlight: In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. |
Minghui Wu; Haitao Tang; Jiahuan Fan; Ruizhi Liao; Yanyong Zhang; | arxiv-cs.SD | 2026-03-01 |
| 280 | ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition Highlight: ASR systems exhibit persistent performance disparities across accents, yet the internal mechanisms underlying these gaps remain poorly understood. We introduce ACES, a representation-centric audit that extracts accent-discriminative subspaces and uses them to probe model fragility and disparity. |
Swapnil Parekh; | arxiv-cs.SD | 2026-02-28 |
| 281 | Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis Highlight: We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. |
YU-SIANG LAN et al. | arxiv-cs.CL | 2026-02-26 |
| 282 | A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment Highlight: This paper presents a robust framework specifically engineered for extended Bangla content by leveraging preexisting models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. |
Zarif Ishmam; Zarif Mahir; Shafnan Wasif; Md. Ishtiak Moin; | arxiv-cs.SD | 2026-02-26 |
| 283 | Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization Via Extreme Augmentation and Perfect Alignment Highlight: In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. |
Sanjid Hasan; Risalat Labib; A H M Fuad; Bayazid Hasan; | arxiv-cs.SD | 2026-02-26 |
| 284 | Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment Highlight: Results show that ASR errors are significantly higher for individuals with dementia, revealing a critical usability gap. To better understand these disparities, we conducted an acoustic analysis of speech features and found that a speaker’s intensity, voice quality, and pause ratio predicted ASR accuracy. |
MICHELLE COHN et al. | arxiv-cs.HC | 2026-02-26 |
| 285 | ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization Highlight: In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). |
Md. Nazmus Sakib; Shafiul Tanvir; Mesbah Uddin Ahamed; H. M. Aktaruzzaman Mukdho; | arxiv-cs.CL | 2026-02-25 |
| 286 | Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization Highlight: We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. |
MD. Sagor Chowdhury; Adiba Fairooz Chowdhury; | arxiv-cs.CL | 2026-02-25 |
| 287 | Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing Highlight: Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducers (RNN-T). |
AN-CI PENG et al. | arxiv-cs.CL | 2026-02-25 |
| 288 | Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration Highlight: This paper presents and evaluates an optimized cascaded Nepali speech-to-English text translation (S2TT) system, focusing on mitigating structural noise introduced by Automatic Speech Recognition (ASR). |
Tangsang Chongbang; Pranesh Pyara Shrestha; Amrit Sarki; Anku Jaiswal; | arxiv-cs.CL | 2026-02-25 |
| 289 | 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio Highlight: We present frameworks for long form Bengali speech intelligence that address automatic speech recognition using a Whisper Medium based model and speaker diarization using a finetuned segmentation model. |
Ratnajit Dhar; Arpita Mallik; | arxiv-cs.SD | 2026-02-24 |
| 290 | ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition Via Audio Large Language Models Highlight: Semi-supervised learning in automatic speech recognition (ASR) typically relies on pseudo-labeling, which often suffers from confirmation bias and error accumulation due to noisy supervision. To address this limitation, we propose ReHear, a framework for iterative pseudo-label refinement that integrates an instruction-tuned, audio-aware large language model (LLM) into the self-training loop. |
Zefang Liu; Chenyang Zhu; Sangwoo Cho; Shi-Xiong Zhang; | arxiv-cs.CL | 2026-02-21 |
| 291 | Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Highlight: We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. |
Yonathan Ron; Shiri Gilboa; Tammuz Dubnov; | arxiv-cs.CL | 2026-02-21 |
| 292 | The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines? Abstract: Speech LLMs are widely understood to be better than ASR→LLM cascades since they have access to the audio directly, and not just the transcript. In this paper, we … |
Jayadev Billa; | ArXiv | 2026-02-19 |
| 293 | The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines? Highlight: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper→LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. |
Jayadev Billa; | arxiv-cs.CL | 2026-02-19 |
| 294 | AI-Driven Voice Translator for Cross-Language Communication Highlight: Our solution uses CTC, GRU, and Parallel WaveGAN. |
GEENU RAMU; P. KALYAN CHAKRAVARTHI; | International Journal of Scientific Research in Engineering … | 2026-02-18 |
| 295 | Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences Highlight: Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in English and Russian, drawn from diverse scientific domains. |
DMITRII KORZH et al. | iclr | 2026-02-17 |
| 296 | SENS-ASR: Semantic Embedding Injection in Neural-transducer for Streaming Automatic Speech Recognition Highlight: In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. |
Youness Dkhissi; Valentin Vielzeuf; Elys Allesiardo; Anthony Larcher; | arxiv-cs.CL | 2026-02-17 |
| 297 | Confident and Adaptive Generative Speech Recognition Via Conformal Risk Control Highlight: We propose an adaptive framework that dynamically determines the optimal number of hypotheses for each input using risk control. |
Amit Damri; Bracha Laufer-Goldshtein; | iclr | 2026-02-17 |
| 298 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition Highlight: We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC’s scaling issues. |
MARTIJN BARTELDS et al. | iclr | 2026-02-17 |
| 299 | Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization Highlight: Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. |
H. M. SHADMAN TABIB et al. | arxiv-cs.SD | 2026-02-15 |
| 300 | From Scarcity to Scale: A Release-Level Analysis of The Pashto Common Voice Dataset Highlight: This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. |
Jandad Jahani; Mursal Dawodi; Jawid Ahmad Baktash; | arxiv-cs.CL | 2026-02-15 |
| 301 | Voice2mode: Phonation Mode Classification in Singing Using Self-Supervised Speech Models Highlight: We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. |
Aju Ani Justus; Ruchit Agrawal; Sudarsana Reddy Kadiri; Shrikanth Narayanan; | arxiv-cs.SD | 2026-02-14 |
| 302 | ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark Highlight: In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. |
TUNG X. NGUYEN et al. | arxiv-cs.CL | 2026-02-13 |
| 303 | Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models Without Forgetting Highlight: Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. |
Jing Xu; Minglin Wu; Xueyuan Chen; Xixin Wu; Helen Meng; | arxiv-cs.CL | 2026-02-13 |
| 304 | Towards Explainable Reference-free Speech Intelligibility Evaluation of People with Pathological Speech Highlight: While existing objective assessments, particularly reference-based approaches, can capture intelligibility changes, they are often hindered by lack of explainability and the need for labor-intensive manual transcriptions. To address these issues, this work proposes the reference-free, explainable ASR Inconsistency Score. |
Bence Mark Halpern; Thomas Tienkamp; Defne Abur; | arxiv-cs.SD | 2026-02-13 |
| 305 | Moonshine V2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications Highlight: To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. |
Manjunath Kudlur; Evan King; James Wang; Pete Warden; | arxiv-cs.CL | 2026-02-12 |
| 306 | ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition Highlight: Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech Transformer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). |
Khoa Anh Nguyen; Long Minh Hoang; Nghia Hieu Nguyen; Luan Thanh Nguyen; Ngan Luu-Thuy Nguyen; | arxiv-cs.CL | 2026-02-10 |
| 307 | Where Are We At with Automatic Speech Recognition for The Bambara Language? Highlight: This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. |
SEYDOU DIALLO et al. | arxiv-cs.CL | 2026-02-10 |
| 308 | Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis Highlight: In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. |
HAOSHEN WANG et al. | arxiv-cs.SD | 2026-02-09 |
| 309 | Automated Legal Document Drafting in The Indian Judicial System: A Survey of AI-driven Approaches Highlight: We evaluate 20 academic papers published from 2017 to 2025 that investigate Automatic Speech Recognition (ASR), Natural Language Processing (NLP) and transfer learning models for legal systems. |
Hemamalini S; | International Journal for Research in Applied Science and … | 2026-02-06 |
| 310 | Speech Emotion Recognition Leveraging OpenAI’s Whisper Representations and Attentive Pooling Methods Highlight: This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. |
Ali Shendabadi; Parnia Izadirad; Mostafa Salehi; Mahmoud Bijankhan; | arxiv-cs.AI | 2026-02-05 |
| 311 | Speaker-Aware Simulation Improves Conversational Speech Recognition Highlight: In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. |
Máté Gedeon; Péter Mihajlik; | arxiv-cs.SD | 2026-02-04 |
| 312 | Frontend Token Enhancement for Token-Based Speech Recognition Highlight: In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. |
Takanori Ashihara; Shota Horiguchi; Kohei Matsuura; Tsubasa Ochiai; Marc Delcroix; | arxiv-cs.SD | 2026-02-04 |
| 313 | An ASR Transformer-Based Model for Kannada Speech-to-Text Transcription Highlight: This work presents a dialect-aware and noise-robust Kannada automatic speech recognition (ASR) system that bridges the gap between low-resource linguistic contexts and state-of-the-art deep learning models. |
Chandrika Prasad; Veenga Gode Swamy Rao; Geetha J; R China Appala Naidu; | Journal of Artificial Intelligence and Technology | 2026-02-03 |
| 314 | Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation Highlight: However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. |
Ligong Lei; Wenwen Lu; Xudong Pang; Zaokere Kadeer; Aishan Wumaier; | arxiv-cs.CL | 2026-02-03 |
| 315 | BBPE16: UTF-16-based Byte-level Byte-pair Encoding for Improved Multilingual Speech Recognition Highlight: We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. |
Hyunsik Kim; Haeri Kim; Munhak Lee; Kyungmin Lee; | arxiv-cs.CL | 2026-02-02 |
| 316 | Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition Highlight: We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. |
Wonjun Lee; Hyounghun Kim; Gary Geunbae Lee; | arxiv-cs.CL | 2026-02-02 |
| 317 | A Mixed Reality Tool with Automatic Speech Recognition for 3D CAD Based Visualization and Automatic Dimension Generation in The Industry 5.0 Shipyard Abstract: Industry 5.0 is composed of a variety of complex tasks and challenging processes requiring specialized labor and multidisciplinary coordination. Specifically, when it comes to … |
Aida Vidal-Balea; Antón Valladares-Poncela; Javier Vilar-Martínez; T. Fernández-Caramés; Paula Fraga-Lamas; | Multimodal Technol. Interact. | 2026-02-01 |
| 318 | MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA Highlight: To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. |
YUTONG SONG et al. | arxiv-cs.CL | 2026-01-31 |
| 319 | Qwen3-ASR Technical Report Highlight: In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. |
XIAN SHI et al. | arxiv-cs.CL | 2026-01-29 |
| 320 | Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation Highlight: Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. |
June-Woo Kim; Dhruv Agarwal; Federica Cerina; | arxiv-cs.SD | 2026-01-29 |
| 321 | A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models Highlight: We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. |
Ryan Whetten; Titouan Parcollet; Marco Dinarelli; Yannick Estève; | arxiv-cs.SD | 2026-01-28 |
| 322 | Text-only Adaptation in LLM-based ASR Through Text Denoising Highlight: We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. |
SERGIO BURDISSO et. al. | arxiv-cs.SD | 2026-01-28 |
| 323 | SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition Highlight: This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. |
Manali Sharma; Riya Naik; Buvaneshwari G; | arxiv-cs.SD | 2026-01-27 |
| 324 | Advanced Modeling of Interlanguage Speech Intelligibility Benefit with L1-L2 Multi-Task Learning Using Differentiable K-Means for Accent-Robust Discrete Token-Based ASR Highlight: In this study, we propose a more advanced modeling of ISIB. |
Kentaro Onda; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2026-01-27 |
| 325 | Mind The Shift: Using Delta SSL Embeddings to Enhance Child ASR Highlight: Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. |
Zilai Wang; Natarajan Balaji Shankar; Kaiyuan Zhang; Zihan Wang; Abeer Alwan; | arxiv-cs.CL | 2026-01-27 |
| 326 | Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition Highlight: While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. |
Isha Pandey; Ashish Mittal; Vartul Bahuguna; Ganesh Ramakrishnan; | arxiv-cs.CL | 2026-01-27 |
| 327 | Unheard in The Digital Age: Rethinking AI Bias and Speech Diversity Highlight: Given the current state of the debate, this article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence. |
Onyedikachi Hope Amaechi-Okorie; Branislav Radeljic; | arxiv-cs.HC | 2026-01-26 |
| 328 | VIBEVOICE-ASR Technical Report Highlight: This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. |
ZHILIANG PENG et. al. | arxiv-cs.SD | 2026-01-26 |
| 329 | Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries Highlight: Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. |
Yuchen Zhang; Ravi Shekhar; Haralambos Mouratidis; | arxiv-cs.CL | 2026-01-26 |
| 330 | OCR-Enhanced Multimodal ASR Can Read While Listening Highlight: In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese. |
Junli Chen; Changli Tang; Yixuan Li; Guangzhi Sun; Chao Zhang; | arxiv-cs.SD | 2026-01-26 |
| 331 | DLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition Highlight: We propose dLLM-ASR, an efficient dLLM-based ASR framework that formulates dLLM’s decoding as a prior-guided and adaptive denoising process. |
WENJIE TIAN et. al. | arxiv-cs.SD | 2026-01-25 |
| 332 | Post-ASR Correction for Low-Resource Rajasthani Language Highlight: We propose a multi-view, character-level sequence-to-sequence (Seq2Seq) model that uses a gated fusion mechanism to dynamically weigh information from the two ASR outputs. |
Abhishek Bhandari; Gaurav Harit; | ACM Transactions on Asian and Low-Resource Language … | 2026-01-24 |
| 333 | BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition Highlight: Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT, designed to address these challenges. |
Md Sazzadul Islam Ridoy; Mubaswira Ibnat Zidney; Sumi Akter; Md. Aminur Rahman; | arxiv-cs.SD | 2026-01-24 |
| 334 | Benchmarking Von ASR-Modellen Im Deutschen Medizinischen Kontext: Eine Leistungsanalyse Anhand Von Anamnesegesprächen Highlight: In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. |
THOMAS SCHUSTER et. al. | arxiv-cs.CL | 2026-01-23 |
| 335 | Sink or SWIM: Tackling Real-Time ASR at Scale Highlight: In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI’s Whisper model that enables true model-level parallelization for scalable, multilingual transcription. |
Federico Bruzzone; Walter Cazzola; Matteo Brancaleoni; Dario Pellegrino; | arxiv-cs.SD | 2026-01-22 |
| 336 | Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction Highlight: However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. |
JUNJIE AN et. al. | arxiv-cs.CL | 2026-01-21 |
| 337 | SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models IF:3 Highlight: The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. |
HAN YIN et. al. | aaai | 2026-01-20 |
| 338 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection Highlight: This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the pioneering method designed to safeguard audio privacy using reversible adversarial examples. |
JIAJIE ZHU et. al. | aaai | 2026-01-20 |
| 339 | Beyond Transcription: Mechanistic Interpretability in ASR Highlight: In this work, we adapt and systematically apply established interpretability methods such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. |
NETA GLAZER et. al. | aaai | 2026-01-20 |
| 340 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People Who Stutter Highlight: We present STEAMROLLER, a real time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. |
ZIQI XU et. al. | aaai | 2026-01-20 |
| 341 | BiCycle: Group-wise Recursive Transformer Based on ASR Mechanism Highlight: While RT has been successfully applied to large language models (LLMs), its effectiveness in automatic speech recognition (ASR) remains limited, despite the parallel trend of model scaling in the speech domain. In this paper, we reveal that conventional RT designs for LLMs are suboptimal for speech recognition, primarily because they do not fully consider the layer-wise specialization inherent in the ASR architecture, where lower layers focus on phonetic features and upper layers capture linguistic localization. |
Min Ho Jang; Eun Seo Seo; Jin Young Kim; Hyeongsoo Lim; Ji Won Yoon; | aaai | 2026-01-20 |
| 342 | Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding Highlight: Voice-based interfaces offer a more inclusive modality, but spoken queries involving code present unique challenges due to the presence of non-standard English usage, domain-specific vocabulary, and custom identifiers such as variable and function names, often combined with code-mixed expressions. In this work, we develop a multilingual speech-driven framework for code understanding that accepts spoken queries in a user's native language, transcribes them using Automatic Speech Recognition (ASR), applies code-aware ASR output refinement using Large Language Models (LLMs), and interfaces with code models to perform tasks such as code question answering and code retrieval through benchmarks such as CodeSearchNet, CoRNStack, and CodeQA. |
Jayant Havare; Ashish Mittal; Srikanth Tamilselvam; Ganesh Ramakrishnan; | arxiv-cs.SE | 2026-01-20 |
| 343 | HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding Highlight: To this end, we introduce the Human-level Perception in Spoken Speech Understanding (HPSU), a new benchmark for fully evaluating the human-level perceptual and understanding capabilities of Speech LLMs. |
CHEN LI et. al. | aaai | 2026-01-20 |
| 344 | WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-dimensional Annotation IF:3 Highlight: However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpus with multi-dimensional annotation tailored for speech understanding and generation. |
LONGHAO LI et. al. | aaai | 2026-01-20 |
| 345 | Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR Highlight: This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. |
Bingshen Mu; Hexin Liu; Hongfei Xue; Kun Wei; Lei Xie; | aaai | 2026-01-20 |
| 346 | Speech Recognition Model Improves Text-to-Speech Synthesis Using Fine-Grained Reward Highlight: In this context, we present an intriguing finding: encoder-decoder ASR models, such as Whisper, leverage their extensive pre-training to precisely capture word-level mismatches between speech and text within their cross-attention mechanisms, thereby providing a fine-grained reward signal. Building upon this insight, we propose a novel TTS optimization method, which we term Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). |
Guansu Wang; Peijie Sun; | aaai | 2026-01-20 |
| 347 | Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition Highlight: This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP^2 method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. |
YIMING RONG et. al. | aaai | 2026-01-20 |
| 348 | Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology Highlight: To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. |
Peter Sullivan; AbdelRahim Elmadany; Alcides Alcoba Inciarte; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2026-01-19 |
| 349 | Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition Highlight: We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. |
WARIT SIRICHOTEDUMRONG et. al. | arxiv-cs.CL | 2026-01-19 |
| 350 | DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems Highlight: To fill this gap, we first utilize gradient analysis to reveal that ASR and SR exhibit no inherent conflicts. Building on this, we propose Dual-task Universal Adversarial Perturbation (DUAP). |
Suyang Sun; Weifei Jin; Yuxin Cao; Wei Song; Jie Hao; | arxiv-cs.CR | 2026-01-19 |
| 351 | SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition Highlight: Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. |
Pu Wang; Shinji Watanabe; Hugo Van hamme; | arxiv-cs.SD | 2026-01-18 |
| 352 | CTC-DID: CTC-Based Arabic Dialect Identification for Streaming Applications Highlight: This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). |
Muhammad Umar Farooq; Oscar Saz; | arxiv-cs.CL | 2026-01-17 |
| 353 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for A Unified Chinese Wu Dialect Speech Processing Ecosystem Highlight: In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. |
CHENGYOU WANG et. al. | arxiv-cs.SD | 2026-01-16 |
| 354 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers Highlight: In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. |
Runyuan Cai; Yu Lin; Yiming Wang; Chunlin Fu; Xiaodong Zeng; | arxiv-cs.SD | 2026-01-15 |
| 355 | Acoustic and Tonal Modeling of The Tpuri Language Through A Multi-Modular Hybrid Approach Highlight: This paper presents a hybrid multimodular ASR architecture for Tpuri, a Mboum-Day Niger-Congo language spoken in Cameroon and Chad that exhibits contrastive lexical tone, vowel length and nasalisation. |
Jules Paulin Bayang Souloukna; Patrick Nounamou Dabou; Paul Dabou Patrick; | EAI Endorsed Transactions on Intelligent Systems and … | 2026-01-14 |
| 356 | MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus Highlight: While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). |
YEXING DU et. al. | arxiv-cs.CL | 2026-01-14 |
| 357 | SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing Highlight: We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. |
ZIYANG MA et. al. | arxiv-cs.SD | 2026-01-14 |
| 358 | Wav2Vec-based Audio Data Augmentation for Low-Resource Speech Recognition Highlight: This article focuses on performing ADA techniques namely the addition of noise, pitch shifting, increasing or decreasing of speed and adding reverberation to the audio signals. |
P. Haritha; P. Shanmugavadivu; | International Research Journal on Advanced Engineering Hub … | 2026-01-13 |
| 359 | Robust CAPTCHA Using Audio Illusions in The Era of Large Language Models: from Evaluation to Advances Highlight: In this paper, we introduce AI-CAPTCHA, a unified framework that offers (i) an evaluation framework, ACEval, which includes advanced LALM- and ASR-based solvers, and (ii) a novel audio CAPTCHA approach, IllusionAudio, leveraging audio illusions. |
ZIQI DING et. al. | arxiv-cs.SD | 2026-01-13 |
| 360 | SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages Highlight: A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. |
TIANYI XU et. al. | arxiv-cs.CL | 2026-01-13 |
| 361 | Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects Highlight: Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. |
Kalvin Chang; Yiwen Shao; Jiahong Li; Dong Yu; | arxiv-cs.CL | 2026-01-12 |
| 362 | Task Arithmetic with Support Languages for Low-Resource ASR Highlight: In this paper, we consider training on a particular language to be a task, and we generate task vectors by fine-tuning variants of the Whisper ASR system. |
Emma Rafkin; Dan DeGenaro; Xiulin Yang; | arxiv-cs.CL | 2026-01-11 |
| 363 | Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition Highlight: This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. |
Ayman Mansour; | arxiv-cs.CL | 2026-01-11 |
| 364 | Multimodal In-context Learning for ASR of Low-resource Languages Highlight: In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. |
Zhaolin Li; Jan Niehues; | arxiv-cs.CL | 2026-01-09 |
| 365 | Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties Highlight: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. |
Akriti Dhasmana; Aarohi Srivastava; David Chiang; | arxiv-cs.CL | 2026-01-07 |
| 366 | WESR: Scaling and Evaluating Word-level Event-Speech Recognition Highlight: Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. |
CHENCHEN YANG et. al. | arxiv-cs.CL | 2026-01-07 |
| 367 | Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization Highlight: Existing solutions like Quantization Error Propagation (QEP) are suboptimal for ASR due to the model’s heterogeneity, processing acoustic features in the encoder while generating text in the decoder. To address this, we propose Fine-grained Alpha for Dynamic Quantization Error Propagation (FADE), which adaptively controls the trade-off between cross-layer error correction and local quantization. |
XINYU WANG et. al. | arxiv-cs.SD | 2026-01-05 |
| 368 | Multi-channel Multi-speaker Transformer for Speech Recognition Highlight: Based on these, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. |
Guo Yifan; Tian Yao; Suo Hongbin; Wan Yulong; | arxiv-cs.SD | 2026-01-05 |
| 369 | Bridging The Gap: A Comparative Exploration of Speech-LLM and End-to-end Architecture for Multilingual Conversational ASR Highlight: In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. |
Yuxiang Mei; Dongxing Xu; Jiaen Liang; Yanhua Long; | arxiv-cs.CL | 2026-01-04 |
| 370 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection Highlight: This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the pioneering method designed to safeguard audio privacy using reversible adversarial examples. |
JIAJIE ZHU et. al. | arxiv-cs.SD | 2026-01-03 |
| 371 | The Human–machine Correlation in Automated Speech Evaluation: A Three-level Meta-analysis Highlight: Moderator analysis revealed significant moderating effects on the overall human–machine correlation from 10 variables: publication year, publication type, unit of sample, age group, level of task constraints, rater expertise, inter-rater reliability, system developer, feature engineering, and algorithm type. |
Yanxin Wang; Shangchao Min; | Language Testing | 2026-01-03 |
| 372 | AfriVox: Probing Multilingual and Accent Robustness of Speech LLMs Abstract: Recent advances in multimodal and speech-native large language models (LLMs) have delivered impressive speech recognition, translation, understanding, and question-answering … |
BUSAYO AWOBADE et. al. | Conference of the European Chapter of the Association for … | 2026-01-01 |
| 373 | Index-ASR Technical Report Highlight: Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. |
ZHESHU SONG et. al. | arxiv-cs.SD | 2025-12-31 |
| 374 | Fine-Tuning Whisper Model for Mandar Speech Recognition: Approach and Performance Evaluation Highlight: This study aims to enhance the performance of Automatic Speech Recognition (ASR) systems by fine-tuning the Whisper model using a Mandar-specific dataset. |
Jafar Jafar; Mar Athul Wazithah Tb; Firman Aziz; Rosary Iriany; Norma Nasir; | Journal of Applied Engineering and Technological Science … | 2025-12-29 |
| 375 | PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech Highlight: We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. |
Deepak Babu Piskala; | arxiv-cs.CL | 2025-12-29 |
| 376 | Kunnafonidilaw Ka Cadeau: An ASR Dataset of Present-day Bambara Highlight: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. |
Yacouba Diarra; Panga Azazia Kamate; Nouhoum Souleymane Coulibaly; Michael Leventhal; | arxiv-cs.CL | 2025-12-22 |
| 377 | From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs Highlight: This paper presents a case study on developing a professional subtitling system for an Italian media company. |
Alessandro Lucca; Francesco Pierri; | arxiv-cs.CL | 2025-12-22 |
| 378 | Automated Speech-fluency Explanations for Schizophrenia Diagnosis Highlight: In this study, we present a fully automated and explainable pipeline for detecting schizophrenia from audio recordings of verbal fluency tests, collected from 126 Slovene-speaking participants (68 healthy controls, 58 individuals diagnosed with schizophrenia), leveraging recent advancements in automatic speech recognition (ASR) and large language model (LLM) systems. |
Rok Rajher; Mila Marinković; Polona Rus Prelog; Jure Žabkar; | Scientific Reports | 2025-12-22 |
| 379 | X-Talk: On The Underestimated Potential of Modular Speech-to-Speech Dialogue System Highlight: We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. |
ZHANXUN LIU et. al. | arxiv-cs.SD | 2025-12-21 |
| 380 | Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition Highlight: This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. |
Zahra Rahmani; Hossein Sameti; | arxiv-cs.CL | 2025-12-19 |
| 381 | Peeking Into The Future For Contextual Biasing Highlight: In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. |
Ramaneswaran Selvakumar; Cindy Tseng; Eesung Kim; Vijendra Raj Apsingekar; Yun Tang; | arxiv-cs.CL | 2025-12-19 |
| 382 | When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems Highlight: We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems: OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, Parrotlet-a using 500 medical speech recordings under nine noise conditions. |
SUJAL CHONDHEKAR et. al. | arxiv-cs.SD | 2025-12-19 |
| 383 | Bridging The Reality Gap: Efficient Adaptation of ASR Systems for Challenging Low-Resource Domains Highlight: We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. |
DARSHIL CHAUHAN et. al. | arxiv-cs.CL | 2025-12-18 |
| 384 | Voice Assistant for Desktop Highlight: This paper presents the design and implementation of a Voice Assistant for Desktop, an intelligent system that enables users to interact with desktop computers using natural language voice commands. |
Geeta Patil; | International Journal for Research in Applied Science and … | 2025-12-18 |
| 385 | Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation Highlight: This challenge is amplified for modern Large Language Model (LLM)-based ASR systems, whose massive scale and complex training dynamics make effective fine-tuning non-trivial. To address this gap, this paper proposes a principled and metric-driven fine-tuning framework for adapting both traditional and LLM-based ASR models to specialized domains. |
XUANFAN NI et. al. | arxiv-cs.SD | 2025-12-17 |
| 386 | Adapting Speech Language Model to Singing Voice Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we adapt a 1.7B parameter TTS pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. |
YIWEN ZHAO et. al. | arxiv-cs.SD | 2025-12-16 |
| 387 | Reproducing and Dissecting Denoising Language Models for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses, which consistently outperforms the previously proposed DSR decoding method. |
Dorian Koch; Albert Zeyer; Nick Rossenbach; Ralf Schlüter; Hermann Ney; | arxiv-cs.NE | 2025-12-15 |
| 388 | Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between … |
Mohammad Jalili Torkamani; Israt Zarin; | ArXiv | 2025-12-14 |
| 389 | Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models (ASTA) Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. |
Mohammad Jalili Torkamani; Israt Zarin; | arxiv-cs.SD | 2025-12-14 |
| 390 | System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present the design, implementation, and in-situ deployment of a smartphone-based voice-enabled AI system for generating electronic medical records (EMRs) and clinical risk alerts in maternal healthcare settings. |
MARYAM MUSTAFA et. al. | arxiv-cs.HC | 2025-12-13 |
| 391 | Leveraging 5G Networks for Event Video Summarization Via Multimodal Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents an end-to-end framework that leverages 5G connectivity, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), and Large Language Models (LLMs) to generate concise, context-aware event summaries in the form of video highlights and textual overviews, eliminating the need to watch full videos. |
ALEXANDROS VROCHIDIS et. al. | Discover Artificial Intelligence | 2025-12-12 |
| 392 | TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. |
Elroy Galbraith; Chadwick Sutherland; Donahue Morgan; | arxiv-cs.CL | 2025-12-11 |
| 393 | Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. |
Srihari Bandarupalli; Bhavana Akkiraju; Charan Devarakonda; Vamsiraghusimha Narsinga; Anil Kumar Vuppala; | arxiv-cs.CL | 2025-12-08 |
| 394 | The Development and Experimental Evaluation of A Multilingual Speech Corpus for Low-Resource Turkic Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This article presents the development and experimental evaluation of a speech corpus focused on Turkic languages, intended for use in speech synthesis and automatic translation tasks. |
AIDANA KARIBAYEVA et. al. | Applied Sciences | 2025-12-05 |
| 395 | Balancing Specialization and Generalization Trade-Off for Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In some scenarios, we are interested in optimizing performance for the target domain (specialization) while preserving the general capabilities of the pretrained model. In this work, we study this effect for various finetuning strategies that aim to preserve pretrained model capabilities. |
Sebastian Cygert; Piotr Despot-Mładanowicz; Andrzej Czyżewski; | Electronics | 2025-12-05 |
| 396 | Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This research presents a novel approach to enhancing automatic speech recognition systems by integrating noise detection capabilities directly into the recognition architecture. |
Karamvir Singh; | arxiv-cs.SD | 2025-12-02 |
| 397 | Spoken Conversational Agents with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. |
Chao-Han Huck Yang; Andreas Stolcke; Larry Heck; | arxiv-cs.CL | 2025-12-02 |
| 398 | Swivuriso: The South African Next Voices Multilingual Speech Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. |
VUKOSI MARIVATEE et. al. | arxiv-cs.CL | 2025-12-01 |
| 399 | ASR Under The Stethoscope: Evaluating Biases in Clinical Speech Recognition Across Indian Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we conduct the first systematic audit of ASR performance on real world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech to text, Gemma3n, Omnilingual, Vaani, and Gemini. |
SUBHAM KUMAR et. al. | arxiv-cs.CL | 2025-11-30 |
| 400 | Benchmarking Automatic Speech Recognition Models for African Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. |
ALVIN NAHABWE et. al. | arxiv-cs.CL | 2025-11-30 |
| 401 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models Without Back-Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients via forward passes. |
YUEZHANG PENG et. al. | arxiv-cs.MM | 2025-11-30 |
| 402 | Deep Learning Techniques for Hindi Automatic Speech Recognition: A Comprehensive Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study examines multiple models on publicly available speech datasets to evaluate their performance for practical implementation. |
Hetal Gaudani; Narendra M Patel; | International Journal of Latest Technology in Engineering … | 2025-11-28 |
| 403 | Scaling HuBERT for African Languages: From Base to Large and XL Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets. |
Antoine Caubrière; Elodie Gauthier; | arxiv-cs.CL | 2025-11-28 |
| 404 | Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on The Loquacious Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. |
NICK ROSSENBACH et. al. | arxiv-cs.CL | 2025-11-27 |
| 405 | ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers Using Phonetic Features Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. |
Ye Bhone Lin; Thura Aung; Ye Kyaw Thu; Thazin Myint Oo; | arxiv-cs.CL | 2025-11-26 |
| 406 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. |
JIONGHAO HAN et. al. | arxiv-cs.SD | 2025-11-25 |
| 407 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. |
Bashar Talafha; Amin Abu Alhassan; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2025-11-24 |
| 408 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce AIRHILT (Aviation Integrated Reasoning, Human-in-the-Loop Testbed), a modular and lightweight simulation environment designed to evaluate multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. |
Omar Garib; Jayaprakash D. Kambhampaty; Olivia J. Pinon Fischer; Dimitri N. Mavris; | arxiv-cs.RO | 2025-11-23 |
| 409 | A Multimodal Conversational Agent for Tabular Data Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. |
Mohammad Nour Al Awad; Sergey Ivanov; Olga Tikhonova; Ivan Khodnenko; | arxiv-cs.AI | 2025-11-23 |
| 410 | Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a reproducible pipeline to transform public Zoom recordings into speaker-attributed transcripts with metadata like persona profiles and pragmatic action tags (e.g., [propose_motion]). |
Scott Merrill; Shashank Srivastava; | arxiv-cs.CL | 2025-11-21 |
| 411 | WER Is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. |
ZACHARY ELLIS et. al. | arxiv-cs.CL | 2025-11-20 |
| 412 | Smart Voice Assistant Using Machine Learning and Deep Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This project focuses on understanding the architecture, working principles, applications, and challenges of Smart Voice Assistants. |
Gorelal Verma; Deepesh Dewangan; | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING … | 2025-11-19 |
| 413 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. |
Gabrial Zencha Ashungafac; Mardhiyah Sanni; Busayo Awobade; Alex Gichamba; Tobi Olatunji; | arxiv-cs.CL | 2025-11-18 |
| 414 | Toward Conversational Hungarian Speech Recognition: Introducing The BEA-Large and BEA-Dialogue Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages. |
MÁTÉ GEDEON et. al. | arxiv-cs.CL | 2025-11-17 |
| 415 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. |
Zaara Zabeen Arpa; Sadnam Sakib Apurbo; Nazia Karim Khan Oishee; Ajwad Abrar; | arxiv-cs.CL | 2025-11-17 |
| 416 | ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes Elyadata & LIA’s joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. |
Haroun Elleuch; Youssef Saidi; Salima Mdhaffar; Yannick Estève; Fethi Bougares; | arxiv-cs.CL | 2025-11-13 |
| 417 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most–all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. |
OMNILINGUAL ASR TEAM et. al. | arxiv-cs.CL | 2025-11-12 |
| 418 | H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. |
HUANGYU DAI et. al. | cikm | 2025-11-10 |
| 419 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. |
ZHISHENG ZHANG et. al. | arxiv-cs.SD | 2025-11-10 |
| 420 | CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. |
HUNG-YANG SUNG et. al. | arxiv-cs.CL | 2025-11-10 |
| 421 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. |
YAN GAO et. al. | arxiv-cs.CL | 2025-11-09 |
| 422 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. |
ZHISHENG ZHANG et. al. | nips | 2025-11-07 |
| 423 | Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers, and Gradient Clipping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this gap, we establish **the first benchmark for FL with DP** in end-to-end ASR. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. |
MARTIN PELIKAN et. al. | nips | 2025-11-07 |
| 424 | Objective Soups: Multilingual Multi-Task Modeling for Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This leads to an important question: Is it more effective to separate highly conflicting objectives into different optimization levels or to keep them in a single level? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as **objective soup recipes**. |
A F M SAIF et. al. | nips | 2025-11-07 |
| 425 | MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. |
UMBERTO CAPPELLAZZO et. al. | nips | 2025-11-07 |
| 426 | BlockDecoder: Boosting ASR Decoders with Context and Merger Modules Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We observe a systematic pattern across the attention distributions of decoder layers in prior architectures: the initial layers direct most attention towards building textual context, while the later layers largely focus on merging acoustic and textual information for the final predictions. Leveraging this key insight, we propose **BlockDecoder**, a novel decoder architecture comprising two distinct components: a text encoder that is purely text-based, and a **Merger** that combines information from the audio encoder and text encoder to generate output tokens. |
Darshan Prabhu; Preethi Jyothi; | nips | 2025-11-07 |
| 427 | VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. |
ZUWEI LONG et. al. | nips | 2025-11-07 |
| 428 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. |
DAZHONG CHEN et. al. | arxiv-cs.CL | 2025-11-06 |
| 429 | WST: Weakly Supervised Transducer for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. |
DONGJI GAO et. al. | arxiv-cs.CL | 2025-11-05 |
| 430 | Intelligent Navigation Assistant for Campuses Using Speech Recognition, NLP And A* Algorithm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Campus navigation often poses challenges for students, staff, and visitors, especially in large complexes. This research presents an intelligent navigation assistant that combines automatic speech recognition (ASR), Natural Language Processing (NLP), and the A* pathfinding algorithm to tackle these issues through voice interaction. |
Kirti A. Satpute; Mrunal Wakadkar; Shruti Nimbalkar; Sneha Shelar; Prof. Kalpana Sonval; | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING … | 2025-11-04 |
| 431 | Energy-Efficient Hardware Acceleration of Whisper ASR on A CGLA Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper’s core computational kernel on the IMAX, a general-purpose Coarse-Grained Linear Arrays (CGLAs) accelerator. |
Takuto Ando; Yu Eto; Ayumu Takeuchi; Yasuhiko Nakashima; | arxiv-cs.AR | 2025-11-04 |
| 432 | Transcription Accuracy of Automatic Speech Recognition for Orthodontic Clinical Records Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The aim of this study was to investigate the transcriptional accuracy of ASR systems using orthodontic clinical records as the experimental model. |
R. O’KANE et. al. | Journal of Dental Research | 2025-11-03 |
| 433 | Visual-Aware Speech Recognition for Noisy Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. |
Balaji Darur; Karan Singla; | emnlp | 2025-11-02 |
| 434 | CoVoGER: A Multilingual Multitask Benchmark for Speech-to-text Generative Error Correction with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce CoVoGER, a benchmark for GER that covers both ASR and speech-to-text translation (ST) across 15 languages and 28 language pairs. |
Zhengdong Yang; Zhen Wan; Sheng Li; Chao-Han Huck Yang; Chenhui Chu; | emnlp | 2025-11-02 |
| 435 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We apply LLMs to ASR error correction in three paradigms. |
Victor Junqiu Wei; Weicheng Wang; Di Jiang; Yuanfeng Song; Lu Wang; | emnlp | 2025-11-02 |
| 436 | Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To investigate this, we compare four strategies: (a) *normative* models trained on typical speech (no personalization), (b) *idiosyncratic* models completely personalized to individuals, (c) *dysarthric-normative* models trained on other dysarthric speakers, and (d) *dysarthric-idiosyncratic* models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than the idiosyncratic approach while requiring less than half as much personalized data (36. |
Vishnu Raja; Adithya V Ganesan; Anand Syamkumar; Ritwik Banerjee; H. Schwartz; | emnlp | 2025-11-02 |
| 437 | Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. |
WEIQIAO SHAN et. al. | emnlp | 2025-11-02 |
| 438 | In-Context Learning Boosts Speech Recognition Via Human-like Adaptation to Speakers and Language Varieties Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal (Phi-4-MM) using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19. |
NATHAN ROLL et. al. | emnlp | 2025-11-02 |
| 439 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. |
Keisuke Kamahori; Jungo Kasai; Noriyuki Kojima; Baris Kasikci; | emnlp | 2025-11-02 |
| 440 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. |
Tianduo Wang; Lu Xu; Wei Lu; Shanbo Cheng; | emnlp | 2025-11-02 |
| 441 | Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. |
Linyang He; Qiaolin Wang; Xilin Jiang; Nima Mesgarani; | emnlp | 2025-11-02 |
| 442 | Generative Annotation for ASR Named Entity Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. |
YUANCHANG LUO et. al. | emnlp | 2025-11-02 |
| 443 | Dynamic Model-Bank Test-Time Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. |
YANSHUO WANG et. al. | emnlp | 2025-11-02 |
| 444 | Beyond WER: Probing Whisper’s Sub‐token Decoder Across Diverse Language Resource Levels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. |
Siyu Liang; Nicolas Ballier; Gina-Anne Levow; Richard Wright; | emnlp | 2025-11-02 |
| 445 | Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In a first step, we create a benchmark for multi-modal presentation including an automatic analysis of transcribing domain-specific terminology. |
Supriti Sinhamahapatra; Jan Niehues; | emnlp | 2025-11-02 |
| 446 | Towards Language-Agnostic STIPA: Universal Phonetic Transcription to Support Language Documentation at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a new dataset for South Levantine Arabic and present the first large-scale evaluation of STIPA models across 51 language families. |
Jacob Lee Suchardt; Hana El-Shazli; Pierluigi Cassotti; | emnlp | 2025-11-02 |
| 447 | Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions. |
SHREYAS GOPAL et. al. | arxiv-cs.CL | 2025-10-29 |
| 448 | BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder using unlabeled data. |
Raphaël Bagat; Irina Illina; Emmanuel Vincent; | arxiv-cs.CL | 2025-10-28 |
| 449 | RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. … |
MD. REZUWAN HASSAN et. al. | arxiv-cs.CL | 2025-10-28 |
| 450 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. |
CHIN-JOU LI et. al. | arxiv-cs.CL | 2025-10-28 |
| 451 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. |
INCLUSION AI et. al. | arxiv-cs.CV | 2025-10-28 |
| 452 | Arabic Little STT: Arabic Children Speech Recognition Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our created dataset, Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 – 13). |
Mouhand Alkadri; Dania Desouki; Khloud Al Jallad; | arxiv-cs.CL | 2025-10-27 |
| 453 | MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. |
Yongqi Shao; Bingxin Mei; Cong Tan; Hong Huo; Tao Fang; | mm | 2025-10-27 |
| 454 | Are ASR Foundation Models Generalized Enough to Capture Features of Regional Dialects for Low-resource Languages? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. |
TAWSIF TASHWAR DIPTO et. al. | arxiv-cs.CL | 2025-10-27 |
| 455 | LRW-Persian: Lip-reading in The Wild Dataset for Persian Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising 743 target words and over 414,000 video samples extracted from more than 1,900 hours of footage across 67 television programs. |
Zahra Taghizadeh; Mohammad Shahverdikondori; Arian Noori; Alireza Dadgarnia; | arxiv-cs.CV | 2025-10-26 |
| 456 | M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Multi-scale CIF (M-CIF), which performs multi-level alignment by integrating character and phoneme level supervision progressively distilled into subword representations, thereby enhancing robust acoustic-text alignment. |
RUIXIANG MAO et. al. | arxiv-cs.SD | 2025-10-25 |
| 457 | A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using The Pacific Northwest English Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. |
Michael Scott; Siyu Liang; Alicia Wassink; Gina-Anne Levow; | arxiv-cs.CL | 2025-10-25 |
| 458 | The Tonogenesis Continuum in Tibetan: A Computational Investigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Tonogenesis, the historical process by which segmental contrasts evolve into lexical tone, has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. |
Siyu Liang; Zhaxi Zerong; | arxiv-cs.CL | 2025-10-25 |
| 459 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. |
Xin Zhang; Lin Li; Xiangni Lu; Jianquan Liu; Kong Aik Lee; | arxiv-cs.SD | 2025-10-23 |
| 460 | HIPA-MoE: A Parameter-Efficient Fine-Tuning Architecture with Hierarchical Adapter-Based Mixture-Of-Experts for Multilingual ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Multilingual automatic speech recognition (MASR) has advanced significantly with self-supervised pretraining (SSL). However, conventional fine-tuning remains constrained by data … |
Xun Lu; Xuyang Wang; Gaofeng Cheng; Lin Zheng; Pengyuan Zhang; | 2025 Asia Pacific Signal and Information Processing … | 2025-10-22 |
| 461 | EvolveCaptions: Real-Time Collaborative ASR Adaptation for DHH Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Current ASR systems struggle to reliably recognize the speech of Deaf and Hard of Hearing (DHH) individuals, particularly in real-time communication. Existing personalization … |
Liang-Yuan Wu; Dhruv Jain; | Proceedings of the 27th International ACM SIGACCESS … | 2025-10-22 |
| 462 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. |
Yuu Jinnai; | arxiv-cs.CL | 2025-10-22 |
| 463 | Chain-of-Thought Distillation for ASR Error Correction with Multimodal Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Large language models (LLMs) have demonstrated strong performance in automatic speech recognition (ASR) error correction. However, current approaches rely heavily on annotated … |
SHAOMENG YANG et. al. | 2025 Asia Pacific Signal and Information Processing … | 2025-10-22 |
| 464 | CARTGPT: Real-Time Correction of CART Captions Using Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Communication Access Realtime Translation (CART) is a widely used captioning technology among deaf and hard of hearing (DHH) individuals, valued for its high accuracy and ability … |
Liang-Yuan Wu; Andrea Kleiver; Dhruv Jain; | Proceedings of the 27th International ACM SIGACCESS … | 2025-10-22 |
| 465 | MLMA: Towards Multilingual ASR With Mamba-based Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture, an efficient state-space model optimized for long-context sequence processing, for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. |
Mohamed Nabih Ali; Daniele Falavigna; Alessio Brutti; | arxiv-cs.CL | 2025-10-21 |
| 466 | VALLR: Visual ASR Language Model for Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. |
Marshall Thomas; Edward Fish; Richard Bowden; | iccv | 2025-10-20 |
| 467 | Hallucination Benchmark for Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. |
Alkis Koudounas; Moreno La Quatra; Manuel Giollo; Sabato Marco Siniscalchi; Elena Baralis; | arxiv-cs.CL | 2025-10-18 |
| 468 | Probing The Hidden Talent of ASR Foundation Models for L2 English Oral Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). |
Fu-An Chao; Bi-Cheng Yan; Berlin Chen; | arxiv-cs.CL | 2025-10-18 |
| 469 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis Via RLAIF Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. |
QING YANG et. al. | arxiv-cs.CL | 2025-10-16 |
| 470 | Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a governance-centered ASR lifecycle as an emergent interdisciplinary framework for responsible ASR development and offer implications for researchers, practitioners, and policymakers seeking to address language marginalization in speech AI systems. |
JAY L. CUNNINGHAM et. al. | Proceedings of the AAAI/ACM Conference on AI, Ethics, and … | 2025-10-15 |
| 471 | End-to-end Speech Recognition with Similar Length Speech and Text Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on speech recognition in those cases where the length of speech aligns closely with that of the corresponding text. |
Peng Fan; Wenping Wang; Fei Deng; | arxiv-cs.CL | 2025-10-12 |
| 472 | Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. |
Ling Sun; Charlotte Zhu; Shuju Shi; | arxiv-cs.SD | 2025-10-12 |
| 473 | End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. |
Nam Luu; Ondřej Bojar; | arxiv-cs.CL | 2025-10-11 |
| 474 | Voice-Enabled Local Language Translator Using Generative AI Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a voice-enabled local language translator powered by Generative AI, integrating Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), generative translation, and Text-to-Speech (TTS) synthesis for accurate, context-aware output. |
Dr. V. Shanmugapriya; J N Pravanthika; | International Scientific Journal of Engineering and … | 2025-10-10 |
| 475 | Accent-Invariant Automatic Speech Recognition Via Saliency-Driven Spectrogram Masking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. |
Mohammad Hossein Sameti; Sepehr Harfi Moridani; Ali Zarean; Hossein Sameti; | arxiv-cs.CL | 2025-10-10 |
| 476 | Speech Recognition and Synthesis Models and Platforms for The Kazakh Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. |
Aidana Karibayeva; Vladislav Karyukin; Balzhan Abduali; Dina Amirova; | Information | 2025-10-10 |
| 477 | Serial-Parallel Dual-Path Architecture for Speaking Style Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. |
GUOJIAN LI et. al. | arxiv-cs.SD | 2025-10-09 |
| 478 | Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. |
VAIBHAV SRIVASTAV et. al. | arxiv-cs.CL | 2025-10-08 |
| 479 | How Much Speech Data Is Necessary for ASR in African Languages? An Evaluation of Data Scaling in Kinyarwanda and Kikuyu Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate Whisper’s performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. |
BENJAMIN AKERA et. al. | arxiv-cs.CL | 2025-10-08 |
| 480 | Linguistically Informed Tokenization Improves ASR for Underresourced Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. |
Massimo Daul; Alessio Tosolini; Claire Bowern; | arxiv-cs.CL | 2025-10-07 |
| 481 | BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. |
Jakir Hasan; Shubhashis Roy Dipta; | arxiv-cs.CL | 2025-10-07 |
| 482 | How I Built ASR for Endangered Languages with A Spoken Dictionary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. |
Christopher Bartley; Anton Ragni; | arxiv-cs.CL | 2025-10-06 |
| 483 | Phonetic Analysis of Real and Synthetic Speech Using HuBERT Embeddings: Perspectives for Deepfake Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The growing sophistication of speech generated by Artificial Intelligence (AI) has introduced new challenges in audio deepfake detection. Text-to-speech (TTS) and voice conversion … |
DIA ELHAK TEMMAR et. al. | 2025 IEEE International Conference on Systems, Man, and … | 2025-10-05 |
| 484 | Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling Than CoT Prompting? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. |
Oriol Pareras; Gerard I. Gállego; Federico Costa; Cristina España-Bonet; Javier Hernando; | arxiv-cs.CL | 2025-10-03 |
| 485 | Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. |
JACOBO ROMERO-DÍAZ et. al. | arxiv-cs.CL | 2025-10-03 |
| 486 | EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present EvolveCaptions, a real-time, collaborative ASR adaptation system that supports in-situ personalization with minimal effort. Hearing participants correct ASR errors during live conversations. |
Liang-Yuan Wu; Dhruv Jain; | arxiv-cs.HC | 2025-10-02 |
| 487 | Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: ASR has achieved remarkable global progress, yet African low-resource languages remain rigorously underrepresented, producing barriers to digital inclusion across the continent with more than +2000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. |
SUKAIRAJ HAFIZ IMAM et. al. | arxiv-cs.CL | 2025-10-01 |
| 488 | Backdoor Attacks Against Speech Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present the first systematic study of audio backdoor attacks against speech language models. |
Alexandrine Fortier; Thomas Thebaud; Jesús Villalba; Najim Dehak; Patrick Cardinal; | arxiv-cs.CL | 2025-10-01 |
| 489 | ASR Under Noise: Exploring Robustness for Sundanese and Javanese Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. |
Salsabila Zahirah Pranida; Muhammad Cendekia Airlangga; Rifo Ahmad Genadi; Shady Shehata; | arxiv-cs.CL | 2025-09-30 |
| 490 | Beyond WER: Probing Whisper’s Sub-token Decoder Across Diverse Language Resource Levels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. |
Siyu Liang; Nicolas Ballier; Gina-Anne Levow; Richard Wright; | arxiv-cs.CL | 2025-09-29 |
| 491 | Confidence-Guided Error Correction for Disordered Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate the use of large language models (LLMs) as post-processing modules for automatic speech recognition (ASR), focusing on their ability to perform error correction for disordered speech. |
Abner Hernandez; Tomás Arias Vergara; Andreas Maier; Paula Andrea Pérez-Toro; | arxiv-cs.CL | 2025-09-29 |
| 492 | HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. |
Gio Paik; Yongbeom Kim; Soungmin Lee; Sangmin Ahn; Chanwoo Kim; | arxiv-cs.CL | 2025-09-29 |
| 493 | Game-Oriented ASR Error Correction Via RAG-Enhanced LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, general ASR systems struggle with gaming-specific challenges like short phrases, rapid speech, jargon, and noise, leading to frequent errors. To address this, we propose the GO-AEC framework, which integrates large language models, Retrieval-Augmented Generation (RAG), and a data augmentation strategy using LLMs and TTS. |
Yan Jiang; Yongle Luo; Qixian Zhou; Elvis S. Liu; | arxiv-cs.AI | 2025-09-28 |
| 494 | MeanFlowSE: One-Step Generative Speech Enhancement Via MeanFlow Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative … |
YIKE ZHU et. al. | arxiv-cs.SD | 2025-09-27 |
| 495 | Align2Speak: Improving TTS for Low Resource Languages Via ASR-Guided Online Preference Optimization Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech … |
SHEHZEEN HUSSAIN et. al. | arxiv-cs.AI | 2025-09-25 |
| 496 | Lightweight Front-end Enhancement for Robust ASR Via Frame Resampling and Sub-Band Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes optimizations to reduce SE computational costs without compromising ASR performance. |
Siyi Zhao; Wei Wang; Yanmin Qian; | arxiv-cs.SD | 2025-09-25 |
| 497 | SloPalSpeech: A 2,8000-Hour Slovak Speech Corpus from Parliamentary Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. |
Erik Božík; Marek Šuppa; | arxiv-cs.CL | 2025-09-23 |
| 498 | M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this challenge, in our previous work, we introduced two auxiliary tasks, namely, ASR error detection and ASR error correction, and we proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. |
JIAJUN HE et. al. | arxiv-cs.HC | 2025-09-23 |
| 499 | LOTUSDIS: A Thai Far-field Meeting Corpus for Robust Conversational ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present LOTUSDIS, a publicly available Thai meeting corpus designed to advance far-field conversational ASR. |
Pattara Tipaksorn; Sumonmas Thatphithakkul; Vataya Chunwijitra; Kwanchiva Thangthai; | arxiv-cs.CL | 2025-09-23 |
| 500 | Explore The Reinforcement Learning for The LLM Based ASR and TTS System Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. |
CHANGFENG GAO et. al. | arxiv-cs.SD | 2025-09-22 |
| 501 | Conversational Orientation Reasoning: Egocentric-to-Allocentric Navigation with Multimodal Chain-of-Thought Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Conversational Orientation Reasoning (COR), a new benchmark designed for Traditional Chinese conversational navigation projected from real-world environments, addressing egocentric-to-allocentric reasoning in non-English and ASR-transcribed scenarios. We propose a multimodal chain-of-thought (MCoT) framework, which integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. A curriculum learning strategy progressively builds these capabilities on Taiwan-LLM-13B-v2.0-Chat, a mid-sized model representative of resource-constrained settings. |
Yu Ti Huang; | arxiv-cs.LG | 2025-09-20 |
| 502 | A Systematic Literature Review on Bias Evaluation and Mitigation in Automatic Speech Recognition Models for Low-Resource African Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: With recent advancements in speech recognition, it is crucial to ensure that automatic speech recognition (ASR) systems do not exhibit systematic biases, such as those related to … |
Joyce Nakatumba‐Nabende; Sulaiman Kagumire; Caroline Kantono; Peter Nabende; | ACM Computing Surveys | 2025-09-20 |
| 503 | Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. |
Vishnu Raja; Adithya V Ganesan; Anand Syamkumar; Ritwik Banerjee; H Andrew Schwartz; | arxiv-cs.SD | 2025-09-20 |
| 504 | MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. |
JIALONG MAI et. al. | arxiv-cs.SD | 2025-09-19 |
| 505 | Thinking in Cocktail Party: Chain-of-Thought and Reinforcement Learning for Target Speaker Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent … |
Yiru Zhang; Hang Su; Lichun Fan; Zhenbo Luo; Jian Luan; | arxiv-cs.SD | 2025-09-19 |
| 506 | HARNESS: Lightweight Distilled Arabic Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture Arabic speech nuances. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2025-09-18 |
| 507 | Impact of Phonetics on Speaker Identity in Adversarial Voice Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. |
Daniyal Kabir Dar; Qiben Yan; Li Xiao; Arun Ross; | arxiv-cs.SD | 2025-09-18 |
| 508 | Speech Language Models for Under-Represented Languages: Insights from Wolof Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. |
Yaya Sy; Dioula Doucouré; Christophe Cerisara; Irina Illina; | arxiv-cs.CL | 2025-09-18 |
| 509 | Frustratingly Easy Data Augmentation for Low-Resource ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). |
Katsumi Ibaraki; David Chiang; | arxiv-cs.CL | 2025-09-18 |
| 510 | Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). |
MONICA SEKOYAN et. al. | arxiv-cs.CL | 2025-09-17 |
| 511 | Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes our efforts to develop the methodology and tools for human-subject experiments with MCV. |
JAN JANAK et. al. | arxiv-cs.NI | 2025-09-17 |
| 512 | GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR Abstract: End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To … |
Yujie Guo; Jiaming Zhou; Yuhang Jia; Shiwan Zhao; Yong Qin; | arxiv-cs.SD | 2025-09-16 |
| 513 | PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition Highlight: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. |
LI FU et. al. | arxiv-cs.CL | 2025-09-16 |
| 514 | Fun-ASR Technical Report IF:3 Highlight: In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. |
KEYU AN et. al. | arxiv-cs.CL | 2025-09-15 |
| 515 | In-domain SSL Pre-training and Streaming ASR Highlight: In this study, we investigate the benefits of domain-specific self-supervised pre-training for both offline and streaming ASR in Air Traffic Control (ATC) environments. |
JAROD DURET et. al. | arxiv-cs.CL | 2025-09-15 |
| 516 | WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers Highlight: We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. |
Akshat Pandey; Karun Kumar; Raphael Tang; | arxiv-cs.CL | 2025-09-12 |
| 517 | Prominence-aware Automatic Speech Recognition for Conversational Speech Highlight: This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. |
Julian Linke; Barbara Schuppler; | arxiv-cs.CL | 2025-09-12 |
| 518 | Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR Highlight: In this paper, we propose a beamforming approach that processes specific angular sectors based on their spherical polar coordinates before applying an end-to-end multichannel, multi-speaker ASR system. |
Can Cui; Paul Magron; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.SD | 2025-09-12 |
| 519 | TextlessRAG: End-to-End Visual Document RAG By Speech Without Text Highlight: We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. |
PEIJIN XIE et. al. | arxiv-cs.CV | 2025-09-09 |
| 520 | Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions Abstract: Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally … |
Lusheng Zhang; Shie Wu; Zhongxun Wang; | Symmetry | 2025-09-08 |
| 521 | The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties Highlight: To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. |
WILLIAM CHEN et. al. | arxiv-cs.CL | 2025-09-08 |
| 522 | TSPC: A Two-Stage Phoneme-Centric Architecture for Code-switching Vietnamese-English Speech Recognition Highlight: In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). |
Minh N. H. Nguyen; Anh Nguyen Tran; Dung Truong Dinh; Nam Van Vo; | arxiv-cs.SD | 2025-09-07 |
| 523 | Enhancing The Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling Highlight: To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. |
YUE GU et. al. | arxiv-cs.CL | 2025-09-06 |
| 524 | Contextualized Token Discrimination for Speech Search Query Correction Highlight: With the growing popularity of speech search driven by Automated Speech Recognition (ASR) systems, this paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction. |
JUNYU LU et. al. | arxiv-cs.SD | 2025-09-04 |
| 525 | Denoising GER: A Noise-Robust Generative Error Correction with LLM for Speech Recognition Highlight: To address these issues, this paper proposes a noise-robust multi-modal GER framework (Denoising GER). |
YANYAN LIU et. al. | arxiv-cs.SD | 2025-09-04 |
| 526 | PARCO: Phoneme-Augmented Robust Contextual ASR Via Contrastive Entity Disambiguation Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture … |
Jiajun He; Naoki Sawada; Koichi Miyazaki; Tomoki Toda; | arxiv-cs.CL | 2025-09-04 |
| 527 | WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation IF:3 Highlight: To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpus with multi-dimensional annotation tailored for speech understanding and generation. |
LONGHAO LI et. al. | arxiv-cs.SD | 2025-09-04 |
| 528 | LatPhon: Lightweight Multilingual G2P for Romance Languages and English Abstract: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, … |
Luis Felipe Chary; Miguel Arjona Ramirez; | arxiv-cs.CL | 2025-09-03 |
| 529 | SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking Under Domain Shift in ASR Highlight: While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. |
Pu Wang; Shinji Watanabe; Hugo Van hamme; | arxiv-cs.CL | 2025-09-02 |
| 530 | CuckooAttack: Towards Practical Backdoor Attack Against Automatic Speech Recognition Systems Abstract: Deep learning-based automatic speech recognition (ASR) systems are capable of transcribing input audio of arbitrary duration into character sequences, which are widely used in … |
BOWEN LI et. al. | IEEE Transactions on Dependable and Secure Computing | 2025-09-01 |
| 531 | CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays Highlight: This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2025-09-01 |
| 532 | Refining Transcripts With TV Subtitles By Prompt-Based Weakly Supervised Training of ASR Highlight: This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. |
Xinnian Zhao; Hugo Van Hamme; | arxiv-cs.CL | 2025-09-01 |
| 533 | A Unified Denoising and Adaptation Framework for Self-Supervised Bengali Dialectal ASR Highlight: This challenge is compounded by two persistent and intertwined factors: the language’s vast dialectal diversity and the prevalence of acoustic noise in real-world environments. While state-of-the-art self-supervised learning (SSL) models have advanced ASR for low-resource languages, they often lack explicit mechanisms to handle environmental noise during pre-training or specialized adaptation strategies for the complex phonetic and lexical variations across Bengali dialects. This paper introduces a novel, unified framework designed to address these dual challenges simultaneously. |
Swadhin Biswas; Tuhin Sheikh; | arxiv-cs.SD | 2025-08-31 |
| 534 | Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition Highlight: In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). |
HAO SHI et. al. | arxiv-cs.CL | 2025-08-31 |
| 535 | Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio Highlight: In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. |
Jeong Hun Yeo; Hyeongseop Rha; Sungjune Park; Junil Won; Yong Man Ro; | arxiv-cs.CV | 2025-08-28 |
| 536 | MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening Abstract: Early screening for Alzheimer’s Disease (AD) through speech presents a promising non-invasive approach. However, challenges such as limited data and the lack of fine-grained, … |
Yongqi Shao; Binxin Mei; Cong Tan; Hong Huo; Tao Fang; | arxiv-cs.SD | 2025-08-28 |
| 537 | Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD Database Abstract: Dysarthric speech recognition faces challenges from severity variations and disparities relative to normal speech. Conventional approaches individually fine-tune ASR models … |
Qing Xiao; Yingshan Peng; PeiPei Zhang; | arxiv-cs.SD | 2025-08-26 |
| 538 | Zero-shot Context Biasing with Trie-based Decoding Using Synthetic Multi-Pronunciation Highlight: In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. |
Changsong Liu; Yizhou Peng; Eng Siong Chng; | arxiv-cs.CL | 2025-08-25 |
| 539 | Talking to Robots: A Practical Examination of Speech Foundation Models for HRI Applications Highlight: We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. |
Theresa Pekarek Rosin; Julia Gachot; Henri-Leon Kordt; Matthias Kerzel; Stefan Wermter; | arxiv-cs.RO | 2025-08-25 |
| 540 | Whisper Based Cross-Lingual Phoneme Recognition Between Vietnamese and English Highlight: Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. |
Nguyen Huu Nhat Minh; Tran Nguyen Anh; Truong Dinh Dung; Vo Van Nam; Le Pham Tuyen; | arxiv-cs.CL | 2025-08-22 |
| 541 | Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet Highlight: We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. |
ANYU YING et. al. | arxiv-cs.LG | 2025-08-22 |
| 542 | UniCoM: A Universal Code-Switching Speech Generator Highlight: However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. |
Sangmin Lee; Woojin Chung; Seyun Um; Hong-Goo Kang; | arxiv-cs.CL | 2025-08-21 |
| 543 | Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using A Speech Error Database Highlight: This analysis demonstrates the database’s effectiveness as a diagnostic tool for ASR system performance. |
John Alderete; Macarious Kin Fung Hui; Aanchan Mohan; | arxiv-cs.CL | 2025-08-18 |
| 544 | Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT Highlight: Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain system’s real-time factor. |
ZEESHAN AHMED et. al. | arxiv-cs.CL | 2025-08-18 |
| 545 | Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts Highlight: We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. |
Duygu Altinok; | arxiv-cs.CL | 2025-08-18 |
| 546 | PPGs-BERT: Leveraging Phoneme Sequence and BERT for Alzheimer’s Disease Detection from Spontaneous Speech Abstract: Alzheimer’s Disease (AD) is a neurodegenerative condition characterized by linguistic impairments. While ASR and LLMs show promise in AD detection, ASR often normalizes key … |
QI SUN et. al. | Interspeech 2025 | 2025-08-17 |
| 547 | Word-Level Error Analysis in Decoding Systems: From Speech Recognition to Brain-Computer Interfaces Abstract: Brain-to-text (BTT) systems that decode attempted speech from neural activity have achieved 4.2% word error rate (WER). These systems demonstrate potential for daily use similar … |
Jingya Huang; Aashish N. Patel; Sowmya Manojna Narasimha; Gal Mishne; Vikash Gilja; | Interspeech 2025 | 2025-08-17 |
| 548 | Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech Recognition Abstract: In recent years, the increasing demand for parameter-efficient automatic speech recognition (ASR) systems has driven researchers to explore innovative architectures and techniques … |
Thomas Rolland; Alberto Abad; | Interspeech 2025 | 2025-08-17 |
| 549 | Assessing The Performance and Efficiency of Mamba ASR in Low-Resource Scenarios Abstract: Mamba, a state space model-based architecture, is emerging as a strong alternative to Transformer models, showing equal or superior performance in sequence generation, including … |
RODOLFO ZEVALLOS et. al. | Interspeech 2025 | 2025-08-17 |
| 550 | Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments Abstract: We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS we synthesized a corpus of 1,200 speech samples from 100 sentences … |
Lubos Marcinek; J. Beskow; Joakim Gustafson; | Interspeech 2025 | 2025-08-17 |
| 551 | Synthetic Dysarthric Speech: A Supplement, Not A Substitute for Authentic Data in Dysarthric Speech Recognition Abstract: Dysarthric speech recognition (DSR) is an emerging field that can enhance social interactions and mental health for individuals with dysarthria. However, the lack of sufficient … |
Jingting Li; Keyi Feng; Xinran Zhao; Yan Wang; Su-Jing Wang; | Interspeech 2025 | 2025-08-17 |
| 552 | Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation Abstract: Recent developments in large-scale speech foundation models have further pushed the boundaries of automatic speech recognition (ASR) capabilities, making them excellent candidates … |
ZHAO YANG et. al. | Interspeech 2025 | 2025-08-17 |
| 553 | ASR Confidence Estimation Using True Class Lexical Similarity Score Abstract: Deep Neural Networks (DNN) often exhibit overconfidence, leading to poor confidence calibration in Automatic Speech Recognition (ASR) models. State-Of-The-Art (SOTA) approaches … |
Nagarathna Ravi; Thishyan Raj T; Ravi Teja Chaganti; Vipul Arora; | Interspeech 2025 | 2025-08-17 |
| 554 | Towards Inclusive and Fair ASR: Insights from The SAPC Challenge for Optimizing Disordered Speech Recognition Abstract: ASR has advanced significantly, yet remains limited for impaired speakers due to data scarcity. In response to this gap, the Speech Accessibility Project (SAP) represents a … |
Nada Gohider; O. Basir; | Interspeech 2025 | 2025-08-17 |
| 555 | Switch Conformer with Universal Phonetic Experts for Multilingual ASR Abstract: Multilingual end-to-end ASR presents significant challenges due to the need to accommodate diverse writing systems, lexicons, and grammatical structures. Existing methods often … |
M. Mimura; Jaeyoung Lee; Tatsuya Kawahara; | Interspeech 2025 | 2025-08-17 |
| 556 | CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech Recognition Abstract: Whisper is a powerful automatic speech recognition (ASR) model. However, its zero-shot performance on low-resource speech requires further improvement, especially in dysarthric … |
TIANYI TAN et. al. | Interspeech 2025 | 2025-08-17 |
| 557 | GLCLAP: A Novel Contrastive Learning Pre-trained Model for Contextual Biasing in ASR Abstract: Recently, Automatic Speech Recognition (ASR) that supports prompts has shown remarkable versatility. For contextual biasing with these systems, a pivotal factor lies in obtaining … |
YUXIANG KONG et. al. | Interspeech 2025 | 2025-08-17 |
| 558 | End-to-End Speech Translation Guided By Robust Translation Capability of Large Language Model Abstract: We present an end-to-end speech translation (ST) model that uses a large language model (LLM) to guide the translation process. Recent advances in LLMs have shown strong … |
Yosuke Higuchi; Tetsuji Ogawa; Tetsunori Kobayashi; | Interspeech 2025 | 2025-08-17 |
| 559 | J-j-j-just Stutter: Benchmarking Whisper’s Performance Disparities on Different Stuttering Patterns Abstract: Despite their prevalence in everyday technologies, automated speech recognition (ASR) systems often struggle with disfluent speech. To diagnose and address these technical … |
Charan Sridhar; Shaomei Wu; | Interspeech 2025 | 2025-08-17 |
| 560 | Building An Accurate Open-Source Hebrew ASR System Through Crowdsourcing Abstract: Automatic Speech Recognition (ASR) for Hebrew faces significant challenges due to limited resources and rich morphology. While recent advances have improved high-resource … |
Y. Marmor; Y. Lifshitz; Y. Snapir; Kinneret Misgav; | Interspeech 2025 | 2025-08-17 |
| 561 | Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription Uncertainty Abstract: Efficiently selecting training data is crucial for improving automatic speech recognition (ASR) models while minimizing annotation costs. This research extends TypiClust, a data … |
Komei Hiruta; Yosuke Yamano; Hideaki Tamori; | Interspeech 2025 | 2025-08-17 |
| 562 | Is Synthetic Data Truly Effective for Training Speech Language Models? Abstract: The development of Large Language Models (LLMs) has expanded beyond text-based tasks to speech applications such as Automatic Speech Recognition (ASR) and Automated Speech … |
Tomoya Mizumoto; Atsushi Kojima; Yusuke Fujita; Lianbo Liu; Yui Sudo; | Interspeech 2025 | 2025-08-17 |
| 563 | Scaling Pseudo-labeling Data for End-to-end Low-resource Speech Translation (the Case of Kurdish Language) Abstract: In this paper we propose a pseudo-labeling pipeline to generate End-to-End Speech to Text Translation (E2E S2TT) data for low-resource languages. This pipeline allows us to … |
Mohammad MohammadAmini; Aghilas Sini; Marie Tahon; Antoine Laurent; | Interspeech 2025 | 2025-08-17 |
| 564 | Why Is Children’s ASR So Difficult? Analyzing Children’s Phonological Error Patterns Using SSL-based Phoneme Recognizers Abstract: Children’s automatic speech recognition (child ASR) is generally more challenging than adult ASR, as children’s speech differs from that of adults and continuously evolves with a … |
Koharu Horii; Naohiro Tawara; Atsunori Ogawa; Shoko Araki; | Interspeech 2025 | 2025-08-17 |
| 565 | Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech Recognition Abstract: Automatic speech recognition (ASR) for pathologic speech remains a major challenge due to high variability in articulation, phonation, and prosody distortions. In this work, we … |
Ilja Baumann; Dominik Wagner; K. Riedhammer; Tobias Bocklet; | Interspeech 2025 | 2025-08-17 |
| 566 | Beyond Traditional Speech Modifications: Utilizing Self Supervised Features for Enhanced Zero-Shot Children ASR Abstract: Zero-shot automatic speech recognition (ASR) for children is challenging due to pronounced acoustic and linguistic mismatches, speaker variability and limited annotated data. … |
Abhijit Sinha; H. Kathania; Mikko Kurimo; | Interspeech 2025 | 2025-08-17 |
| 567 | Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR Without Accuracy Loss Abstract: In this paper, we introduce two methods, Simultaneous Masked and Unmasked Decoding (SMUD) and speculative decoding masking, into Partially autoregressive (PAR) decoding. These … |
K. Okabe; Hitoshi Yamamoto; | Interspeech 2025 | 2025-08-17 |
| 568 | Enhancing Target-speaker Automatic Speech Recognition Using Multiple Speaker Embedding Extractors with Virtual Speaker Embedding Abstract: Target-speaker automatic speech recognition (TS-ASR) utilizes speaker embeddings to identify a target speaker in multi-talker environments. While high-performance speaker … |
Ju-Seok Seong; Jeonghwan Choi; Ye-Rin Jeoung; Ilseok Kim; Joon-Hyuk Chang; | Interspeech 2025 | 2025-08-17 |
| 569 | Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum Disorder Abstract: Individuals with autism spectrum disorder (ASD) exhibit unique speech patterns that challenge conventional automatic speech recognition (ASR) systems. However, research on … |
Yeseul Park; Bowon Lee; | Interspeech 2025 | 2025-08-17 |
| 570 | Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces Abstract: This paper introduces a novel encoder architecture designed to enhance transducer-based dual-mode automatic speech recognition (ASR). Our approach leverages the selective … |
Takafumi Moriya; Masato Mimura; Kiyoaki Matsui; Hiroshi Sato; Kohei Matsuura; | Interspeech 2025 | 2025-08-17 |
| 571 | Exploring The Limits of Conformer CTC-Encoder for Speech Emotion Recognition Using Large Language Models Abstract: Conformer CTC-Encoders have consistently delivered state-of-the-art results in the field of Automatic Speech Recognition (ASR); however, their merits for tasks that demand more … |
E. MORAIS et. al. | Interspeech 2025 | 2025-08-17 |
| 572 | Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition Abstract: Powered by self-supervised learning (SSL) on vast amounts of unlabeled data, a computationally intensive audiovisual encoder—a hybrid architecture combining ResNet and … |
Zhengyang Li; Pascal Reichert; Thomas Graave; Patrick Blumenberg; Tim Fingscheidt; | Interspeech 2025 | 2025-08-17 |
| 573 | Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in The Speech Accessibility Project Challenge Abstract: We present our dysarthric speech recognition system submitted to the Interspeech 2025 Speech Accessibility Project Challenge. This challenge is a competition aimed at improving … |
Kaito Takahashi; Keigo Hojo; Toshimitsu Sakai; Yukoh Wakabayashi; N. Kitaoka; | Interspeech 2025 | 2025-08-17 |
| 574 | Towards Atypical Speech Transcription Using LLM-based ASR Abstract: Training a linear transformation between speech encoders and LLMs enables LLMs to transcribe speech. SLAM-ASR is one such recently proposed architecture. This paper examines its … |
Jinda Zhang; Aanchan Mohan; | Interspeech 2025 | 2025-08-17 |
| 575 | Leveraging Geographic Metadata for Dialect-Aware Speech Recognition Abstract: Automatic Speech Recognition (ASR) systems struggle with dialectal variations due to significant phonetic, morphosyntactic, and lexical differences across regions. Traditional … |
Pouya Mehralian; Hugo Van hamme; | Interspeech 2025 | 2025-08-17 |
| 576 | SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation Highlight: While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. |
Zhaoxi Mu; Xinyu Yang; Gang Wang; | ijcai | 2025-08-16 |
| 577 | Investigating Transcription Normalization in The Faetar ASR Benchmark Highlight: We examine the role of transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. |
Leo Peckham; Michael Ong; Naomi Nagy; Ewan Dunbar; | arxiv-cs.CL | 2025-08-15 |
| 578 | Analysis of Domain Shift Across ASR Architectures Via TTS-Enabled Separation of Target Domain and Acoustic Conditions Highlight: We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. |
Tina Raissi; Nick Rossenbach; Ralf Schlüter; | arxiv-cs.SD | 2025-08-13 |
| 579 | Assessing The Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription Highlight: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. |
Abdul Rehman Antall; Naveed Akhtar; | arxiv-cs.CL | 2025-08-13 |
| 580 | A Comparative Analysis on ASR System Combination for Attention, CTC, Factored Hybrid, and Transducer Models Highlight: In this work, we compare model combination across popular ASR architectures. |
NOURELDIN BAYOUMI et. al. | arxiv-cs.SD | 2025-08-13 |
| 581 | Revealing The Role of Audio Channels in ASR Performance Degradation Abstract: Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input … |
Kuan-Tang Huang; Li-Wei Chen; Hung-Shin Lee; Berlin Chen; Hsin-Min Wang; | arxiv-cs.SD | 2025-08-12 |
| 582 | Munsit at NADI 2025 Shared Task 2: Pushing The Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning Highlight: In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. |
Mahmoud Salhab; Shameed Sait; Mohammad Abusheikh; Hasan Abusheikh; | arxiv-cs.CL | 2025-08-12 |
| 583 | Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches Highlight: Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness in dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). |
Ahmed Aboeitta; Ahmed Sharshar; Youssef Nafea; Shady Shehata; | arxiv-cs.SD | 2025-08-11 |
| 584 | A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions Highlight: To further optimize AEC’s downstream applications, we introduce a novel post-processing strategy employing tailored parameters designed specifically for tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR), thus enhancing their overall efficacy. |
Yiheng Jiang; Tian Biao; | arxiv-cs.SD | 2025-08-10 |
| 585 | Fairness of Automatic Speech Recognition: Looking Through A Philosophical Lens Highlight: This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation; it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. |
Anna Seo Gyeong Choi; Hoon Choi; | arxiv-cs.CL | 2025-08-09 |
| 586 | Whisfusion: Parallel ASR Decoding Via A Diffusion Transformer Highlight: We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. |
TAEYOUN KWON et. al. | arxiv-cs.SD | 2025-08-09 |
| 587 | A Study on Regularization-Based Continual Learning Methods for Indic ASR Highlight: We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is then incrementally trained on eight additional Indian languages, for a total sequence of nine languages. |
Gokul Adethya T; S. Jaya Nirmala; | arxiv-cs.LG | 2025-08-08 |
| 588 | Improved Dysarthric Speech to Text Conversion Via TTS Personalization Highlight: We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. |
PÉTER MIHAJLIK et. al. | arxiv-cs.SD | 2025-08-08 |
| 589 | SPGISpeech 2.0: Transcribed Multi-speaker Financial Audio for Speaker-tagged Transcription Highlight: We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. |
RAYMOND GROSSMAN et. al. | arxiv-cs.SD | 2025-08-07 |
| 590 | What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems Highlight: Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to recognize information in user speech related to response generation appropriately. Examining selective listening of humans, which refers to the ability to focus on and listen to important parts of a conversation during the speech, will enable us to identify the ASR capabilities required for SDSs and evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions for generating dialogue responses and reference transcriptions. |
KIYOTADA MORI et. al. | arxiv-cs.CL | 2025-08-06 |
| 591 | NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations IF:3 Highlight: We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. |
HUAN LIAO et. al. | arxiv-cs.SD | 2025-08-06 |
| 592 | Efficient Scaling for LLM-based ASR Highlight: Through comprehensive and controlled experiments, we find that pretraining the speech encoder before integrating it with the LLM leads to significantly better scaling efficiency than the standard practice of joint post-training of LLM-ASR. Based on this insight, we propose a new multi-stage LLM-ASR training strategy, EFIN: Encoder First Integration. |
Bingshen Mu; Yiwen Shao; Kun Wei; Dong Yu; Lei Xie; | arxiv-cs.SD | 2025-08-06 |
| 593 | Pitch Accent Detection Improves Pretrained Automatic Speech Recognition Highlight: We show the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. |
David Sasu; Natalie Schluter; | arxiv-cs.CL | 2025-08-06 |
| 594 | Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts Highlight: To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). |
HOJUN JIN et. al. | arxiv-cs.CL | 2025-08-05 |
| 595 | RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition Abstract: In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented … |
Pengcheng Wang; Sheng Li; Takahiro Shinozaki; | ArXiv | 2025-08-05 |
| 596 | Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences Highlight: Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. |
DMITRII KORZH et. al. | arxiv-cs.CV | 2025-08-05 |
| 597 | Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS Highlight: We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. |
Vignesh Ethiraj; Ashwath David; Sidhanth Menon; Divya Vijay; | arxiv-cs.SD | 2025-08-05 |
| 598 | LLM-Enhanced Spoken Named Entity Recognition Leveraging ASR N-Best Hypotheses Abstract: Identifying Personally Identifiable Information (PII) from spoken documents is crucial for privacy preservation in speech processing. Unlike written text, spoken language exhibits … |
Farha Azmi; Rong Tong; | 2025 International Conference on Asian Language Processing … | 2025-08-03 |
| 599 | Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around The Globe Highlight: We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. |
TIANTIAN FENG et. al. | arxiv-cs.SD | 2025-08-03 |
| 600 | Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting Highlight: However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose Token Map Drafting, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. |
Tuan Vu Ho; Hiroaki Kokubo; Masaaki Yamamoto; Yohei Kawaguchi; | arxiv-cs.CL | 2025-07-29 |
| 601 | The Interspeech 2025 Speech Accessibility Project Challenge Highlight: Hosted on EvalAI and leveraging the remote evaluation pipeline, the SAP Challenge evaluates submissions based on Word Error Rate and Semantic Score. |
XIUWEN ZHENG et. al. | arxiv-cs.AI | 2025-07-29 |
| 602 | Whisper Smarter, Not Harder: Adversarial Attack on Partial Suppression Highlight: However, recent studies have demonstrated the possibility of adversarial attack on these models which could potentially suppress or disrupt model output. We investigate and verify the robustness of these attacks and explore if it is possible to increase their imperceptibility. We additionally find that by relaxing the optimisation objective from complete suppression to partial suppression, we can further decrease the imperceptibility of the attack. |
Zheng Jie Wong; Bingquan Shen; | arxiv-cs.SD | 2025-07-29 |
| 603 | A Deep Learning Automatic Speech Recognition Model for Shona Language Highlight: This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. |
Leslie Wellington Sirora; Mainford Mutandavari; | arxiv-cs.CL | 2025-07-28 |
| 604 | Self-Improvement for Audio Large Language Model Using Unlabeled Speech IF:3 Highlight: We propose a self-improvement method called SI-SDA, leveraging the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then perform domain adaptation based on reinforcement learning optimization. |
Shaowen Wang; Xinyuan Chen; Yao Xu; | arxiv-cs.SD | 2025-07-27 |
| 605 | MLLM-based Speech Recognition: When and How Is Multimodality Beneficial? Highlight: Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. |
Yiwen Guan; Viet Anh Trinh; Vivek Voleti; Jacob Whitehill; | arxiv-cs.SD | 2025-07-25 |
| 606 | HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track Highlight: This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. |
XUCHEN WEI et. al. | arxiv-cs.CL | 2025-07-25 |
| 607 | The Eloquence Team Submission for Task 1 of MLC-SLM Challenge Highlight: In this paper, we present our studies and experiments carried out for the task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language models architectures. |
Lorenzo Concina; Jordi Luque; Alessio Brutti; Marco Matassoni; Yuchen Zhang; | arxiv-cs.SD | 2025-07-25 |
| 608 | Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization Highlight: In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. |
Hsuan-Yu Wang; Pei-Ying Lee; Berlin Chen; | arxiv-cs.CL | 2025-07-25 |
| 609 | Phoneme-Level Visual Speech Recognition Via Point-Visual Fusion and Language Model Reconstruction Highlight: Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM model for word reconstruction to address these challenges. |
Matthew Kit Khinn Teng; Haibo Zhang; Takeshi Saitoh; | arxiv-cs.CV | 2025-07-24 |
| 610 | Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, A Low-Resource Language Highlight: In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. |
Md Obyedullahil Mamun; Md Adyelullahil Mamun; Arif Ahmad; Md. Imran Hossain Emu; | arxiv-cs.CL | 2025-07-24 |
| 611 | Synthetic Voice Data for Automatic Speech Recognition in African Languages Highlight: We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. |
Brian DeRenzi; Anna Dixon; Mohamed Aymane Farhi; Christian Resch; | arxiv-cs.CL | 2025-07-23 |
| 612 | Application of Whisper in Clinical Practice: The Post-Stroke Speech Assessment During A Naming Task Highlight: In this study, we evaluate whether Whisper, a state-of-the-art ASR foundation model, can be applied to transcribe and analyze speech from patients with stroke during a commonly used picture-naming task. |
MILENA DAVUDOVA et. al. | arxiv-cs.SD | 2025-07-23 |
| 613 | BoSS: Beyond-Semantic Speech Highlight: We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. |
QING WANG et. al. | arxiv-cs.SD | 2025-07-23 |
| 614 | The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge Highlight: This paper presents the TEA-ASLP’s system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. |
Hongfei Xue; Kaixun Huang; Zhikai Zhou; Shen Huang; Shidong Shang; | arxiv-cs.SD | 2025-07-23 |
| 615 | Triple X: A LLM-Based Multilingual Speech Recognition System for The INTERSPEECH2025 MLC-SLM Challenge Highlight: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. |
Miaomiao Gao; Xiaoxiao Xiang; Yiwen Guo; | arxiv-cs.CL | 2025-07-23 |
| 616 | DNCASR: End-to-End Training for Speaker-Attributed ASR Highlight: This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. |
Xianrui Zheng; Chao Zhang; Phil Woodland; | acl | 2025-07-21 |
| 617 | MultiMed: Multilingual Medical Speech Recognition Via Attention Encoder Decoder IF:3 Highlight: In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. |
KHAI LE-DUC et. al. | acl | 2025-07-21 |
| 618 | Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning Highlight: We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. |
MINGFEI LAU et. al. | acl | 2025-07-21 |
| 619 | MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization Highlight: We introduce MISP-Meeting, a new real-world, multimodal dataset that covers subject-oriented long-form content. |
HangChen HangChen; Chao-Han Huck Yang; Jia-Chen Gu; Sabato Marco Siniscalchi; Jun Du; | acl | 2025-07-21 |
| 620 | GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement IF:3 Highlight: This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. |
YIFAN YANG et. al. | acl | 2025-07-21 |
| 621 | Unifying Streaming and Non-streaming Zipformer-based ASR Highlight: We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. |
BIDISHA SHARMA et. al. | acl | 2025-07-21 |
| 622 | Improving Language and Modality Transfer in Translation By Character-level Modeling Highlight: To this end, we propose a character-based approach to improve adaptability to new languages and modalities. |
Ioannis Tsiamas; David Dale; Marta R. Costa-jussà; | acl | 2025-07-21 |
| 623 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 IF:3 Highlight: However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. |
JIAMING ZHOU et. al. | acl | 2025-07-21 |
| 624 | NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model Highlight: In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. |
YEN-TING LIN et. al. | acl | 2025-07-21 |
| 625 | Dialectal Coverage And Generalization in Arabic Speech Recognition Highlight: In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. |
Amirbek Djanibekov; Hawau Olamide Toyin; Raghad Alshalan; Abdullah Alatir; Hanan Aldarmaki; | acl | 2025-07-21 |
| 626 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Highlight: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. |
JENALEA RAJAB et. al. | acl | 2025-07-21 |
| 627 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models IF:3 Highlight: However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. |
YIFU CHEN et. al. | acl | 2025-07-21 |
| 628 | That Doesn’t Sound Right: Evaluating Speech Transcription Quality in Field Linguistics Corpora Highlight: In this paper, we explore methods for identifying speech transcriptions in fieldwork data that may be unsuitable for training ASR models. |
Eric Le Ferrand; Bo Jiang; Joshua Hartshorne; Emily Prud’hommeaux; | acl | 2025-07-21 |
| 629 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children Highlight: These demographics possess distinct vocal characteristics, linguistic styles, and interaction patterns that challenge conventional ASR, TTS, and LLM systems. To address this, we introduce EchoVoices, an end-to-end digital human pipeline dedicated to creating persistent digital personas for seniors and children, ensuring their voices and memories are preserved for future generations. |
HAIYING XU et. al. | arxiv-cs.SD | 2025-07-20 |
| 630 | Weak Supervision Techniques Towards Enhanced ASR Models in Industry-level CRM Systems Highlight: However, this process faces the challenge of discerning customer voices and intentions, and general pre-trained automatic speech recognition (ASR) models make it difficult to effectively address industry-specific speech recognition tasks. To address this issue, we innovatively proposed a solution for fine-tuning industry-specific ASR models, which significantly improved the performance of the fine-tuned ASR models in industry applications. |
ZHONGSHENG WANG et. al. | arxiv-cs.SD | 2025-07-19 |
| 631 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Highlight: This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. |
Lilit Grigoryan; Nikolay Karpov; Enas Albasiri; Vitaly Lavrukhin; Boris Ginsburg; | arxiv-cs.CL | 2025-07-18 |
| 632 | Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies Highlight: In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. |
CARLOS MENA et. al. | arxiv-cs.CL | 2025-07-18 |
| 633 | Reading Between The Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder Highlight: This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). |
FENG CHEN et. al. | arxiv-cs.CL | 2025-07-17 |
| 634 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech IF:3 Highlight: In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. |
Maksim Borisov; Egor Spirin; Daria Diatlova; | arxiv-cs.LG | 2025-07-17 |
| 635 | Improving Contextual ASR Via Multi-grained Fusion with Large Language Models Highlight: In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). |
Shilin Zhou; Zhenghua Li; | arxiv-cs.CL | 2025-07-16 |
| 636 | WhisperKit: On-device Real-time ASR with Billion-Scale Transformers Abstract: Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, … |
Atila Orhon; Arda Okan; Berkin Durmus; Zach Nagengast; Eduardo Pacheco; | arxiv-cs.AI | 2025-07-14 |
| 637 | DQLoRA: A Lightweight Domain-Aware Denoising ASR Via Adapter-guided Distillation Highlight: We present a demo of DQLoRA, an Adapter-Guided Distillation framework for robust speech recognition under low-resource and noisy conditions. |
Yiru Yang; | arxiv-cs.SD | 2025-07-14 |
| 638 | Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Highlight: To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). |
Mengzhe Geng; Patrick Littell; Aidan Pine; Marc Tessier; Roland Kuhn; | arxiv-cs.SD | 2025-07-14 |
| 639 | ClaritySpeech: Dementia Obfuscation in Speech Abstract: Dementia, a neurodegenerative disease, alters speech patterns, creating communication barriers and raising privacy concerns. Current speech technologies, such as automatic speech … |
Dominika Woszczyk; Ranya Aloufi; Soteris Demetriou; | arxiv-cs.CL | 2025-07-12 |
| 640 | ILT-Iterative LoRA Training Through Focus-Feedback-Fix for Multilingual Speech Recognition Highlight: Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. |
Qingliang Meng; Hao Wu; Wei Liang; Wei Xu; Qing Zhao; | arxiv-cs.CL | 2025-07-11 |
| 641 | The Impact of Automatic Speech Transcription on Speaker Attribution Highlight: In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. |
Cristina Aggazzotti; Matthew Wiesner; Elizabeth Allyn Smith; Nicholas Andrews; | arxiv-cs.CL | 2025-07-11 |
| 642 | Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models Highlight: Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, with detailed analysis tools. |
CHEN FENG et. al. | arxiv-cs.SD | 2025-07-10 |
| 643 | Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review Highlight: Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. |
Maha Tufail Agro; Atharva Kulkarni; Karima Kadaoui; Zeerak Talat; Hanan Aldarmaki; | arxiv-cs.CL | 2025-07-10 |
| 644 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Highlight: To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. |
Zijin Gu; Tatiana Likhomanenko; Navdeep Jaitly; | arxiv-cs.CL | 2025-07-08 |
| 645 | How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures Highlight: In this study, we compare different performance and bias measures, from literature and proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. |
Tanvina Patel; Wiebke Hutiri; Aaron Yi Ding; Odette Scharenborg; | arxiv-cs.CL | 2025-07-08 |
| 646 | Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability Highlight: Most existing automatic speech recognition (ASR) research evaluate models using in-domain datasets. |
MARK ATTA MENSAH et. al. | arxiv-cs.CL | 2025-07-03 |
| 647 | A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages Highlight: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly, low-resource languages. |
SUMAYA AHMED SALIHS et. al. | arxiv-cs.CL | 2025-07-03 |
| 648 | Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla Highlight: Through systematic fine-tuning and hyperparameter optimization, including learning rate, epochs, and model checkpoint selection, we have compared the models based on Word Error Rate (WER), Character Error Rate (CER), Training Time, and Computational Efficiency. |
Md Sazzadul Islam Ridoy; Sumi Akter; Md. Aminur Rahman; | arxiv-cs.CL | 2025-07-02 |
| 649 | Bypassing Audio ReCAPTCHA with Automatic Speech Recognition Models Abstract: CAPTCHAs are challenges designed to distinguish humans from automated bots. However, with the growing capabilities of Automatic Speech Recognition (ASR) models, these challenges … |
PAUL AUBRY et. al. | 2025 IEEE European Symposium on Security and Privacy … | 2025-06-30 |
| 650 | Mind The Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions Highlight: We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. |
Duygu Altinok; | arxiv-cs.CL | 2025-06-28 |
| 651 | Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization Highlight: Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead. |
Duygu Altinok; | arxiv-cs.CL | 2025-06-28 |
| 652 | AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks Highlight: We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. |
Leander Melroy Maben; Gayathri Ganesh Lakshmy; Srijith Radhakrishnan; Siddhant Arora; Shinji Watanabe; | arxiv-cs.AI | 2025-06-28 |
| 653 | Mizo Automatic Speech Recognition: Leveraging Wav2vec 2.0 and XLS-R for Enhanced Accuracy in Low-Resource Language Processing Abstract: This study introduces a Mizo Automatic Speech Recognition (ASR) approach by fine-tuning Wav2vec 2.0 and XLS-R models. The research presents the newly developed Mizo speech … |
Andrew Bawitlung; S. Dash; Radha Mohan Pattanayak; | ACM Transactions on Asian and Low-Resource Language … | 2025-06-28 |
| 654 | Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in The Cockpit Highlight: This paper investigates and improves the transcription accuracy of cockpit conversations with Whisper models. |
Kartheek Kumar Reddy Nareddy; Sarah Ternus; Julia Niebling; | arxiv-cs.CL | 2025-06-27 |
| 655 | OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models IF:3 Highlight: In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. |
WILLIAM CHEN et. al. | icml | 2025-06-25 |
| 656 | A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations Abstract: Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic … |
Phurich Saengthong; Boonnithi Jiaramaneepinit; Sheng Li; Manabu Okumura; Takahiro Shinozaki; | ArXiv | 2025-06-25 |
| 657 | SING: Spatial Context in Large Language Model for Next-Gen Wearables Highlight: In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. To address the lack of existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset by using the LibriSpeech dataset. |
Ayushi Mishra; Yang Bai; Priyadarshan Narayanasamy; Nakul Garg; Nirupam Roy; | icml | 2025-06-25 |
| 658 | Accurate, Fast, Cheap: Choose Three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR Highlight: We present a strong limited-context attention (LCA) baseline, and show that RA layers are just as accurate while being more efficient. |
Martin Ratajczak; Jean-Philippe Robichaud; Jennifer Drexler Fox; | arxiv-cs.CL | 2025-06-24 |
| 659 | End-to-End Spoken Grammatical Error Correction Highlight: SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. |
Mengjie Qian; Rao Ma; Stefano Bannò; Mark J. F. Gales; Kate M. Knill; | arxiv-cs.CL | 2025-06-23 |
| 660 | Our Collective Voices: The Social and Technical Values of A Grassroots Chinese Stuttered Speech Dataset Abstract: The lack of authentic stuttered speech data has significantly limited the development of stuttering friendly automatic speech recognition (ASR) models. In previous work, we … |
Jingjin Li; Qisheng Li; Rong Gong; Lezhi Wang; Shaomei Wu; | Proceedings of the 2025 ACM Conference on Fairness, … | 2025-06-23 |
| 661 | Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech Abstract: Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition (ASR) systems. Despite recent … |
Niclas Pokel; Pehuen Moure; Roman Böhringer; Yingqiang Gao; | ArXiv | 2025-06-23 |
| 662 | Breaking The Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages Highlight: In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages with control of training data duration. |
Siyu Liang; Gina-Anne Levow; | arxiv-cs.CL | 2025-06-20 |
| 663 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Highlight: Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. |
Daejin Jo; Jeeyoung Yun; Byungseok Roh; Sungwoong Kim; | arxiv-cs.CL | 2025-06-20 |
| 664 | Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches Highlight: We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. |
Bornali Phukon; Xiuwen Zheng; Mark Hasegawa-Johnson; | arxiv-cs.LG | 2025-06-19 |
| 665 | Automatic Speech Recognition Biases in Newcastle English: An Error Analysis Highlight: This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. |
Dana Serditova; Kevin Tang; Jochen Steffens; | arxiv-cs.CL | 2025-06-19 |
| 666 | Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper Highlight: For long-form, we propose an algorithm using source separation as a vocal activity detector to derive segment boundaries, which results in a consistent reduction in WER relative to Whisper’s native long-form algorithm. |
Jaza Syed; Ivan Meresman Higgs; Ondřej Cífka; Mark Sandler; | arxiv-cs.SD | 2025-06-18 |
| 667 | Unifying Streaming and Non-streaming Zipformer-based ASR Highlight: We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. |
BIDISHA SHARMA et. al. | arxiv-cs.SD | 2025-06-17 |
| 668 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR Highlight: We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. |
Tuan Nguyen; Huy-Dat Tran; | arxiv-cs.CL | 2025-06-17 |
| 669 | Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Highlight: This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). |
Yizhou Peng; Hexin Liu; Eng Siong Chng; | arxiv-cs.CL | 2025-06-16 |
| 670 | Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition Via Soft Prompt Tuning Abstract: Large-scale multilingual ASR models like Whisper excel in high-resource settings but face challenges in low-resource scenarios, such as rare languages and code-switching (CS), due … |
Hongli Yang; Yizhou Peng; Hao Huang; Sheng Li; | ArXiv | 2025-06-16 |
| 671 | Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR Abstract: Recent advancements in multilingual automatic speech recognition (ASR) have been driven by large-scale end-to-end models like Whisper. However, challenges such as language … |
Hongli Yang; Sheng Li; Hao Huang; Ayiduosi Tuohan; Yizhou Peng; | ArXiv | 2025-06-16 |
| 672 | NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 Highlight: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. |
YIZHOU PENG et. al. | arxiv-cs.CL | 2025-06-16 |
| 673 | Seewo’s Submission to MLC-SLM: Lessons Learned from Speech Reasoning Language Models Highlight: We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. |
Bo Li; Chengben Xu; Wufeng Zhang; | arxiv-cs.CL | 2025-06-16 |
| 674 | SC-SOT: Conditioning The Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition Highlight: We propose Speaker-Conditioned Serialized Output Training (SC-SOT), an enhanced SOT-based training for E2E multi-talker ASR. |
Yuta Hirano; Sakriani Sakti; | arxiv-cs.SD | 2025-06-14 |
| 675 | An Exploration of Mamba for Speech Self-Supervised Models Highlight: While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. |
TZU-QUAN LIN et. al. | arxiv-cs.CL | 2025-06-14 |
| 676 | Adapting Whisper for Streaming Speech Recognition Via Two-Pass Decoding Highlight: In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. |
HAORAN ZHOU et. al. | arxiv-cs.SD | 2025-06-13 |
| 677 | Enabling Automatic Transcription of Child-centered Audio Recordings from Real-world Environments Highlight: In this work, we present an approach to automatically detect those utterances in longform audio that can be reliably transcribed with modern ASR systems, allowing automatic and relatively accurate transcription of a notable proportion of all speech in typical longform data. |
Daniil Kocharov; Okko Räsänen; | arxiv-cs.SD | 2025-06-13 |
| 678 | (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of A Phonetically Balanced Speech Test Highlight: We introduce the Simulated Phoneme Speech Test (SimPhon Speech Test) methodology, a novel, multi-stage computational pipeline for the in silico design and validation of a phonetically balanced minimal-pair speech test. |
Stefan Bleeck; | arxiv-cs.SD | 2025-06-13 |
| 679 | Improving Named Entity Transcription with Contextual LLM-based Revision Highlight: In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. |
Viet Anh Trinh; Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-06-12 |
| 680 | Efficient Multilingual ASR Finetuning Via LoRA Language Experts Abstract: Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and … |
JIAHONG LI et. al. | ArXiv | 2025-06-11 |
| 681 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Highlight: We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. |
CHENG-KANG CHOU et. al. | arxiv-cs.CL | 2025-06-10 |
| 682 | Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Highlight: In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems’ performance for aphasia speakers. |
Katelyn Xiaoying Mei; Anna Seo Gyeong Choi; Hilke Schellmann; Mona Sloane; Allison Koenecke; | arxiv-cs.CY | 2025-06-10 |
| 683 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Highlight: In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. |
Rui Hu; Xiaolong Lin; Jiawang Liu; Shixi Huang; Zhenpeng Zhan; | arxiv-cs.CL | 2025-06-09 |
| 684 | DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction Highlight: We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. |
SOLEE IM et. al. | arxiv-cs.CL | 2025-06-09 |
| 685 | Technical Report: A Practical Guide to Kaldi ASR Optimization Highlight: This technical report introduces innovative optimizations for Kaldi-based Automatic Speech Recognition (ASR) systems, focusing on acoustic model enhancement, hyperparameter tuning, and language model efficiency. |
Mengze Hong; Di Jiang; | arxiv-cs.SD | 2025-06-08 |
| 686 | Speech Recognition on TV Series with Video-guided Post-ASR Correction Highlight: Existing approaches fail to explicitly leverage the rich temporal and contextual information available in the video. To address this limitation, we propose a Video-Guided Post-ASR Correction (VPC) framework that uses a Video-Large Multimodal Model (VLMM) to capture video context and refine ASR outputs. |
Haoyuan Yang; Yue Zhang; Liqiang Jing; John H. L. Hansen; | arxiv-cs.SD | 2025-06-08 |
| 687 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Highlight: In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. |
JOYA CHEN et. al. | cvpr | 2025-06-07 |
| 688 | Automatic Speech Recognition of African American English: Lexical and Contextual Effects Highlight: This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. |
Hamid Mojarad; Kevin Tang; | arxiv-cs.CL | 2025-06-07 |
| 689 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Highlight: In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and select a final system output. |
CHRISTOPHE VAN GYSEL et. al. | arxiv-cs.CL | 2025-06-06 |
| 690 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition Highlight: This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. |
Thai-Binh Nguyen; Thi Van Nguyen; Quoc Truong Do; Chi Mai Luong; | arxiv-cs.CL | 2025-06-05 |
| 691 | LLM-based Phoneme-to-grapheme for Phoneme-based Speech Recognition Highlight: However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). |
Te Ma; Min Bi; Saierdaer Yusuyin; Hao Huang; Zhijian Ou; | arxiv-cs.SD | 2025-06-05 |
| 692 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Abstract: Although state-of-the-art Speech Foundational Models can produce high-quality text pseudo-labels, applying Semi-Supervised Learning (SSL) for in-the-wild real-world data remains … |
Wen Ding; Fan Qian; | ArXiv | 2025-06-05 |
| 693 | Customizing Speech Recognition Model with Large Language Model Feedback Highlight: In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly the named entities affected by domain mismatch, through feedback from a LLM. |
Shaoshi Ling; Guoli Ye; | arxiv-cs.CL | 2025-06-05 |
| 694 | Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering Highlight: Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. |
PRADEEP RANGAPPA et. al. | arxiv-cs.CL | 2025-06-04 |
| 695 | Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR Highlight: We systematically study how three different variables in training data — the number of speakers, the audio duration per each individual speaker, and the diversity of accents — affect ASR robustness towards unseen accents in a low-resource training regime. |
Zheng-Xin Yong; Vineel Pratap; Michael Auli; Jean Maillard; | arxiv-cs.CL | 2025-06-04 |
| 696 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using In-the-wild Data Highlight: To address the challenges, we introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that uses Large Language Models (LLMs) to correct pseudo-labels generated on in-the-wild data. |
Wen Ding; Fan Qian; | arxiv-cs.CL | 2025-06-04 |
| 697 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR Via Whisper Fine-Tuning Highlight: We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. |
Ömer Tarik Özyilmaz; Matt Coler; Matias Valdenegro-Toro; | arxiv-cs.CL | 2025-06-03 |
| 698 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation Highlight: To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. |
Verena Blaschke; Miriam Winkler; Constantin Förster; Gabriele Wenger-Glemser; Barbara Plank; | arxiv-cs.CL | 2025-06-03 |
| 699 | Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR Via Feature Attribution Highlight: In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. |
Dennis Fucci; Marco Gaido; Matteo Negri; Mauro Cettolo; Luisa Bentivogli; | arxiv-cs.CL | 2025-06-02 |
| 700 | Whale: Large-Scale Multilingual ASR Model with W2v-BERT and E-Branchformer with Large Speech Data Highlight: This paper reports on the development of a large-scale speech recognition model, Whale. |
Yosuke Kashiwagi; Hayato Futami; Emiru Tsunoo; Satoshi Asakawa; | arxiv-cs.CL | 2025-06-02 |
| 701 | HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation Highlight: Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. |
AMIR HUSSEIN et. al. | arxiv-cs.CL | 2025-06-02 |
| 702 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Highlight: Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data’s vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. |
Yu Nakagome; Michael Hentschel; | arxiv-cs.CL | 2025-06-01 |
| 703 | Fine-Tuning ASR for Stuttered Speech: Personalized Vs. Generalized Approaches Highlight: In this paper, we investigate fine-tuning ASRs for stuttered speech, comparing generalized models (trained across multiple speakers) to personalized models tailored to individual speech characteristics. |
Dena Mujtaba; Nihar Mahapatra; | arxiv-cs.SD | 2025-06-01 |
| 704 | PMF-CEC: Phoneme-Augmented Multimodal Fusion for Context-Aware ASR Error Correction With Error-Specific Selective Decoding Abstract: End-to-end automatic speech recognition (ASR) models often struggle to accurately recognize rare words. Previously, we introduced an ASR postprocessing method called error … |
Jiajun He; T. Toda; | IEEE Transactions on Audio, Speech and Language Processing | 2025-05-31 |
| 705 | Causal Structure Discovery for Error Diagnostics of Children’s ASR Highlight: Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies-such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce a causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. |
Vishwanath Pratap Singh; Md. Sahidullah; Tomi Kinnunen; | arxiv-cs.CL | 2025-05-31 |
| 706 | Chain-of-Thought Training for Open E2E Spoken Dialogue Systems Highlight: We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)’s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. |
SIDDHANT ARORA et. al. | arxiv-cs.CL | 2025-05-31 |
| 707 | LID Models Are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech Highlight: Our analysis reveals a simple method to enhance model robustness to accents through input chunking. (iii) We present an approach that integrates sequence-level information into our model without relying on monolingual ASR systems; this reduces accent-language confusion and significantly enhances performance on accented speech while maintaining comparable results on standard LID. |
Niyati Bafna; Matthew Wiesner; | arxiv-cs.CL | 2025-05-31 |
| 708 | ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances Highlight: We propose a pipeline that combines ASR and toxic spans detection for fine-grained identification of toxic content. |
Huy Ba Do; Vy Le-Phuong Huynh; Luan Thanh Nguyen; | arxiv-cs.CL | 2025-05-31 |
| 709 | Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization Highlight: In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. |
LUONG HO et. al. | arxiv-cs.CL | 2025-05-30 |
| 710 | Pseudo Labels-based Neural Speech Enhancement for The AVSR Task in The MISP-Meeting Challenge Highlight: This paper presents our system for the MISP-Meeting Challenge Track 2. |
Longjie Luo; Shenghui Lu; Lin Li; Qingyang Hong; | arxiv-cs.SD | 2025-05-30 |
| 711 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR Highlight: In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). |
Dimitrios Damianos; Georgios Paraskevopoulos; Alexandros Potamianos; | arxiv-cs.CL | 2025-05-30 |
| 712 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, directly using LLMs runs into the hallucination problem, which may lead to the modification of correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. |
YANGUI FANG et al. | arxiv-cs.CL | 2025-05-30 |
| 713 | Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. |
SUJEET KUMAR et al. | arxiv-cs.CL | 2025-05-30 |
| 714 | SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. |
PENG XIE et al. | arxiv-cs.CL | 2025-05-30 |
| 715 | Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. |
Qingzheng Wang; Jiancheng Sun; Yifan Peng; Shinji Watanabe; | arxiv-cs.SD | 2025-05-30 |
| 716 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. |
Zhennan Lin; Kaixun Huang; Wei Ren; Linju Yang; Lei Xie; | arxiv-cs.SD | 2025-05-29 |
| 717 | BeaverTalk: Oregon State University’s IWSLT 2025 Simultaneous Speech Translation System Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper discusses the construction, fine-tuning, and deployment of BeaverTalk, a cascaded system for speech-to-text translation as part of the IWSLT 2025 simultaneous translation task. |
Matthew Raffel; Victor Agostinelli; Lizhong Chen; | arxiv-cs.CL | 2025-05-29 |
| 718 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. |
Griffin Dietz Smith; Dianna Yee; Jennifer King Chen; Leah Findlater; | arxiv-cs.LG | 2025-05-29 |
| 719 | Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. |
YOUJUN CHEN et al. | arxiv-cs.SD | 2025-05-29 |
| 720 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. |
Stefan Bleeck; | arxiv-cs.SD | 2025-05-28 |
| 721 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. |
MINGCHEN SHAO et al. | arxiv-cs.SD | 2025-05-28 |
| 722 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposed an LLM-driven ASR-SED multi-task learning framework that jointly optimized the ASR and Stuttering Event Detection (SED) tasks. |
Shangkun Huang; Jing Deng; Jintao Kang; Rong Zheng; | arxiv-cs.SD | 2025-05-28 |
| 723 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We firmly believe the AISHELL-5 dataset will significantly advance the research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark. |
YUHANG DAI et al. | arxiv-cs.SD | 2025-05-28 |
| 724 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For the automatic speech recognition (ASR) system, we proposed an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. |
SHANGKUN HUANG et al. | arxiv-cs.SD | 2025-05-28 |
| 725 | Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. |
Titouan Parcollet; Yuan Tseng; Shucong Zhang; Rogier van Dalen; | arxiv-cs.CL | 2025-05-27 |
| 726 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. |
TIANYI XU et al. | arxiv-cs.CL | 2025-05-27 |
| 727 | Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a novel memory-efficient model compression approach for Conformer ASR and speech foundation systems. |
ZHAOQING LI et al. | arxiv-cs.SD | 2025-05-27 |
| 728 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. |
ZHAOQING LI et al. | arxiv-cs.SD | 2025-05-27 |
| 729 | GMU Systems for The IWSLT 2025 Low-Resource Speech Translation Shared Task Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. |
Chutong Meng; Antonios Anastasopoulos; | arxiv-cs.CL | 2025-05-27 |
| 730 | Exploring Generative Error Correction for Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we proposed a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). |
Moreno La Quatra; Alkis Koudounas; Valerio Mario Salerno; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-05-26 |
| 731 | In-context Language Learning for Endangered Languages in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). |
Zhaolin Li; Jan Niehues; | arxiv-cs.CL | 2025-05-26 |
| 732 | Automated Evaluation of Children’s Speech Fluency for Low-resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. |
BOWEN ZHANG et al. | arxiv-cs.SD | 2025-05-26 |
| 733 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. |
Dancheng Liu; Amir Nassereldine; Chenhui Xu; Jinjun Xiong; | arxiv-cs.CL | 2025-05-26 |
| 734 | KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. |
ZHAOLIN LI et al. | arxiv-cs.CL | 2025-05-26 |
| 735 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. |
Raphaël Bagat; Irina Illina; Emmanuel Vincent; | arxiv-cs.CL | 2025-05-26 |
| 736 | BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion which negatively impacts the final performance. |
Xun Gong; Anqi Lv; Zhiming Wang; Huijia Zhu; Yanmin Qian; | arxiv-cs.SD | 2025-05-25 |
| 737 | SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. |
SUSHANT GAUTAM et al. | arxiv-cs.CV | 2025-05-22 |
| 738 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches (pruning, distillation, layer skip etc.) transform the large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models for better performance. To address these issues, we introduce an efficacious two-step representation learning based approach capable of producing several small sized models from a single large model ensuring considerably better performance in limited number of epochs. |
Abdul Hannan; Alessio Brutti; Shah Nawaz; Mubashir Noman; | arxiv-cs.CV | 2025-05-22 |
| 739 | LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Furthermore, existing LLM-based GER approaches primarily rely on textual information, neglecting phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. |
Natsuo Yamashita; Masaaki Yamamoto; Hiroaki Kokubo; Yohei Kawaguchi; | arxiv-cs.SD | 2025-05-22 |
| 740 | Large Language Models Based ASR Error Correction for Child Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. |
ANFENG XU et al. | arxiv-cs.CL | 2025-05-22 |
| 741 | Differentiable K-means for Fully-optimized Discrete Token-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. |
Kentaro Onda; Yosuke Kashiwagi; Emiru Tsunoo; Hayato Futami; Shinji Watanabe; | arxiv-cs.SD | 2025-05-22 |
| 742 | Prosodically Enhanced Foreign Accent Simulation By Discrete Token-based Resynthesis Only with Native Speech Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we integrate duration modification to the previous method to simulate foreign accents more accurately. |
Kentaro Onda; Keisuke Imoto; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2025-05-21 |
| 743 | Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: An Analytical Study Towards Accent-robust ASR Only with Native Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we gained insight that contributes to achieving accent-robust ASR using only native speech data. |
Kentaro Onda; Keisuke Imoto; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2025-05-21 |
| 744 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To overcome these, we propose SIMA, a selective invocation for multilingual ASR that adapts to the difficulty level of the input speech. |
Hongfei Xue; Yufeng Tang; Jun Zhang; Xuelong Geng; Lei Xie; | arxiv-cs.SD | 2025-05-21 |
| 745 | PersonaTAB: Predicting Personality Traits Using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. |
Sho Inoue; Shai Wang; Haizhou Li; | arxiv-cs.SD | 2025-05-20 |
| 746 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice … |
CHIN-JOU LI et al. | arxiv-cs.CL | 2025-05-20 |
| 747 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. |
TIANTIAN FENG et al. | arxiv-cs.SD | 2025-05-20 |
| 748 | ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. |
YU PAN et al. | arxiv-cs.SD | 2025-05-19 |
| 749 | Improving Endpoint Detection in End-to-end Streaming ASR for Conversational Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose methods to improve EP by addressing delayed emission along with EP mistakes. |
ANANDH C et al. | arxiv-cs.CL | 2025-05-19 |
| 750 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present our submission to the Speech Accessibility Project challenge for dysarthric speech recognition. |
DOMINIK WAGNER et al. | arxiv-cs.SD | 2025-05-19 |
| 751 | ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce the ASR-FAIRBENCH leaderboard which is designed to assess both the accuracy and equity of ASR models in real-time. |
Anand Rai; Satyam Rahangdale; Utkarsh Anand; Animesh Mukherjee; | arxiv-cs.SD | 2025-05-16 |
| 752 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. |
Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-05-16 |
| 753 | LegoSLM: Connecting LLM with Speech Encoder Using CTC Posteriors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. |
Rao Ma; Tongzhou Chen; Kartik Audhkhasi; Bhuvana Ramabhadran; | arxiv-cs.CL | 2025-05-16 |
| 754 | Multi-Stage Speaker Diarization for Noisy Classrooms Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This study investigates the effectiveness of multi-stage diarization models using Nvidia’s NeMo diarization pipeline. |
Ali Sartaz Khan; Tolulope Ogunremi; Ahmed Adel Attia; Dorottya Demszky; | arxiv-cs.SD | 2025-05-16 |
| 755 | Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, which include data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. |
SUKAIRAJ HAFIZ IMAM et al. | arxiv-cs.CL | 2025-05-16 |
| 756 | EvilHarmony: Stealthy Adversarial Attacks Against Black-Box Speech Recognition Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems are vulnerable to adversarial examples (AEs), where small, carefully designed perturbations are added to original audio to mislead the … |
XUEJING YUAN et al. | 2025 IEEE Symposium on Security and Privacy (SP) | 2025-05-12 |
| 757 | Efficient Integration of ASR with Large Language Models to Enhance Video Search at Scale Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Currently, video is the most consumed content on the internet, with an increasing amount of information and knowledge shared as videos. This has led to a significant rise in video … |
QIANG ZHANG et al. | Companion Proceedings of the ACM on Web Conference 2025 | 2025-05-08 |
| 758 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports the construction of the Teochew-Wild, a speech corpus of the Teochew dialect. |
Linrong Pan; Chenglong Jiang; Gaoze Hou; Ying Gao; | arxiv-cs.CL | 2025-05-08 |
| 759 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. |
ZUWEI LONG et al. | arxiv-cs.CL | 2025-05-06 |
| 760 | Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. |
YANG LI et al. | naacl | 2025-05-04 |
| 761 | Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Wav2Prompt uses a straightforward training process with only the same data used to train an automatic speech recognition (ASR) model. |
Keqi Deng; Guangzhi Sun; Phil Woodland; | naacl | 2025-05-04 |
| 762 | AMPS: ASR with Multimodal Paraphrase Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a new technique AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. |
Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | naacl | 2025-05-04 |
| 763 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et al. | naacl | 2025-05-04 |
| 764 | Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To perform a controlled architectural comparison, we train all models from scratch rather than using large pretrained models and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1. |
Tsz Kin Lam; Marco Gaido; Sara Papi; Luisa Bentivogli; Barry Haddow; | naacl | 2025-05-04 |
| 765 | Medical Spoken Named Entity Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present *VietMed-NER* – the first spoken NER dataset in the medical domain. |
KHAI LE-DUC et al. | naacl | 2025-05-04 |
| 766 | BERSting at The Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. |
PAIGE TUTTÖSÍ et al. | arxiv-cs.CL | 2025-04-30 |
| 767 | Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). |
Máté Gedeon; | arxiv-cs.CL | 2025-04-30 |
| 768 | “It Feels Like We’re Not Meeting The Criteria”: Examining and Mitigating The Cascading Effects of Bias in Automatic Speech Recognition in Spoken Language Interfaces Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Researchers have demonstrated that Automatic Speech Recognition (ASR) systems perform differently across demographic groups (i.e. show bias), yet their downstream impact on spoken … |
Kelechi Ezema; Chelsea Chandler; Rosy Southwell; Niranjan Cholendiran; Sidney D’Mello; | Proceedings of the 2025 CHI Conference on Human Factors in … | 2025-04-25 |
| 769 | Live: Learning Video LLM with Streaming Speech Transcription at Scale Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary APIs (e.g., GPT-4o) to produce training data, which limits their training … |
JOYA CHEN et al. | 2025 IEEE/CVF Conference on Computer Vision and Pattern … | 2025-04-22 |
| 770 | Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset with Lip-reading and Presentation Slides Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. |
JINGHUA ZHAO et al. | arxiv-cs.MM | 2025-04-21 |
| 771 | Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This review paper considered the relevant works published in the last ten years (2011-2021). The selection criteria includes (a) type of AAI – Speaker Dependent and Speaker Independent AAI, (b) objectives of the work – Articulatory approximation, Articulatory Feature space selection and Automatic Speech Recognition (ASR), explore the correlation between acoustic and articulatory features, and framework for Computer-assisted language training, (c) Corpus – Simultaneously recorded speech (wav) and medical imaging models such as ElectroMagnetic Articulography (EMA), Electropalatography (EPG), Laryngography, Electroglottography (EGG), X-ray Cineradiography, Ultrasound, and real-time Magnetic Resonance Imaging (rtMRI), (d) Methods or models – recent works are considered, and therefore all the works are based on machine learning, (e) Evaluation – as AAI is a non-linear regression problem, the performance evaluation is mostly done by Correlation Coefficient (CC), Root Mean Square Error (RMSE), and also considered Mean Square Error (MSE), and Mean Format Error (MFE). |
Leena G Pillai; D. Muhammad Noorul Mubarak; | arxiv-cs.SD | 2025-04-17 |
| 772 | Dysarthria Normalization Via Local Lie Group Transformations for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present a geometry-driven method for normalizing dysarthric speech by modeling time, frequency, and amplitude distortions as smooth, local Lie group transformations of spectrograms. |
Mikhail Osipov; | arxiv-cs.SD | 2025-04-16 |
| 773 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. |
MAHMOUD SALHAB et al. | arxiv-cs.AI | 2025-04-16 |
| 774 | Can RAG-Driven Enhancements Amplify Audio LLMs for Low-Resource Languages? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work offers critical insights into the challenges of using LALMs for low-resource languages and provides a foundation for developing more inclusive and adaptable AI systems for complex multilingual tasks. |
B. Dutta; R. Ranjan; A. Jain; R. Singh; M. Vatsa; | icassp | 2025-04-15 |
| 775 | Dysarthric Speech Conformer: Adaptation for Sequence-to-Sequence Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a two-phase adaptation pipeline based on the Conformer architecture that leverages typical speech to transfer to individualized ASR models for dysarthric speakers. |
Q. Wang; | icassp | 2025-04-15 |
| 776 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
Z. Zhuang; | icassp | 2025-04-15 |
| 777 | Weakly Supervised Phonological Features for Pathological Speech Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a weakly supervised training method which exploits the known acoustic properties of phonemes by training an ASR model with an interpretable frame-level phonological feature bottleneck layer. |
J. Thienpondt; G. Vanderreydt; A. Hammami; K. Demuynck; | icassp | 2025-04-15 |
| 778 | Cohort-Sensitive Labeling: An Effective Approach for Enhancing ASR Performance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a cohort-sensitive labeling (CSL) for automatic speech recognition (ASR). |
J. Na; M. Hasegawa-Johnson; B. Lee; | icassp | 2025-04-15 |
| 779 | Developing A Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong’s Polylingual Dynamics Highlight: We have introduced a novel evaluation metric called Fidelity to the Original Audio, Accuracy, and Latency (FAL). |
P. Xie; K. Chen; | icassp | 2025-04-15 |
| 780 | EFL-PEFT: A Communication Efficient Federated Learning Framework Using PEFT Sparsification for ASR Highlight: In this paper, we consolidate the use of PEFT for ASR with pre-trained models, demonstrating that it enables efficient FL reducing the amount of parameters to share with respect to full fine-tuning. |
M. N. Ali; D. Falavigna; A. Brutti; | icassp | 2025-04-15 |
| 781 | HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models IF:3 Highlight: In this paper, we propose a novel parameter-efficient multi-domain fine-tuning method for adapting pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting named HDMoLE, which leverages hierarchical routing and dynamic thresholds based on combining low-rank adaptation (LoRA) with the mixture of experts (MoE) and can be generalized to any linear layer. |
B. Mu; K. Wei; Q. Shao; Y. Xu; L. Xie; | icassp | 2025-04-15 |
| 782 | Speech Few-Shot Learning for Language Learners’ Speech Recognition Highlight: This paper reports how speech recognition accuracy can be improved using the speech few-shot in-context learning capabilities of a multimodal foundation model when applied to the speech of language learners. |
J. Cheng; S. Nguyen; | icassp | 2025-04-15 |
| 783 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition Using Agnostic Contrastive Mixup Highlight: Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose a new domain adaptation method for low-resource ASR focused on contrastive mixup for joint-embedding architectures named AC-Mix (agnostic contrastive mixup). |
C. Carvalho; A. Abad; | icassp | 2025-04-15 |
| 784 | Chinese Speech Processing Via Chinese Character Feature Highlight: This paper focuses on the basic structure of Chinese characters: semantic-phonetic compound characters. This paper takes advantage of this feature of Chinese characters and innovatively proposes a Chinese speech-processing method based on character shape. |
R. Jiang; Z. Yang; W. Xi; X. Fu; J. Zhao; | icassp | 2025-04-15 |
| 785 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses Highlight: In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. |
Y. Yang; | icassp | 2025-04-15 |
| 786 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
C.-C. WANG et al. | icassp | 2025-04-15 |
| 787 | META-CAT: Speaker-Informed Speech Embeddings Via Meta Information Concatenation for Multi-talker ASR Highlight: We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. |
J. Wang; | icassp | 2025-04-15 |
| 788 | Task Vector Arithmetic for Low-Resource ASR Highlight: However, the performance is heavily dependent on the amount of pre-training data available for the target language, and fine-tuning with limited data can lead to overfitting, leaving room for improvement. To address these challenges, we propose task vector-based adaptation to improve the performance of automatic speech recognition (ASR) in low-resource languages. |
H. Nagasawa; S. Otake; S. Iwata; | icassp | 2025-04-15 |
| 789 | Mamba for Streaming ASR Combined with Unimodal Aggregation IF:3 Highlight: We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. |
Y. Fang; X. Li; | icassp | 2025-04-15 |
| 790 | Token-Level Contextual Network with Ladder-Shaped Attention for End-to-End ASR Highlight: More importantly, we propose a creative approach to address the challenge of the increasing size of expanding token list compared to phrase list. |
M. Fang; | icassp | 2025-04-15 |
| 791 | Injecting Visual Features Into Whisper for Parameter-Efficient Noise-Robust Audio-Visual Speech Recognition Highlight: This gap highlights the need for more efficient methods to leverage visual and acoustic information in AVSR tasks. To address this challenge, we propose AVWhisper, a parameter-efficient model that integrates visual and acoustic representations by injecting visual features from the AV-HuBERT encoder into the pre-trained Whisper model. |
Z. Yang; | icassp | 2025-04-15 |
| 792 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Z. Tang; D. Wang; S. Huang; S. Shang; | icassp | 2025-04-15 |
| 793 | Enhancing Low-Resource ASR Through Versatile TTS: Bridging The Data Gap IF:3 Highlight: With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. |
G. Yang; | icassp | 2025-04-15 |
| 794 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
H. Wang; | icassp | 2025-04-15 |
| 795 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
B. Yan; V. Pratap; S. Watanabe; M. Auli; | icassp | 2025-04-15 |
| 796 | Revise, Reason, and Recognize: LLM-Based Emotion Recognition Via Emotion-Specific Prompts and ASR Error Correction IF:3 Highlight: Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. |
Y. Li; Y. Gong; C.-H. H. Yang; P. Bell; C. Lai; | icassp | 2025-04-15 |
| 797 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
A. S. Gyeong Choi; J. Park; M. Oh; | icassp | 2025-04-15 |
| 798 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners IF:3 Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
U. Cappellazzo; | icassp | 2025-04-15 |
| 799 | ASR Benchmarking: Need for A More Representative Conversational Dataset Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |
G. Maheshwari; D. Ivanov; T. Johannet; K. El Haddad; | icassp | 2025-04-15 |
| 800 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
Z. Cui; | icassp | 2025-04-15 |
| 801 | XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models IF:3 Highlight: In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. |
S. Kumar; | icassp | 2025-04-15 |
| 802 | Towards A Single ASR Model That Generalizes to Disordered Speech Highlight: This study investigates the impact of integrating a dataset of disordered speech recordings (~1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. |
J. Tobin; K. Tomanek; S. Venugopalan; | icassp | 2025-04-15 |
| 803 | SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions Highlight: In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). |
D. Wagner; A. Churchill; S. Sigtia; E. Marchi; | icassp | 2025-04-15 |
| 804 | mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. |
S. Basak; A. Padarthi; M. Gowda; | icassp | 2025-04-15 |
| 805 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to code-switching (CS) from both encoder and decoder parts. |
J. Zhao; | icassp | 2025-04-15 |
| 806 | Joint Training Framework for Accent and Speech Recognition Based on Conformer Low-Rank Adaptation Highlight: This study introduces the Conformer Low-rank Adaptation for Joint Accent and Speech Recognition (CLAnSR), employing LoRA to augment both ASR and AR capabilities using a shared pre-trained base encoder. |
X. Zhuang; Y. Qian; S. Xu; M. Wang; | icassp | 2025-04-15 |
| 807 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition Via Diffusion Models Highlight: In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. |
N. L. Kühne; | icassp | 2025-04-15 |
| 808 | Speech Recognition with LLMs Adapted to Disordered Speech Using Reinforcement Learning Highlight: We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. |
C. NAGPAL et al. | icassp | 2025-04-15 |
| 809 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) scenario in Automatic Speech Recognition (ASR). |
F. ZHANG et al. | icassp | 2025-04-15 |
| 810 | Alignment-Free Training for Transducer-based Multi-Talker ASR Highlight: In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. |
T. Moriya; | icassp | 2025-04-15 |
| 811 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
T. Feng; | icassp | 2025-04-15 |
| 812 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. |
T.-B. Nguyen; A. Waibel; | icassp | 2025-04-15 |
| 813 | AMuSE: Attentive Multilingual Speech Encoding for Zero-Prior ASR Highlight: Building on previous studies where models were trained to handle mixed-prior (knowledge that the underlying language belongs to a known group), we propose Attentive Multilingual Speech Encoding (AMuSE), a training framework designed to match exact-prior performance even in the absence of underlying language information at runtime (zero-prior), thereby making the model prior-agnostic. |
A. Varshney; | icassp | 2025-04-15 |
| 814 | Large Language Model Should Understand Pinyin for Chinese ASR Error Correction IF:3 Highlight: In this paper, we propose Pinyin-enhanced GEC (PY-GEC), which leverages Pinyin—the phonetic representation of Mandarin Chinese—as supplementary information to improve Chinese ASR error correction. |
Y. Li; | icassp | 2025-04-15 |
| 815 | ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech Highlight: To address the challenge of high annotation costs in text-to-speech (TTS) generation, this paper introduces a semi-supervised learning framework specifically designed for low-resource TTS scenarios. |
F. Li; S. Chen; H. Yang; S. Yuan; | icassp | 2025-04-15 |
| 816 | Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives Highlight: We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. |
C. JACOBS et al. | icassp | 2025-04-15 |
| 817 | Self-Information Guided Speech Segmentation for Efficient Streaming ASR Highlight: This paper proposes a novel method that leverages self-information, a measure of the information contained within an utterance, as a supervisory signal for speech segmentation. |
W. S. Teo; Y. Minami; | icassp | 2025-04-15 |
| 818 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition Highlight: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). |
N. Moritz; | icassp | 2025-04-15 |
| 819 | Faster Speech-LLaMA Inference with Multi-token Prediction Highlight: In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. |
D. Raj; G. Keren; J. Jia; J. Mahadeokar; O. Kalinli; | icassp | 2025-04-15 |
| 820 | ValSub: Subsampling Validation Data to Mitigate Forgetting During ASR Personalization Highlight: However, such validation sets are large and impractical for mobile devices. Towards this, we propose a novel method to subsample a substantially large validation set into a smaller one while maintaining the ability to estimate forgetting. |
H. Mehmood; | icassp | 2025-04-15 |
| 821 | Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss Highlight: This paper introduces a novel contextual biasing approach that employs the Multi-label Synchronous Output CTC (MCTC) algorithm to enhance the synchronization between ASR and bias task outputs. |
M. Fang; | icassp | 2025-04-15 |
| 822 | Chain-of-Thought Prompting for Speech Translation IF:3 Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
K. Hu; | icassp | 2025-04-15 |
| 823 | Generating Targeted Universal Adversarial Perturbation Against Automatic Speech Recognition Via Phoneme Tailoring Highlight: Specifically, to improve attack ability, we propose a Diverse Audio Composition Enrichment method, which enhances the utilization of audio features through phoneme-level slicing and recombination. |
Y. ZHANG et al. | icassp | 2025-04-15 |
| 824 | Can Automated Speech Recognition Errors Provide Valuable Clues for Alzheimer’s Disease Detection? Highlight: Finally, we conduct an interpretability study, including linguistic and SHapley Additive exPlanations (SHAP) analyses. This study reveals that greater word distribution differences between AD and healthy control (HC) groups in ASR transcripts may be linked to these valuable clues. |
Y.-L. Liu; | icassp | 2025-04-15 |
| 825 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions IF:3 Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
L. Meng; | icassp | 2025-04-15 |
| 826 | Harnessing The Zero-Shot Power of Instruction-Tuned Large Language Model for Guiding End-to-End Speech Recognition Highlight: We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). |
Y. Higuchi; T. Ogawa; T. Kobayashi; | icassp | 2025-04-15 |
| 827 | Dynamic Language Group-based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
H. Huang; | icassp | 2025-04-15 |
| 828 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
J. Zhou; | icassp | 2025-04-15 |
| 829 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors IF:3 Highlight: In this work, we propose a RAG-like technique for correcting speech recognition entity name errors. |
E. Pusateri; | icassp | 2025-04-15 |
| 830 | Zero-resource Speech Translation and Recognition with LLMs Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
K. Mundnich; | icassp | 2025-04-15 |
| 831 | Extending Whisper for Emotion Prediction Using Word-level Pseudo Labels Highlight: This paper extends Whisper’s automatic speech recognition (ASR) capabilities to perform speech-based emotion recognition (SER) by incorporating word-level emotion classification alongside ASR output. |
C. Y. KWOK et al. | icassp | 2025-04-15 |
| 832 | Speech Recognition Rescoring with Large Speech-Text Foundation Models Highlight: In this work, we propose novel techniques to use multi-modal LLM for ASR rescoring. |
P. G. Shivakumar; | icassp | 2025-04-15 |
| 833 | Elevating Robust ASR By Decoupling Multi-Channel Speaker Separation and Speech Recognition Highlight: In this work, we propose to decouple the training of the multi-channel speaker separation frontend and the ASR backend, with the latter trained only on clean speech. |
Y. Yang; H. Taherian; V. A. Kalkhorani; D. Wang; | icassp | 2025-04-15 |
| 834 | Advancing Streaming ASR with Chunk-wise Attention and Trans-chunk Selective State Spaces Highlight: This paper explores enhancing streaming speech recognition through the integration of chunk-wise attention and selective state space models (SSMs). |
M. Mimura; T. Moriya; K. Matsuura; | icassp | 2025-04-15 |
| 835 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
T. Hori; M. Kocour; A. Haider; E. McDermott; X. Zhuang; | icassp | 2025-04-15 |
| 836 | Improved Recognition of The Speech of People with Parkinson’s Who Stutter Highlight: In this study, we propose a novel stuttered speech data augmentation approach to improve dysarthric speech recognition. |
J. Na; X. Zheng; B. Lee; M. Hasegawa-Johnson; | icassp | 2025-04-15 |
| 837 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments IF:3 Highlight: In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. |
A. A. Attia; D. Demszky; T. Ògúnremí; J. Liu; C. Espy-Wilson; | icassp | 2025-04-15 |
| 838 | Using Corrected ASR Projection to Improve AD Recognition Performance from Spontaneous Speech Highlight: However, Automatic Speech Recognition transcription errors, stemming from language impairments in AD and Mild Cognitive Impairment patients, can lead to information loss during feature extraction. To mitigate this, we introduce the Corrected ASR Projecting, CAP model. |
Y. ZHANG et al. | icassp | 2025-04-15 |
| 839 | Adaptive Decoding for Efficient Automatic Speech Recognition Highlight: In this paper, we propose an adaptive decoding method (ADD) to reduce the latency. |
X. Ma; | icassp | 2025-04-15 |
| 840 | From Characters to Subwords: Modeling Unit Conversion for Low-resource Speech Recognition Highlight: In this paper, we propose a novel low-resource ASR method that leverages the advantages of two different modeling units. |
Y. Wang; H. Zhang; H. Wang; L. Sun; M. Song; | icassp | 2025-04-15 |
| 841 | Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text Highlight: In this paper, a Fast estimator for WER (Fe-WER) is introduced, utilizing average pooling over self-supervised learning representations for speech and text. |
C. Park; C. Lu; M. Chen; T. Hain; | icassp | 2025-04-15 |
| 842 | Speech Re-Painting for Robust ASR Highlight: In this work, we introduce speech re-painting, a method for in-context augmented synthesis, using target training datasets to generate new utterances guided by speech and text on the fly in a zero-shot manner. |
K. Kastner; | icassp | 2025-04-15 |
| 843 | Target Speaker ASR with Whisper IF:3 Highlight: We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. |
A. POLOK et al. | icassp | 2025-04-15 |
| 844 | Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition Highlight: This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. |
K. Deng; | icassp | 2025-04-15 |
| 845 | Learning Rich Speech Representations with Acoustic-Semantic Factorization Highlight: Second, the entanglement of acoustic and semantic information can undermine model robustness, particularly in varied acoustic environments. To address these issues, we propose a two-branch multitask finetuning strategy that integrates Automatic Speech Recognition and transcript-aligned audio reconstruction, designed to preserve and disentangle semantic and acoustic information in a final layer of a pretrained model. |
M. Niu; | icassp | 2025-04-15 |
| 846 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
T. Parcollet; R. van Dalen; S. Zhang; S. Bhattacharya; | icassp | 2025-04-15 |
| 847 | Speech Enhancement with MAP-based Training for Robust ASR Highlight: Hence, this study proposes a maximum a posteriori (MAP) algorithm for training SE models by incorporating the posterior probability of clean speech, given the enhanced speech, into the loss function. |
Y.-J. Li; R. Chao; B. Su; Y. Tsao; | icassp | 2025-04-15 |
| 848 | Audio Diffusion with Large Language Models Highlight: In this paper, we explore an alternate approach to the popular method of using large language models (LLMs) as a second decoder for Automated Speech Recognition (ASR) and speech understanding tasks. |
Y. Huang; K. Kastner; K. Audhkhasi; B. Ramabhadran; A. Rosenberg; | icassp | 2025-04-15 |
| 849 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
T. Abu; Y. Shi; T. F. Zheng; D. Wang; | icassp | 2025-04-15 |
| 850 | Contextualization of ASR with LLM Using Phonetic Retrieval-based Augmentation IF:3 Highlight: However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use this named entity as a query to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. |
Z. Lei; | icassp | 2025-04-15 |
| 851 | Identifying and Mitigating Mismatched Language Code in Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This language mismatch can significantly reduce ASR quality. We present a technique to identify and mitigate this issue. |
J. Kim; | icassp | 2025-04-15 |
| 852 | Contextual ASR with Retrieval Augmented Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose leveraging large language models (LLMs) and retrieval-augmented generation (RAG) to enhance the contextual capabilities of ASR systems. |
C. Xiao; Z. Hou; D. Garcia-Romero; K. J. Han; | icassp | 2025-04-15 |
| 853 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
N. B. Shankar; Z. Wang; E. Eren; A. Alwan; | icassp | 2025-04-15 |
| 854 | A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions and long latency. |
Y. Jiang; B. Tian; | icassp | 2025-04-15 |
| 855 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. |
T. C. Chu; V. Tuan Dat Pham; T. K. Dao; N. Hoang Nguyen; S. Truong; | icassp | 2025-04-15 |
| 856 | Unveiling Performance Bias in ASR Systems: A Study on Gender, Age, Accent, and More Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study examined the performance of twenty variants of seven Automatic Speech Recognition models across four English-language datasets: L2 Arctic, Speech Accent Archive, CORAAL, and SBCSAE. |
M. Jahan; | icassp | 2025-04-15 |
| 857 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, training large ASR models from scratch remains costly. To address this issue, we introduce UME, a novel method that efficiently Upcycles pretrained dense ASR checkpoints into larger Mixture-of-Experts (MoE) architectures. |
L. FU et. al. | icassp | 2025-04-15 |
| 858 | PersoDA: Personalized Data Augmentation for Personalized ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR [1]–[3]. |
P. P. Parada; | icassp | 2025-04-15 |
| 859 | Continuously Learning New Words in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Typical errors include acronyms, named entities, and domain-specific special words for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach: Given the audio of a lecture talk with the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. |
C. Huber; A. Waibel; | icassp | 2025-04-15 |
| 860 | Scaling Multilingual Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the performance in other languages is lagging far behind, due to the lack of labeled multilingual video data. In this work, we reduce the performance gap with the help of three key advances: (i) introducing the largest multilingual lip-reading dataset to date, (ii) proposing a single multi-task architecture that can perform two tasks simultaneously: identify the language and transcribe the utterance, and (iii) jointly training this architecture on all the languages together, resulting in large WER improvements as opposed to training monolingual models separately. |
K. R. Prajwal; S. Hegde; A. Zisserman; | icassp | 2025-04-15 |
| 861 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. |
H. Yen; S. Ling; G. Ye; | icassp | 2025-04-15 |
| 862 | CJST: CTC Compressor Based Joint Speech and Text Training for Decoder-Only ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. |
W. Zhou; J. Jia; L. Sari; J. Mahadeokar; O. Kalinli; | icassp | 2025-04-15 |
| 863 | Regarding The Existence of The Internal Language Model in CTC-Based E2E ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the existence and strength of an ILM in CTC systems. |
Z. Zhao; P. Bell; | icassp | 2025-04-15 |
| 864 | LLM Based Text Generation for Improved Low-resource Speech Recognition Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prompting a large language model (LLM) to paraphrase input text can generate novel text data that is constrained to be semantically similar to the source data. We leverage this capability of LLMs to improve the performance of low-resource ASR systems by increasing the limited text training data while keeping the same spoken style. |
T. Nagano; | icassp | 2025-04-15 |
| 865 | Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While various methods have enhanced recognition accuracy in SE-ASR systems, they often require fine-tuning or re-training of SE or ASR models, which is impractical in many real-world applications. In this paper, we propose a lightweight distortion suppression (DS) network that addresses these artifacts without modifying the SE or ASR models, treating them as fixed black boxes. |
W. Wang; S. Zhao; Y. Qian; | icassp | 2025-04-15 |
| 866 | EPIC: Error Pattern Informed Correction for Classroom ASR with Limited Labeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, due to the unique structure of classroom dialogue, existing ASR systems often struggle to accurately recognize and organize spoken utterances, creating significant challenges for downstream tasks in educational dialogue analysis. To address this issue, we propose EPIC, a post-processing framework for classroom ASR error correction. |
L. Jia; H. Sun; Y. Wei; C. Qi; X. Yang; | icassp | 2025-04-15 |
| 867 | Bridging The Modality Gap for Speech-image Retrieval with Text Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to leverage text supervision to facilitate the alignment between speech and image feature spaces via an automatic speech recognition (ASR) auxiliary task. |
Y. Yang; L. Zhou; Y. Li; G. Ma; | icassp | 2025-04-15 |
| 868 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). |
G. CHENG et. al. | icassp | 2025-04-15 |
| 869 | Enhancing Multilingual ASR for Unseen Languages Via Language Embedding Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, despite its success, Whisper struggles with unseen languages, which are not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose a method that exploits these relationships to improve ASR performance of Whisper in unseen languages. |
S. -S. Huang; K. -P. Huang; A. T. Liu; H. -Y. Lee; | icassp | 2025-04-15 |
| 870 | Automatic Speech Recognition and Spoken Language Understanding of Maritime Radio Communications: A Case Study with Singapore Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents several contributions designed to improve the ASR and the SLU systems by releasing a dataset for ASR and SLU tasks in the maritime domain. |
P. Dat; J. M. Madhathil; T. Huy Dat; | icassp | 2025-04-15 |
| 871 | Enhancing Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve the performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). |
R. AMOOIE et. al. | icassp | 2025-04-15 |
| 872 | LLM Supervised Pre-training for Multimodal Emotion Recognition in Conversations IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. |
S. Dutta; S. Ganapathy; | icassp | 2025-04-15 |
| 873 | Improving Dialect Identification in Indian Languages Using Multimodal Features from Dialect Informed ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a novel multimodal architecture that leverages speech and text features to enhance DID performance. |
icassp | 2025-04-15 | |
| 874 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose StableQuant, a novel adaptive post-training quantization (PTQ) algorithm for widely used speech foundation models (SFMs). |
Y. Hong; H. Han; W. -J. Chung; H. -G. Kang; | icassp | 2025-04-15 |
| 875 | Revisiting Acoustic Features for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception that could be used to perform accurate and robust ASR. |
M. A. Shah; B. Raj; | icassp | 2025-04-15 |
| 876 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. |
L. Grigoryan; N. Karpov; E. Albasiri; V. Lavrukhin; B. Ginsburg; | icassp | 2025-04-15 |
| 877 | Adopting Whisper for Confidence Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. |
V. Aggarwal; S. S. Nair; Y. Verma; Y. Jogi; | icassp | 2025-04-15 |
| 878 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
E. Sarkar; M. Magimai.-Doss; | icassp | 2025-04-15 |
| 879 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. |
R. Nasretdinov; R. Korostik; A. Jukić; | icassp | 2025-04-15 |
| 880 | Speech Data Selection for Efficient ASR Fine-Tuning Using Domain Classifier and Pseudo-Label Filtering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In real-world speech data processing, the scarcity of annotated data and the abundance of unlabelled speech data present a significant challenge. To address this, we propose an efficient data selection pipeline for fine-tuning ASR models by generating pseudo-labels using WhisperX pipeline and selecting efficient labels for fine-tuning. |
P. Rangappa; | icassp | 2025-04-15 |
| 881 | Speech Emotion Recognition Based on Large-Scale Automatic Speech Recognizer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel speech emotion recognition (SER) method that fully leverages the architecture of Whisper, a large-scale automatic speech recognition (ASR) model. |
R. Fukuda; T. Kano; A. Ando; A. Ogawa; | icassp | 2025-04-15 |
| 882 | Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameter count (by at least 2×) and training time speedups (by 1.3× for ASR and 1.15× for AVSR). |
A. Fernandez-Lopez; S. Liu; L. Yin; S. Petridis; M. Pantic; | icassp | 2025-04-15 |
| 883 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
J. Zhou; | icassp | 2025-04-15 |
| 884 | Speech Retrieval-Augmented Generation Without Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. |
D. J. MIN et. al. | icassp | 2025-04-15 |
| 885 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
M. BARAŃSKI et. al. | icassp | 2025-04-15 |
| 886 | Spatial Audio Processing with Large Language Model on Wearable Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. |
Ayushi Mishra; Yang Bai; Priyadarshan Narayanasamy; Nakul Garg; Nirupam Roy; | arxiv-cs.SD | 2025-04-11 |
| 887 | Visual-Aware Speech Recognition for Noisy Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. |
Lakshmipathi Balaji; Karan Singla; | arxiv-cs.CL | 2025-04-09 |
| 888 | DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. |
XINGLIN LYU et. al. | arxiv-cs.CL | 2025-04-07 |
| 889 | IndicST: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The integration of speech modalities into large language models, known as Speech LLMs, is a promising area of research for applications like automatic speech recognition (ASR) and … |
Sanket Shah; Kavya Ranjan Saxena; Kancharana Manideep Bharadwaj; Sharath Adavanne; Nagaraj Adiga; | 2025 IEEE International Conference on Acoustics, Speech, … | 2025-04-06 |
| 890 | Selective Masking Adversarial Attack on Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we propose a Selective Masking Adversarial attack, namely SMA attack, which ensures that one audio source is selected for recognition while the other audio source is muted in dual-source scenarios. |
ZHENG FANG et. al. | arxiv-cs.CR | 2025-04-06 |
| 891 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect’s linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets — comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. |
Hedi Naouara; Jean-Pierre Lorré; Jérôme Louradour; | arxiv-cs.CL | 2025-04-03 |
| 892 | Chain of Correction for Full-text Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC’s performance. |
ZHIYUAN TANG et. al. | arxiv-cs.CL | 2025-04-02 |
| 893 | Whispering Under The Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we concentrate on using adversarial examples to mitigate unauthorized disclosure of speech privacy thwarted by potential eavesdroppers in speech communications. |
WEIFEI JIN et. al. | arxiv-cs.CR | 2025-04-01 |
| 894 | Wireless Mobile Network with Transfer Learning Algorithm for Multilingual Education and Historical Research Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Notwithstanding recent advancements in Automatic Speech Recognition (ASR), acknowledging children’s speech continues to pose a considerable problem. This mainly results from … |
BADIA MUKHITDINOVA et. al. | J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl. | 2025-03-31 |
| 895 | The Impact of Code-switched Synthetic Data Quality Is Task Dependent: Insights from MT and ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. |
Injy Hamed; Ngoc Thang Vu; Nizar Habash; | arxiv-cs.CL | 2025-03-30 |
| 896 | Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. |
Xabier de Zuazo; Eva Navas; Ibon Saratxaga; Inma Hernáez Rioja; | arxiv-cs.CL | 2025-03-30 |
| 897 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. |
YANGYANG MENG et. al. | arxiv-cs.CL | 2025-03-26 |
| 898 | FinAudio: A Benchmark for Audio Large Language Models in Financial Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. |
YUPENG CAO et. al. | arxiv-cs.CE | 2025-03-26 |
| 899 | Boosting The Transferability of Audio Adversarial Examples with Acoustic Representation Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks unfeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. |
Weifei Jin; Junjie Su; Hejia Wang; Yulin Ye; Jie Hao; | arxiv-cs.SD | 2025-03-25 |
| 900 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research. |
DAWIT KETEMA GETE et. al. | arxiv-cs.CL | 2025-03-24 |
| 901 | Elevating Robust Multi-Talker ASR By Decoupling Speaker Separation and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. |
Yufeng Yang; Hassan Taherian; Vahid Ahmadi Kalkhorani; DeLiang Wang; | arxiv-cs.SD | 2025-03-22 |
| 902 | Your Voice Is Your Voice: Supporting Self-expression Through Speech Generation and LLMs in Augmented and Alternative Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users’ expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). |
YIWEN XU et. al. | arxiv-cs.HC | 2025-03-21 |
| 903 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We conducted a user study with 75 participants to evaluate the feasibility and efficiency of this workflow. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
| 904 | Automatic Speech Recognition: Comparisons Between Convolutional Neural Networks, Hidden Markov Model And Hybrid Architecture Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems have been widely used as a practical method of interaction between humans and devices. They are typically employed to enhance the … |
Lyndainês Santos; Nícolas de Araújo Moreira; Robson Sampaio; Raizielle Lima; Francisco Carlos Mattos Brito Oliveira; | Expert Systems | 2025-03-19 |
| 905 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
| 906 | T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis Via Multitask Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce T2V2 (Text to Voice and Voice to Text), a unified non-autoregressive model capable of performing both automatic speech recognition (ASR) and text-to-speech (TTS) synthesis within the same framework. |
Nabarun Goswami; Hanqin Wang; Tatsuya Harada; | iclr | 2025-03-17 |
| 907 | CR-CTC: Consistency Regularization on CTC for Improved Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. |
ZENGWEI YAO et. al. | iclr | 2025-03-17 |
| 908 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. |
AOHAN ZENG et. al. | iclr | 2025-03-17 |
| 909 | Speech Robust Bench: A Robustness Benchmark For Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. |
Muhammad A Shah; David Solans Noguero; Mikko A. Heikkilä; Bhiksha Raj; Nicolas Kourtellis; | iclr | 2025-03-17 |
| 910 | Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. |
Hainan Xu; Travis M. Bartley; Vladimir Bataev; Boris Ginsburg; | iclr | 2025-03-17 |
| 911 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches to understand their limitations on this dataset and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively. |
Soumya Shamarao Jahagirdar; Jayasree Saha; C V Jawahar; | arxiv-cs.CV | 2025-03-11 |
| 912 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By detailing the performance of several of the most recent, widely-available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases. |
Michael McGuire; | arxiv-cs.CL | 2025-03-10 |
| 913 | Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. |
Lucas Block Medin; Thomas Pellegrini; Lucile Gelin; | arxiv-cs.SD | 2025-03-06 |
| 914 | Enhancing Traditional Kaldi Dysarthric Speech Recognition Using SSL-Features Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Dysarthric speech poses significant challenges for automated recognition systems, which perform well with normal speech. This paper investigates the adaptation of embeddings from … |
Paban Sapkota; Abhijit Sinha; H. Kathania; Sudarsana Reddy Kadiri; | 2025 National Conference on Communications (NCC) | 2025-03-06 |
| 915 | Large Language Models Cover for Speech Recognition Mistakes: Evaluating Conversational AI for Second Language Learners Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) technology has been reported to reach near-human performance in recent years, yet it continues to struggle with atypical speakers, particularly … |
Eva Verhelst; Tony Belpaeme; | 2025 20th ACM/IEEE International Conference on Human-Robot … | 2025-03-04 |
| 916 | Speaking with Robots in Noisy Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: A fundamental limitation for speech-enabled human-robot interaction (HRI) is automatic speech recognition (ASR), or the process of converting a raw speech signal into text. If … |
Shuubham Ojha; Felix Gervits; Carol Y. Espy-Wilson; | 2025 20th ACM/IEEE International Conference on Human-Robot … | 2025-03-04 |
| 917 | Direct Speech to Speech Translation: A Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our review examines the evolution of S2ST, comparing traditional cascade models which rely on automatic speech recognition (ASR), machine translation (MT), and text to speech (TTS) components with newer end to end and direct speech translation (DST) models that bypass intermediate text representations. |
Mohammad Sarim; Saim Shakeel; Laeeba Javed; Mohammad Nadeem; | arxiv-cs.CL | 2025-03-03 |
| 918 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study explores fine-tuning OpenAI’s Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. |
Samuel S. Sohn; Sten Knutsen; Karin Stromswold; | arxiv-cs.SD | 2025-03-03 |
| 919 | Unveiling Biases While Embracing Sustainability: Assessing The Dual Challenges of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a bias and sustainability focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performances. |
Ajinkya Kulkarni; Atharva Kulkarni; Miguel Couceiro; Isabel Trancoso; | arxiv-cs.CL | 2025-03-02 |
| 920 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents the development of ASR models fine-tuned specifically for Southeast Asian accents using a newly created dataset. |
MARCUS YU ZHE WEE et. al. | arxiv-cs.LG | 2025-02-27 |
| 921 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2025-02-26 |
| 922 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). |
KULUHAN BINICI et. al. | aaai | 2025-02-25 |
| 923 | Speed Master: Quick or Slow Play to Attack Speaker Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our paper presents a novel attack methodology named Speed Master, which undermines deep neural networks by manipulating the speed of speech samples. |
ZHE YE et. al. | aaai | 2025-02-25 |
| 924 | Connecting Voices: LoReSpeech As A Low-Resource Speech Parallel Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. |
Samy Ouzerrout; | arxiv-cs.CL | 2025-02-25 |
| 925 | LAMA-UT: Language Agnostic Multilingual ASR Through Orthography Unification and Language-Specific Transliteration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). |
Sangmin Lee; Woojin Chung; Hong-Goo Kang; | aaai | 2025-02-25 |
| 926 | Heuristic-free Knowledge Distillation for Streaming ASR Via Multi-modal Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Since the non-streaming teacher usually has less emission latency compared to the streaming student, the teacher’s prediction is typically shifted by $\tau$ frames, where the parameter $\tau$ is selected heuristically. In this paper, we observe that this manual shifting is sub-optimal and propose a novel framework, namely Heuristic-free KD. |
Ji Won Yoon; | aaai | 2025-02-25 |
| 927 | Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | aaai | 2025-02-25 |
| 928 | Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. |
JIAQING LIU et. al. | aaai | 2025-02-25 |
| 929 | Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. |
YIHAN WU et. al. | aaai | 2025-02-25 |
| 930 | Exploring Gender Disparities in Automatic Speech Recognition Technology Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates factors influencing Automatic Speech Recognition (ASR) systems’ fairness and performance across genders, beyond the conventional examination of demographics. |
Hend ElGhazaly; Bahman Mirheidari; Nafise Sadat Moosavi; Heidi Christensen; | arxiv-cs.CL | 2025-02-25 |
| 931 | Uncertainty-Aware Self-Training for CTC-Based Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the current sequence-level uncertainty estimation method for connectionist temporal classification (CTC) based ASR models drops the output probability information and depends only on the textual distance of decoded predictions. In this study, we argue that this results in limited performance improvement and propose a novel output probability-based sequence-level uncertainty estimation method. |
Eungbeom Kim; Kyogu Lee; | aaai | 2025-02-25 |
| 932 | Integrated Word2Vec-Based Speech Annotations for Enhanced EEG Decoding of Speech Intentions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Brain-computer interface (BCI) technology has promising applications as an intuitive communication tool and in fields such as language rehabilitation. This study aims to decode … |
SE-NA JANG et. al. | 2025 13th International Conference on Brain-Computer … | 2025-02-24 |
| 933 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, specifically focusing on enhancing rare word recognition performance. |
Haoxuan Wang; | arxiv-cs.CL | 2025-02-22 |
| 934 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. |
JENALEA RAJAB et. al. | arxiv-cs.CL | 2025-02-21 |
| 935 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. |
Hanin Atwany; Abdul Waheed; Rita Singh; Monojit Choudhury; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
| 936 | On The Robust Approximation of ASR Metrics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. |
Abdul Waheed; Hanin Atwany; Rita Singh; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
| 937 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. |
XIANGYU LU et. al. | arxiv-cs.CL | 2025-02-16 |
| 938 | MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional manners through 3 training objectives: ULM, BMLM, and UMLM. |
Qingliang Meng; Pengju Ren; Tian Li; Changsong Dai; Huizhi Liang; | arxiv-cs.CL | 2025-02-14 |
| 939 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We analyze three main architectural approaches: Network Speech Recognition (NSR), Distributed Speech Recognition (DSR), and Embedded Speech Recognition (ESR). |
Inaam F. Qutaiba I. Ali; | arxiv-cs.NI | 2025-02-10 |
| 940 | Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: State-of-the-art methods deploy self-supervised transfer learning where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve the performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). |
REIHANEH AMOOIE et. al. | arxiv-cs.CL | 2025-02-07 |
| 941 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et. al. | arxiv-cs.CL | 2025-02-06 |
| 942 | Integrating Automatic Speech Recognition Into Remote Healthcare Interpreting: A Pilot Study of Its Impact on Interpreting Quality Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. |
Shiyi Tan; Constantin Orăsan; Sabine Braun; | arxiv-cs.CL | 2025-02-05 |
| 943 | A Differentiable Alignment Framework for Sequence-to-Sequence Modeling Via Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. |
Yacouba Kaloga; Shashi Kumar; Petr Motlicek; Ina Kodrasi; | arxiv-cs.LG | 2025-02-03 |
| 944 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
Anna Seo Gyeong Choi; Jonghyeon Park; Myungwoo Oh; | arxiv-cs.CL | 2025-02-01 |
| 945 | When End-to-End Is Overkill: Rethinking Cascaded Speech-to-Text Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. |
Anna Min; Chenxu Hu; Yi Ren; Hang Zhao; | arxiv-cs.CL | 2025-02-01 |
| 946 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
Turi Abu; Ying Shi; Thomas Fang Zheng; Dong Wang; | arxiv-cs.CL | 2025-02-01 |
| 947 | SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). |
Dominik Wagner; Alexander Churchill; Siddharth Sigtia; Erik Marchi; | arxiv-cs.SD | 2025-01-31 |
| 948 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. |
Zhengdong Yang; Qianying Liu; Sheng Li; Fei Cheng; Chenhui Chu; | arxiv-cs.CL | 2025-01-29 |
| 949 | AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we introduce the AVE speech, a comprehensive multi-modal dataset for speech recognition tasks. |
DONGLIANG ZHOU et. al. | arxiv-cs.SD | 2025-01-28 |
| 950 | Speech Translation Refinement Using Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by the success of text-to-text translation refinement, this paper investigates how LLMs can improve the performance of speech translation by introducing a joint refinement process. |
HUAIXIA DOU et. al. | arxiv-cs.CL | 2025-01-25 |
| 951 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates the prevalence and impact of ASR errors in medical transcription in Nigeria, the United Kingdom, and the United States. |
Ayo Adedeji; Mardhiyah Sanni; Emmanuel Ayodele; Sarita Joshi; Tobi Olatunji; | arxiv-cs.CL | 2025-01-25 |
| 952 | DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, data2vec’s masked representation generation relies on multi-layer averaging, inevitably coupling these features. To address this limitation, we propose a decoupling quantization based data2vec (DQ-Data2vec) for multilingual ASR, which includes a data2vec backbone and two improved online K-means quantizers. |
Qijie Shao; Linhao Dong; Kun Wei; Sining Sun; Lei Xie; | arxiv-cs.SD | 2025-01-23 |
| 953 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. |
Moreno La Quatra; Valerio Mario Salerno; Yu Tsao; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-01-22 |
| 954 | Let SSMs Be ConvNets: State-space Modeling with Optimal Tensor Contractions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. |
Yan Ru Pei; | arxiv-cs.LG | 2025-01-22 |
| 955 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. |
XUELONG GENG et. al. | arxiv-cs.SD | 2025-01-22 |
| 956 | A Review on Speech Recognition Approaches and Challenges for Portuguese: Exploring The Feasibility of Fine-tuning Large-scale End-to-end Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: At present, automatic speech recognition has become an important bridge for human-computer interaction and is widely applied in multiple fields. The Portuguese speech recognition … |
Yan Li; Yapeng Wang; Lap-Man Hoi; Dingcheng Yang; Sio-Kei Im; | EURASIP J. Audio Speech Music. Process. | 2025-01-21 |
| 957 | DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing models and datasets are limited in their ability to effectively handle these challenges. To address this gap and foster progress in code-switching ASR research, we introduce the DOTA-ME-CS: Daily oriented text audio Mandarin-English code-switching dataset, which consists of 18.54 hours of audio data, including 9,300 recordings from 34 participants. |
Yupei Li; Zifan Wei; Heng Yu; Huichi Zhou; Björn W. Schuller; | arxiv-cs.SD | 2025-01-21 |
| 958 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
MATEUSZ BARAŃSKI et. al. | arxiv-cs.SD | 2025-01-20 |
| 959 | A Benchmark of French ASR Systems Based on Error Severity Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare them with a reference transcription, such as Word Error Rate (WER), which … |
Antoine Tholly; Jane Wottawa; Mickael Rouvier; Richard Dufour; | arxiv-cs.CL | 2025-01-18 |
| 960 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
Takaaki Hori; Martin Kocour; Adnan Haider; Erik McDermott; Xiaodan Zhuang; | arxiv-cs.CL | 2025-01-15 |
| 961 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
Natarajan Balaji Shankar; Zilai Wang; Eray Eren; Abeer Alwan; | arxiv-cs.CL | 2025-01-14 |
| 962 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. |
The Chuong Chu; Vu Tuan Dat Pham; Kien Dao; Hoang Nguyen; Quoc Hung Truong; | arxiv-cs.CL | 2025-01-13 |
| 963 | Extending Whisper for Korean-English Code-switching Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Among various automatic speech recognition (ASR) models, Whisper is a state-of-the-art model that demonstrates robust performance across various multilingual tasks. However, … |
H. Seong; N. Kim; Hyun Gon Ryu; Hyuk-Jae Lee; | 2025 IEEE International Conference on Consumer Electronics … | 2025-01-11 |
| 964 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. |
Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya; | arxiv-cs.CL | 2025-01-10 |
| 965 | Universal-2-TF: Robust All-Neural Text Formatting for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). |
Yash Khare; Taufiquzzaman Peyash; Andrea Vanzo; Takuya Yoshioka; | arxiv-cs.CL | 2025-01-10 |
| 966 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Nonetheless, the evaluation of multilingual SLU is limited to shallow tasks such as intent classification or language identification. This is why we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering via listening comprehension spanning 944 hours of speech across 92 languages. |
Fabian David Schmidt; Ivan Vulić; Goran Glavaš; David Ifeoluwa Adelani; | arxiv-cs.CL | 2025-01-10 |
| 967 | Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. |
YUYA ASANO et. al. | arxiv-cs.CL | 2025-01-10 |
| 968 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
Eklavya Sarkar; Mathew Magimai.-Doss; | arxiv-cs.LG | 2025-01-10 |
| 969 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Samba ASR, the first state of the art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state space models (SSMs). |
Syed Abdul Gaffar Shakhadri; Kruthika KR; Kartik Basavaraj Angadi; | arxiv-cs.CL | 2025-01-06 |
| 970 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
ZHONGJIAN CUI et. al. | arxiv-cs.SD | 2025-01-05 |
| 971 | Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to improve end-to-end (E2E) spoken language understanding (SLU) in an RNN transducer model (RNN-T) by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. |
Vishal Sunder; Eric Fosler-Lussier; | arxiv-cs.LG | 2025-01-03 |
| 972 | Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of “listening and seeing again”. |
Rui Liu; Hongyu Yuan; Haizhou Li; | arxiv-cs.MM | 2025-01-03 |
| 973 | Machine Learning and Deep Learning Approaches for Accent Recognition: A Review Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Accent recognition has attracted immense research interest owing to the advancements in automatic speech recognition (ASR) systems. Accent variations are an essential factor in … |
Muzaffar Ahmad Dar; Jagalingam Pushparaj; | IEEE Access | 2025-01-01 |
| 974 | Leveraging Synthetic Data for Improved Manipuri-English Code-Switched ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Accurately recognizing code-switched speech presents a significant challenge in the field of Automatic speech recognition (ASR), particularly for low-resource regional languages. … |
Naorem Karline Singh; Wangkheimayum Madal; Chingakham Neeta Devi; Hoomexsun Pangsatabam; Y. J. Chanu; | IEEE Access | 2025-01-01 |
| 975 | Research on Digital Human Speech Recognition Method in High-Disturbance Industrial Environment Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The advent of industrial robotics and speech technology has precipitated a paradigm shift in the manner in which humans and machines collaborate. This article investigates the … |
Pengyu Zhu; Xiaobin Li; Haiyan Sun; Zhuoyi Chen; Jingsi Wang; | IEEE Transactions on Instrumentation and Measurement | 2025-01-01 |
| 976 | Controllable Conformer for Speech Enhancement and Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We propose a novel approach to speech enhancement, termed Controllable ConforMer for Speech Enhancement (CCMSE), which leverages a Conformer-based architecture integrated with a … |
Zilu Guo; Jun Du; Sabato Marco Siniscalchi; Jia Pan; Qingfeng Liu; | IEEE Signal Processing Letters | 2025-01-01 |
| 977 | Low-Resource Speech Recognition of Radiotelephony Communications Based on Continuous Learning of In-Domain and Out-of-Domain Knowledge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) in air traffic control (ATC) is a low-resource task with limited data and difficult annotation. Fine-tuning self-supervised pre-trained models … |
Guimin Jia; Dong He; Xilong Zhou; | IEEE Signal Processing Letters | 2025-01-01 |
| 978 | MLSS: Mandarin English Code-Switching Speech Recognition Via Mutual Learning-Based Semi-Supervised Method Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Code-switching is a phenomenon of alternating use of two or more languages within or between utterances in communication that often occurs in multilingual communities. Recently, … |
Cao Hong Nga; Duc-Quang Vu; P. Le; H. H. Luong; Jia-Ching Wang; | IEEE Signal Processing Letters | 2025-01-01 |
| 979 | On-Device Automatic Speech Recognition for Low-Resource Languages in Mixed Reality Industrial Metaverse Applications: Practical Guidelines and Evaluation of A Shipbuilding Application in Galician Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: As the Metaverse and Mixed Reality (MR) technologies continue to evolve, enabling natural and intuitive user interfaces is crucial. However, supporting low-resource languages in … |
Antón Valladares-Poncela; Paula Fraga-Lamas; T. Fernández-Caramés; | IEEE Access | 2025-01-01 |
| 980 | Empowering Dysarthric Communication: Hybrid Transformer-CTC-Based Speech Recognition System Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Dysarthria, a motor speech disorder, impairs the muscles involved in speech production, leading to challenges in articulation, pronunciation, and overall communication. This … |
R. VINOTHA et. al. | IEEE Access | 2025-01-01 |
| 981 | Relative Applicability of Diverse Automatic Speech Recognition Platforms for Transcription of Psychiatric Treatment Sessions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Service delivery in mental healthcare involves documentation of sensitive patient-clinician conversations that require serious caution. Conventionally, clinicians take handwritten … |
Rana Zeeshan; J. Bogue; Mamoona Naveed Asghar; | IEEE Access | 2025-01-01 |
| 982 | Deep Learning-Based Coding Strategy for Improved Cochlear Implant Speech Perception in Noisy Environments IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) and speech enhancement are essential tools in modern life, aiding not only in machine interaction but also in supporting individuals with … |
Billel Essaid; Hamza Kheddar; N. Batel; M. Chowdhury; | IEEE Access | 2025-01-01 |
| 983 | Transformer-Based Approach for Solving Mathematical Problems Using Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this paper, we introduce Vox Calculi, a system designed to solve mathematical problems using voice transcriptions. By leveraging state-of-the-art pretrained Automatic Speech … |
Ante Grgurević; Marina Bagić Babac; | IEEE Access | 2025-01-01 |
| 984 | Hybrid DNN-HMM-Based Approach for Telugu Language Speech Recognition Abstract: In this paper, we discuss our research work in Telugu Indian language speech data for building a large language vocabulary to build Telugu speech recognition system. We have … |
M. Rama Rajeswari; S. Gangashetty; | IEEE Access | 2025-01-01 |
| 985 | Code-Switching ASR for Low-Resource Indic Languages: A Hindi-Marathi Case Study Abstract: This work examines the development of Automatic Speech Recognition (ASR) systems for low-resource languages, focusing on Hindi and Marathi, particularly in multilingual and … |
H. PALIVELA et al. | IEEE Access | 2025-01-01 |
| 986 | Automatic Classification of Speech Dysarthric Intelligibility Levels Using Textual Feature Abstract: The comprehension of human language is fundamentally important in modern intelligent systems. Automatic Speech Intelligibility assessment involves determining the efficiency with … |
Ghadeer Alharbi; N. Alamri; Sahar F. Sabbeh; | IEEE Access | 2025-01-01 |
| 987 | Pitch-Speed Feature Space Data Augmentation for Automatic Speech Recognition Improvement in Low-Resource Scenario Abstract: Rapid advancements in Artificial Intelligence (AI) and Human Computer Interaction (HCI) have introduced new sensory interaction paradigms that engage diverse age groups and skill … |
Syed Muhammad Zahid; Saad Ahmed Qazi; | IEEE Access | 2025-01-01 |
| 988 | Fairness in Automatic Speech Recognition Isn’t A One-Size-Fits-All Abstract: Modern Automatic Speech Recognition (ASR) systems are increasingly deployed in high-stakes settings, including clinical interviews, public services, and educational tools, where … |
Hend Elghazaly; Bahman Mirheidari; Heidi Christensen; N. Moosavi; | Conference on Empirical Methods in Natural Language … | 2025-01-01 |
| 989 | A Review of Speech Recognition and Application to Arabic Speech Recognition |
Eman Aboelela; Omar Mansour; | FICC | 2025-01-01 |
| 990 | Advancing Singlish Understanding: Bridging The Gap with Datasets and Multimodal Models Abstract: Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains … |
BIN WANG et al. | arxiv-cs.CL | 2025-01-01 |
| 991 | Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages Abstract: While automatic speech recognition (ASR) now achieves human-level accuracy for a dozen or so languages, the majority of the world’s languages lack the resources needed to train … |
Éric Le Ferrand; Cian Mohamed Bashar Hauser; Joshua K. Hartshorne; Emily Prud’hommeaux; | IJCNLP-AACL | 2025-01-01 |
| 992 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages IF:3 Highlight: This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. |
Or Haim Anidjar; Revital Marbel; Roi Yozevitch; | arxiv-cs.CL | 2024-12-31 |
| 993 | Fotheidil: An Automatic Transcription System for The Irish Language Highlight: This paper sets out the first web-based transcription system for the Irish language – Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. |
LIAM LONERGAN et al. | arxiv-cs.CL | 2024-12-31 |
| 994 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in A VR Pilot Study IF:3 Highlight: We present a virtual reality (VR) environment featuring conversational avatars powered by a locally-deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. |
Mykola Maslych; Christian Pumarada; Amirpouya Ghasemaghaei; Joseph J. LaViola Jr; | arxiv-cs.HC | 2024-12-30 |
| 995 | Zero-resource Speech Translation and Recognition with LLMs Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
KAREL MUNDNICH et al. | arxiv-cs.CL | 2024-12-24 |
| 996 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. |
JIAHUI ZHAO et al. | arxiv-cs.CL | 2024-12-21 |
| 997 | MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula Highlight: For instance, when a presenter reads Euler’s Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise $\LaTeX{}$ format (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. |
SIEUN HYEON et al. | arxiv-cs.CL | 2024-12-20 |
| 998 | Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition Highlight: This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. |
KEQI DENG et al. | arxiv-cs.CL | 2024-12-20 |
| 999 | LAMA-UT: Language Agnostic Multilingual ASR Through Orthography Unification and Language-Specific Transliteration Highlight: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). |
Sangmin Lee; Woo-Jin Chung; Hong-Goo Kang; | arxiv-cs.CL | 2024-12-19 |
| 1000 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
HE WANG et al. | arxiv-cs.SD | 2024-12-17 |