Paper Digest: Recent Papers on Speech Recognition
The Paper Digest Team extracted all recent Speech Recognition papers on our radar and generated a highlight sentence for each. The results are sorted by relevance and date. In addition to this static page, we also provide a real-time version of this article, which has broader coverage and is updated continuously with the most recent work on this topic.
This curated list is created by the Paper Digest Team. Paper Digest is an AI-powered research platform that delivers personalized, comprehensive daily digests of the latest research in your field. It also helps you read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: Recent Papers on Speech Recognition
| # | Paper | Author(s) | Source | Date |
|---|---|---|---|---|
| 1 | Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition. Highlight: We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. | Warit Sirichotedumrong et al. | arxiv-cs.CL | 2026-01-19 |
| 2 | DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems. Highlight: To fill this gap, we first utilize gradient analysis to reveal that ASR and SR exhibit no inherent conflicts. Building on this, we propose Dual-task Universal Adversarial Perturbation (DUAP). | Suyang Sun; Weifei Jin; Yuxin Cao; Wei Song; Jie Hao | arxiv-cs.CR | 2026-01-19 |
| 3 | SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition. Highlight: Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. | Pu Wang; Shinji Watanabe; Hugo Van hamme | arxiv-cs.SD | 2026-01-18 |
| 4 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for A Unified Chinese Wu Dialect Speech Processing Ecosystem. Highlight: In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. | Chengyou Wang et al. | arxiv-cs.SD | 2026-01-16 |
| 5 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People Who Stutter. Highlight: We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. | Ziqi Xu et al. | arxiv-cs.CY | 2026-01-15 |
| 6 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers. Highlight: In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. | Runyuan Cai; Yu Lin; Yiming Wang; Chunlin Fu; Xiaodong Zeng | arxiv-cs.SD | 2026-01-15 |
| 7 | Wav2Vec-based Audio Data Augmentation for Low-Resource Speech Recognition. Highlight: This article focuses on ADA techniques, namely adding noise, pitch shifting, increasing or decreasing speed, and adding reverberation to the audio signals. | P. Haritha; P. Shanmugavadivu | International Research Journal on Advanced Engineering Hub … | 2026-01-13 |
| 8 | SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages. Highlight: A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. | Tianyi Xu et al. | arxiv-cs.CL | 2026-01-13 |
| 9 | Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects. Highlight: Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. | Kalvin Chang; Yiwen Shao; Jiahong Li; Dong Yu | arxiv-cs.CL | 2026-01-12 |
| 10 | Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition. Highlight: This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. | Ayman Mansour | arxiv-cs.CL | 2026-01-11 |
| 11 | Multimodal In-context Learning for ASR of Low-resource Languages. Highlight: In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. | Zhaolin Li; Jan Niehues | arxiv-cs.CL | 2026-01-09 |
| 12 | Stuttering-Aware Automatic Speech Recognition for Indonesian Language. Highlight: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. | Fadhil Muhammad; Alwin Djuliansah; Adrian Aryaputra Hamzah; Kurniawati Azizah | arxiv-cs.CL | 2026-01-07 |
| 13 | Multi-channel Multi-speaker Transformer for Speech Recognition. Highlight: Based on these, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR in this paper. | Guo Yifan; Tian Yao; Suo Hongbin; Wan Yulong | arxiv-cs.SD | 2026-01-05 |
| 14 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection. Highlight: This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the pioneering method designed to safeguard audio privacy using reversible adversarial examples. | Jiajie Zhu et al. | arxiv-cs.SD | 2026-01-03 |
| 15 | Index-ASR Technical Report. Highlight: Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. | Zheshu Song et al. | arxiv-cs.SD | 2025-12-31 |
| 16 | ProfASR-Bench: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech. Highlight: We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. | Deepak Babu Piskala | arxiv-cs.CL | 2025-12-29 |
| 17 | Fine-Tuning Whisper Model for Mandar Speech Recognition: Approach and Performance Evaluation. Highlight: This study aims to enhance the performance of Automatic Speech Recognition (ASR) systems by fine-tuning the Whisper model using a Mandar-specific dataset. | Jafar Jafar; Mar Athul Wazithah Tb; Firman Aziz; Rosary Iriany; Norma Nasir | Journal of Applied Engineering and Technological Science … | 2025-12-29 |
| 18 | When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems. Highlight: We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a) using 500 medical speech recordings under nine noise conditions. | Sujal Chondhekar et al. | arxiv-cs.SD | 2025-12-19 |
| 19 | Voice Assistant for Desktop. Highlight: This paper presents the design and implementation of a Voice Assistant for Desktop, an intelligent system that enables users to interact with desktop computers using natural language voice commands. | Geeta Patil | International Journal for Research in Applied Science and … | 2025-12-18 |
| 20 | Adapting Speech Language Model to Singing Voice Synthesis. Highlight: In this work, we adapt a 1.7B-parameter TTS-pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. | Yiwen Zhao et al. | arxiv-cs.SD | 2025-12-16 |
| 21 | System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare. Highlight: We present the design, implementation, and in-situ deployment of a smartphone-based voice-enabled AI system for generating electronic medical records (EMRs) and clinical risk alerts in maternal healthcare settings. | Maryam Mustafa et al. | arxiv-cs.HC | 2025-12-13 |
| 22 | TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage. Highlight: We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. | Elroy Galbraith; Chadwick Sutherland; Donahue Morgan | arxiv-cs.CL | 2025-12-11 |
| 23 | Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data. Highlight: We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. | Srihari Bandarupalli; Bhavana Akkiraju; Charan Devarakonda; Vamsiraghusimha Narsinga; Anil Kumar Vuppala | arxiv-cs.CL | 2025-12-08 |
| 24 | The Development and Experimental Evaluation of A Multilingual Speech Corpus for Low-Resource Turkic Languages. Highlight: This article presents the development and experimental evaluation of a speech corpus focused on Turkic languages, intended for use in speech synthesis and automatic translation tasks. | Aidana Karibayeva et al. | Applied Sciences | 2025-12-05 |
| 25 | Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture. Highlight: This research presents a novel approach to enhancing automatic speech recognition systems by integrating noise detection capabilities directly into the recognition architecture. | Karamvir Singh | arxiv-cs.SD | 2025-12-02 |
| 26 | Swivuriso: The South African Next Voices Multilingual Speech Dataset. Highlight: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. | Vukosi Marivatee et al. | arxiv-cs.CL | 2025-12-01 |
| 27 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models Without Back-Propagation. Highlight: We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients via forward passes. | Yuezhang Peng et al. | arxiv-cs.MM | 2025-11-30 |
| 28 | Benchmarking Automatic Speech Recognition Models for African Languages. Highlight: Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. | Alvin Nahabwe et al. | arxiv-cs.CL | 2025-11-30 |
| 29 | ASR Under The Stethoscope: Evaluating Biases in Clinical Speech Recognition Across Indian Languages. Highlight: In this study, we conduct the first systematic audit of ASR performance on real-world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech-to-text, Gemma3n, Omnilingual, Vaani, and Gemini. | Subham Kumar et al. | arxiv-cs.CL | 2025-11-30 |
| 30 | Deep Learning Techniques for Hindi Automatic Speech Recognition: A Comprehensive Survey. Highlight: This study examines multiple models on publicly available speech datasets to evaluate their performance for practical implementation. | Hetal Gaudani; Narendra M Patel | International Journal of Latest Technology in Engineering … | 2025-11-28 |
| 31 | HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding. Highlight: To this end, we introduce the Human-level Perception in Spoken Speech Understanding (HPSU), a new benchmark for fully evaluating the human-level perceptual and understanding capabilities of Speech LLMs. | Chen Li et al. | arxiv-cs.SD | 2025-11-28 |
| 32 | Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on The Loquacious Dataset. Highlight: The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model, and pronunciation lexica, with open and public access. | Nick Rossenbach et al. | arxiv-cs.CL | 2025-11-27 |
| 33 | ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers Using Phonetic Features. Highlight: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. | Ye Bhone Lin; Thura Aung; Ye Kyaw Thu; Thazin Myint Oo | arxiv-cs.CL | 2025-11-26 |
| 34 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications. Highlight: We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. | Jionghao Han et al. | arxiv-cs.SD | 2025-11-25 |
| 35 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties. Highlight: We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. | Bashar Talafha; Amin Abu Alhassan; Muhammad Abdul-Mageed | arxiv-cs.CL | 2025-11-24 |
| 36 | Smart Voice Assistant Using Machine Learning and Deep Learning. Highlight: This project focuses on understanding the architecture, working principles, applications, and challenges of Smart Voice Assistants. | Gorelal Verma; Deepesh Dewangan | International Journal of Scientific Research in Engineering … | 2025-11-19 |
| 37 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR. Highlight: We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General Dialogue, Call Center, Named Entities, and Hallucination Robustness. | Gabrial Zencha Ashungafac; Mardhiyah Sanni; Busayo Awobade; Alex Gichamba; Tobi Olatunji | arxiv-cs.CL | 2025-11-18 |
| 38 | Toward Conversational Hungarian Speech Recognition: Introducing The BEA-Large and BEA-Dialogue Datasets. Highlight: The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages. | Máté Gedeon et al. | arxiv-cs.CL | 2025-11-17 |
| 39 | Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition. Highlight: This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. | Yiming Rong et al. | arxiv-cs.CL | 2025-11-14 |
| 40 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages. Highlight: Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. | Omnilingual ASR Team et al. | arxiv-cs.CL | 2025-11-12 |
| 41 | CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition. Highlight: In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. | Hung-Yang Sung et al. | arxiv-cs.CL | 2025-11-10 |
| 42 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis. Highlight: Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. | Zhisheng Zhang et al. | arxiv-cs.SD | 2025-11-10 |
| 43 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment. Highlight: Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. | Yan Gao et al. | arxiv-cs.CL | 2025-11-09 |
| 44 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis. Highlight: Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. | Zhisheng Zhang et al. | nips | 2025-11-07 |
| 45 | MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition. Highlight: However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. | Umberto Cappellazzo et al. | nips | 2025-11-07 |
| 46 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese. Highlight: We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. | Dazhong Chen et al. | arxiv-cs.CL | 2025-11-06 |
| 47 | WST: Weakly Supervised Transducer for Automatic Speech Recognition. Highlight: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. | Dongji Gao et al. | arxiv-cs.CL | 2025-11-05 |
| 48 | Intelligent Navigation Assistant for Campuses Using Speech Recognition, NLP And A* Algorithm. Highlight: Campus navigation often produces challenges for students, staff, and visitors, especially in large, complex campuses. This research presents an intelligent navigation assistant that combines automatic speech recognition (ASR), Natural Language Processing (NLP), and the A* pathfinding algorithm to tackle these issues through voice interaction. | Kirti A. Satpute; Mrunal Wakadkar; Shruti Nimbalkar; Sneha Shelar; Prof. Kalpana Sonval | International Journal of Scientific Research in Engineering … | 2025-11-04 |
| 49 | Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders. Highlight: In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. | Weiqiao Shan et al. | emnlp | 2025-11-02 |
| 50 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction. Highlight: We apply LLMs to ASR error correction in three paradigms. | Victor Junqiu Wei; Weicheng Wang; Di Jiang; Yuanfeng Song; Lu Wang | emnlp | 2025-11-02 |
| 51 | Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies. Highlight: To investigate this, we compare four strategies: (a) *normative* models trained on typical speech (no personalization), (b) *idiosyncratic* models completely personalized to individuals, (c) *dysarthric-normative* models trained on other dysarthric speakers, and (d) *dysarthric-idiosyncratic* models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than the idiosyncratic approach while requiring less than half as much personalized data (36. | Vishnu Raja; Adithya V Ganesan; Anand Syamkumar; Ritwik Banerjee; H. Schwartz | emnlp | 2025-11-02 |
| 52 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition. Highlight: This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. | Tianduo Wang; Lu Xu; Wei Lu; Shanbo Cheng | emnlp | 2025-11-02 |
| 53 | Dynamic Model-Bank Test-Time Adaptation for Automatic Speech Recognition. Highlight: To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. | Yanshuo Wang et al. | emnlp | 2025-11-02 |
| 54 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation. Highlight: We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. | Keisuke Kamahori; Jungo Kasai; Noriyuki Kojima; Baris Kasikci | emnlp | 2025-11-02 |
| 55 | Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations. Highlight: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. | Linyang He; Qiaolin Wang; Xilin Jiang; Nima Mesgarani | emnlp | 2025-11-02 |
| 56 | Visual-Aware Speech Recognition for Noisy Scenarios. Highlight: However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. | Balaji Darur; Karan Singla | emnlp | 2025-11-02 |
| 57 | CoVoGER: A Multilingual Multitask Benchmark for Speech-to-text Generative Error Correction with Large Language Models. Highlight: We introduce CoVoGER, a benchmark for GER that covers both ASR and speech-to-text translation (ST) across 15 languages and 28 language pairs. | Zhengdong Yang; Zhen Wan; Sheng Li; Chao-Han Huck Yang; Chenhui Chu | emnlp | 2025-11-02 |
| 58 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. Highlight: In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. | Chin-Jou Li et al. | arxiv-cs.CL | 2025-10-28 |
| 59 | BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation. Highlight: We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. | Raphaël Bagat; Irina Illina; Emmanuel Vincent | arxiv-cs.CL | 2025-10-28 |
| 60 | Are ASR Foundation Models Generalized Enough to Capture Features of Regional Dialects for Low-resource Languages? Highlight: To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. | Tawsif Tashwar Dipto et al. | arxiv-cs.CL | 2025-10-27 |
| 61 | Arabic Little STT: Arabic Children Speech Recognition Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the absence of child-specificspeech corpora is an essential gap that poses significant challenges. Toaddress this gap, we present our created dataset, Arabic Little STT, a datasetof Levantine Arabic child speech recorded in classrooms, containing 355utterances from 288 children (ages 6 – 13). |
Mouhand Alkadri; Dania Desouki; Khloud Al Jallad; | arxiv-cs.CL | 2025-10-27 |
| 62 | LRW-Persian: Lip-reading in The Wild Dataset for Persian Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce LRW-Persian, the largest in-the-wild Persianword-level lipreading dataset, comprising $743$ target words and over$414{,}000$ video samples extracted from more than $1{,}900$ hours of footageacross $67$ television programs. |
Zahra Taghizadeh; Mohammad Shahverdikondori; Arian Noori; Alireza Dadgarnia; | arxiv-cs.CV | 2025-10-26 |
| 63 | A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using The Pacific Northwest English Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. |
Michael Scott; Siyu Liang; Alicia Wassink; Gina-Anne Levow; | arxiv-cs.CL | 2025-10-25 |
| 64 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. |
Xin Zhang; Lin Li; Xiangni Lu; Jianquan Liu; Kong Aik Lee; | arxiv-cs.SD | 2025-10-23 |
| 65 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. |
Yuu Jinnai; | arxiv-cs.CL | 2025-10-22 |
| 66 | MLMA: Towards Multilingual ASR With Mamba-based Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture (an efficient state-space model optimized for long-context sequence processing) for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. |
Mohamed Nabih Ali; Daniele Falavigna; Alessio Brutti; | arxiv-cs.CL | 2025-10-21 |
| 67 | VALLR: Visual ASR Language Model for Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. |
Marshall Thomas; Edward Fish; Richard Bowden; | iccv | 2025-10-20 |
| 68 | Probing The Hidden Talent of ASR Foundation Models for L2 English Oral Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). |
Fu-An Chao; Bi-Cheng Yan; Berlin Chen; | arxiv-cs.CL | 2025-10-18 |
| 69 | Hallucination Benchmark for Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Consequently, there is a critical need for new evaluation frameworks that can effectively identify and assess models with a heightened propensity for generating hallucinated content. |
Alkis Koudounas; Moreno La Quatra; Manuel Giollo; Sabato Marco Siniscalchi; Elena Baralis; | arxiv-cs.CL | 2025-10-18 |
| 70 | Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a governance-centered ASR lifecycle as an emergent interdisciplinary framework for responsible ASR development and offer implications for researchers, practitioners, and policymakers seeking to address language marginalization in speech AI systems. |
JAY L. CUNNINGHAM et. al. | Proceedings of the AAAI/ACM Conference on AI, Ethics, and … | 2025-10-15 |
| 71 | End-to-end Speech Recognition with Similar Length Speech and Text Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on speech recognition in those cases where the length of speech aligns closely with that of the corresponding text. |
Peng Fan; Wenping Wang; Fei Deng; | arxiv-cs.CL | 2025-10-12 |
| 72 | End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. |
Nam Luu; Ondřej Bojar; | arxiv-cs.CL | 2025-10-11 |
| 73 | Accent-Invariant Automatic Speech Recognition Via Saliency-Driven Spectrogram Masking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. |
Mohammad Hossein Sameti; Sepehr Harfi Moridani; Ali Zarean; Hossein Sameti; | arxiv-cs.CL | 2025-10-10 |
| 74 | Voice-Enabled Local Language Translator Using Generative AI Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a voice-enabled local language translator powered by Generative AI, integrating Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), generative translation, and Text-to-Speech (TTS) synthesis for accurate, context-aware output. |
Dr. V. Shanmugapriya; J N Pravanthika; | International Scientific Journal of Engineering and … | 2025-10-10 |
| 75 | Speech Recognition and Synthesis Models and Platforms for The Kazakh Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. |
Aidana Karibayeva; Vladislav Karyukin; Balzhan Abduali; Dina Amirova; | Information | 2025-10-10 |
| 76 | Serial-Parallel Dual-Path Architecture for Speaking Style Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. |
GUOJIAN LI et. al. | arxiv-cs.SD | 2025-10-09 |
| 77 | Linguistically Informed Tokenization Improves ASR for Underresourced Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. |
Massimo Daul; Alessio Tosolini; Claire Bowern; | arxiv-cs.CL | 2025-10-07 |
| 78 | How I Built ASR for Endangered Languages with A Spoken Dictionary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. |
Christopher Bartley; Anton Ragni; | arxiv-cs.CL | 2025-10-06 |
| 79 | Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling Than CoT Prompting? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. |
Oriol Pareras; Gerard I. Gállego; Federico Costa; Cristina España-Bonet; Javier Hernando; | arxiv-cs.CL | 2025-10-03 |
| 80 | EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present EvolveCaptions, a real-time, collaborative ASR adaptation system that supports in-situ personalization with minimal effort. Hearing participants correct ASR errors during live conversations. |
Liang-Yuan Wu; Dhruv Jain; | arxiv-cs.HC | 2025-10-02 |
| 81 | Backdoor Attacks Against Speech Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present the first systematic study of audio backdoor attacks against speech language models. |
Alexandrine Fortier; Thomas Thebaud; Jesús Villalba; Najim Dehak; Patrick Cardinal; | arxiv-cs.CL | 2025-10-01 |
| 82 | ASR Under Noise: Exploring Robustness for Sundanese and Javanese Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. |
Salsabila Zahirah Pranida; Muhammad Cendekia Airlangga; Rifo Ahmad Genadi; Shady Shehata; | arxiv-cs.CL | 2025-09-30 |
| 83 | Confidence-Guided Error Correction for Disordered Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate the use of large language models (LLMs) as post-processing modules for automatic speech recognition (ASR), focusing on their ability to perform error correction for disordered speech. |
Abner Hernandez; Tomás Arias Vergara; Andreas Maier; Paula Andrea Pérez-Toro; | arxiv-cs.CL | 2025-09-29 |
| 84 | HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. |
Gio Paik; Yongbeom Kim; Soungmin Lee; Sangmin Ahn; Chanwoo Kim; | arxiv-cs.CL | 2025-09-29 |
| 85 | MeanFlowSE: One-Step Generative Speech Enhancement Via MeanFlow Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative … |
YIKE ZHU et. al. | arxiv-cs.SD | 2025-09-27 |
| 86 | Lightweight Front-end Enhancement for Robust ASR Via Frame Resampling and Sub-Band Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes optimizations to reduce SE computational costs without compromising ASR performance. |
Siyi Zhao; Wei Wang; Yanmin Qian; | arxiv-cs.SD | 2025-09-25 |
| 87 | Align2Speak: Improving TTS for Low Resource Languages Via ASR-Guided Online Preference Optimization Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech … |
SHEHZEEN HUSSAIN et. al. | arxiv-cs.AI | 2025-09-25 |
| 88 | SloPalSpeech: A 2,8000-Hour Slovak Speech Corpus from Parliamentary Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. |
Erik Božík; Marek Šuppa; | arxiv-cs.CL | 2025-09-23 |
| 89 | M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this challenge, in our previous work, we introduced two auxiliary tasks, namely, ASR error detection and ASR error correction, and we proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. |
JIAJUN HE et. al. | arxiv-cs.HC | 2025-09-23 |
| 90 | Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. |
Vishnu Raja; Adithya V Ganesan; Anand Syamkumar; Ritwik Banerjee; H Andrew Schwartz; | arxiv-cs.SD | 2025-09-20 |
| 91 | MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. |
JIALONG MAI et. al. | arxiv-cs.SD | 2025-09-19 |
| 92 | Thinking in Cocktail Party: Chain-of-Thought and Reinforcement Learning for Target Speaker Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent … |
Yiru Zhang; Hang Su; Lichun Fan; Zhenbo Luo; Jian Luan; | arxiv-cs.SD | 2025-09-19 |
| 93 | Speech Language Models for Under-Represented Languages: Insights from Wolof Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. |
Yaya Sy; Dioula Doucouré; Christophe Cerisara; Irina Illina; | arxiv-cs.CL | 2025-09-18 |
| 94 | Frustratingly Easy Data Augmentation for Low-Resource ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). |
Katsumi Ibaraki; David Chiang; | arxiv-cs.CL | 2025-09-18 |
| 95 | HARNESS: Lightweight Distilled Arabic Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture Arabic speech nuances. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2025-09-18 |
| 96 | Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). |
MONICA SEKOYAN et. al. | arxiv-cs.CL | 2025-09-17 |
| 97 | Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes our efforts to develop the methodology and tools for human-subject experiments with MCV. |
JAN JANAK et. al. | arxiv-cs.NI | 2025-09-17 |
| 98 | PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. |
LI FU et. al. | arxiv-cs.CL | 2025-09-16 |
| 99 | Fun-ASR Technical Report Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. |
KEYU AN et. al. | arxiv-cs.CL | 2025-09-15 |
| 100 | Prominence-aware Automatic Speech Recognition for Conversational Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. |
Julian Linke; Barbara Schuppler; | arxiv-cs.CL | 2025-09-12 |
| 101 | WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. |
Akshat Pandey; Karun Kumar; Raphael Tang; | arxiv-cs.CL | 2025-09-12 |
| 102 | Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a beamforming approach that processes specific angular sectors based on their spherical polar coordinates before applying an end-to-end multichannel, multi-speaker ASR system. |
Can Cui; Paul Magron; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.SD | 2025-09-12 |
| 103 | TSPC: A Two-Stage Phoneme-Centric Architecture for Code-switching Vietnamese-English Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). |
Minh N. H. Nguyen; Anh Nguyen Tran; Dung Truong Dinh; Nam Van Vo; | arxiv-cs.SD | 2025-09-07 |
| 104 | Contextualized Token Discrimination for Speech Search Query Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: With the growing popularity of speech search driven by Automated Speech Recognition (ASR) systems, this paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction. |
JUNYU LU et. al. | arxiv-cs.SD | 2025-09-04 |
| 105 | PARCO: Phoneme-Augmented Robust Contextual ASR Via Contrastive Entity Disambiguation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture … |
Jiajun He; Naoki Sawada; Koichi Miyazaki; Tomoki Toda; | arxiv-cs.CL | 2025-09-04 |
| 106 | WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building a large-scale speech corpus with multi-dimensional annotation tailored for speech understanding and generation. |
LONGHAO LI et. al. | arxiv-cs.SD | 2025-09-04 |
| 107 | LatPhon: Lightweight Multilingual G2P for Romance Languages and English Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, … |
Luis Felipe Chary; Miguel Arjona Ramirez; | arxiv-cs.CL | 2025-09-03 |
| 108 | SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking Under Domain Shift in ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. |
Pu Wang; Shinji Watanabe; Hugo Van hamme; | arxiv-cs.CL | 2025-09-02 |
| 109 | CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-Car Speech Separation with Distributed Heterogeneous Arrays Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2025-09-01 |
| 110 | Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). |
HAO SHI et. al. | arxiv-cs.CL | 2025-08-31 |
| 111 | A Unified Denoising and Adaptation Framework for Self-Supervised Bengali Dialectal ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This challenge is compounded by two persistent and intertwined factors: the language’s vast dialectal diversity and the prevalence of acoustic noise in real-world environments. While state-of-the-art self-supervised learning (SSL) models have advanced ASR for low-resource languages, they often lack explicit mechanisms to handle environmental noise during pre-training or specialized adaptation strategies for the complex phonetic and lexical variations across Bengali dialects. This paper introduces a novel, unified framework designed to address these dual challenges simultaneously. |
Swadhin Biswas; Tuhin Sheikh; | arxiv-cs.SD | 2025-08-31 |
| 112 | Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. |
Jeong Hun Yeo; Hyeongseop Rha; Sungjune Park; Junil Won; Yong Man Ro; | arxiv-cs.CV | 2025-08-28 |
| 113 | Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD Database Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Dysarthric speech recognition faces challenges from severity variations and disparities relative to normal speech. Conventional approaches individually fine-tune ASR models … |
Qing Xiao; Yingshan Peng; PeiPei Zhang; | arxiv-cs.SD | 2025-08-26 |
| 114 | Talking to Robots: A Practical Examination of Speech Foundation Models for HRI Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. |
Theresa Pekarek Rosin; Julia Gachot; Henri-Leon Kordt; Matthias Kerzel; Stefan Wermter; | arxiv-cs.RO | 2025-08-25 |
| 115 | Whisper Based Cross-Lingual Phoneme Recognition Between Vietnamese and English Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. |
Nguyen Huu Nhat Minh; Tran Nguyen Anh; Truong Dinh Dung; Vo Van Nam; Le Pham Tuyen; | arxiv-cs.CL | 2025-08-22 |
| 116 | Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. |
ANYU YING et. al. | arxiv-cs.LG | 2025-08-22 |
| 117 | H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically as the number of hotwords increases. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing the hotword post-recall rate (PRR). |
HUANGYU DAI et. al. | arxiv-cs.SD | 2025-08-22 |
| 118 | UniCoM: A Universal Code-Switching Speech Generator Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. |
Sangmin Lee; Woojin Chung; Seyun Um; Hong-Goo Kang; | arxiv-cs.CL | 2025-08-21 |
| 119 | Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using A Speech Error Database Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This analysis demonstrates the database’s effectiveness as a diagnostic tool for ASR system performance. |
John Alderete; Macarious Kin Fung Hui; Aanchan Mohan; | arxiv-cs.CL | 2025-08-18 |
| 120 | Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain the system’s real-time factor. |
ZEESHAN AHMED et. al. | arxiv-cs.CL | 2025-08-18 |
| 121 | Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. |
Duygu Altinok; | arxiv-cs.CL | 2025-08-18 |
| 122 | Beyond Traditional Speech Modifications: Utilizing Self Supervised Features for Enhanced Zero-Shot Children ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Zero-shot automatic speech recognition (ASR) for children is challenging due to pronounced acoustic and linguistic mismatches, speaker variability and limited annotated data. … |
Abhijit Sinha; H. Kathania; Mikko Kurimo; | Interspeech 2025 | 2025-08-17 |
| 123 | Analysis of Domain Shift Across ASR Architectures Via TTS-Enabled Separation of Target Domain and Acoustic Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. |
Tina Raissi; Nick Rossenbach; Ralf Schlüter; | arxiv-cs.SD | 2025-08-13 |
| 124 | A Comparative Analysis on ASR System Combination for Attention, CTC, Factored Hybrid, and Transducer Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we compare model combination across popular ASR architectures. |
NOURELDIN BAYOUMI et. al. | arxiv-cs.SD | 2025-08-13 |
| 125 | Assessing The Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. |
Abdul Rehman Antall; Naveed Akhtar; | arxiv-cs.CL | 2025-08-13 |
| 126 | Munsit at NADI 2025 Shared Task 2: Pushing The Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. |
Mahmoud Salhab; Shameed Sait; Mohammad Abusheikh; Hasan Abusheikh; | arxiv-cs.CL | 2025-08-12 |
| 127 | Revealing The Role of Audio Channels in ASR Performance Degradation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input … |
Kuan-Tang Huang; Li-Wei Chen; Hung-Shin Lee; Berlin Chen; Hsin-Min Wang; | arxiv-cs.SD | 2025-08-12 |
| 128 | Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness in dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). |
Ahmed Aboeitta; Ahmed Sharshar; Youssef Nafea; Shady Shehata; | arxiv-cs.SD | 2025-08-11 |
| 129 | A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To further optimize AEC’s downstream applications, we introduce a novel post-processing strategy employing tailored parameters designed specifically for tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR), thus enhancing their overall efficacy. |
Yiheng Jiang; Tian Biao; | arxiv-cs.SD | 2025-08-10 |
| 130 | Fairness of Automatic Speech Recognition: Looking Through A Philosophical Lens Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation: it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. |
Anna Seo Gyeong Choi; Hoon Choi; | arxiv-cs.CL | 2025-08-09 |
| 131 | Improved Dysarthric Speech to Text Conversion Via TTS Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. |
PÉTER MIHAJLIK et. al. | arxiv-cs.SD | 2025-08-08 |
| 132 | SPGISpeech 2.0: Transcribed Multi-speaker Financial Audio for Speaker-tagged Transcription Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. |
RAYMOND GROSSMAN et. al. | arxiv-cs.SD | 2025-08-07 |
| 133 | Pitch Accent Detection Improves Pretrained Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We show the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. |
David Sasu; Natalie Schluter; | arxiv-cs.CL | 2025-08-06 |
| 134 | What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to recognize information in user speech related to response generation appropriately. Examining selective listening of humans, which refers to the ability to focus on and listen to important parts of a conversation during the speech, will enable us to identify the ASR capabilities required for SDSs and evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions for generating dialogue responses and reference transcriptions. |
KIYOTADA MORI et. al. | arxiv-cs.CL | 2025-08-06 |
| 135 | NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. |
HUAN LIAO et. al. | arxiv-cs.SD | 2025-08-06 |
| 136 | Efficient Scaling for LLM-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through comprehensive and controlled experiments, we find that pretraining the speech encoder before integrating it with the LLM leads to significantly better scaling efficiency than the standard practice of joint post-training of LLM-ASR. Based on this insight, we propose a new multi-stage LLM-ASR training strategy, EFIN: Encoder First Integration. |
Bingshen Mu; Yiwen Shao; Kun Wei; Dong Yu; Lei Xie; | arxiv-cs.SD | 2025-08-06 |
| 137 | Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. |
DMITRII KORZH et. al. | arxiv-cs.CV | 2025-08-05 |
| 138 | RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented … |
Pengcheng Wang; Sheng Li; Takahiro Shinozaki; | ArXiv | 2025-08-05 |
| 139 | Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). |
HOJUN JIN et. al. | arxiv-cs.CL | 2025-08-05 |
| 140 | Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. |
Bingshen Mu; Hexin Liu; Hongfei Xue; Kun Wei; Lei Xie; | arxiv-cs.SD | 2025-08-01 |
| 141 | The Interspeech 2025 Speech Accessibility Project Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Hosted on EvalAI and leveraging the remote evaluation pipeline, the SAP Challenge evaluates submissions based on Word Error Rate and Semantic Score. |
XIUWEN ZHENG et. al. | arxiv-cs.AI | 2025-07-29 |
| 142 | A Deep Learning Automatic Speech Recognition Model for Shona Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. |
Leslie Wellington Sirora; Mainford Mutandavari; | arxiv-cs.CL | 2025-07-28 |
| 143 | Self-Improvement for Audio Large Language Model Using Unlabeled Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a self-improvement method called SI-SDA, leveraging the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then perform domain adaptation based on reinforcement learning optimization. |
Shaowen Wang; Xinyuan Chen; Yao Xu; | arxiv-cs.SD | 2025-07-27 |
| 144 | Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. |
Hsuan-Yu Wang; Pei-Ying Lee; Berlin Chen; | arxiv-cs.CL | 2025-07-25 |
| 145 | HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. |
XUCHEN WEI et. al. | arxiv-cs.CL | 2025-07-25 |
| 146 | The Eloquence Team Submission for Task 1 of MLC-SLM Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present our studies and experiments carried out for Task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language model architectures. |
Lorenzo Concina; Jordi Luque; Alessio Brutti; Marco Matassoni; Yuchen Zhang; | arxiv-cs.SD | 2025-07-25 |
| 147 | MLLM-based Speech Recognition: When and How Is Multimodality Beneficial? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. |
Yiwen Guan; Viet Anh Trinh; Vivek Voleti; Jacob Whitehill; | arxiv-cs.SD | 2025-07-25 |
| 148 | Phoneme-Level Visual Speech Recognition Via Point-Visual Fusion and Language Model Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction to address these challenges. |
Matthew Kit Khinn Teng; Haibo Zhang; Takeshi Saitoh; | arxiv-cs.CV | 2025-07-24 |
| 149 | BoSS: Beyond-Semantic Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. |
QING WANG et. al. | arxiv-cs.SD | 2025-07-23 |
| 150 | Triple X: A LLM-Based Multilingual Speech Recognition System for The INTERSPEECH2025 MLC-SLM Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. |
Miaomiao Gao; Xiaoxiao Xiang; Yiwen Guo; | arxiv-cs.CL | 2025-07-23 |
| 151 | The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the TEA-ASLP’s system submitted to the MLC-SLM 2025 Challenge, addressing multilingual conversational automatic speech recognition (ASR) in Task I and speech diarization ASR in Task II. |
Hongfei Xue; Kaixun Huang; Zhikai Zhou; Shen Huang; Shidong Shang; | arxiv-cs.SD | 2025-07-23 |
| 152 | MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce MISP-Meeting, a new real-world, multimodal dataset that covers subject-oriented long-form content. |
Hang Chen; Chao-Han Huck Yang; Jia-Chen Gu; Sabato Marco Siniscalchi; Jun Du; | acl | 2025-07-21 |
| 153 | MultiMed: Multilingual Medical Speech Recognition Via Attention Encoder Decoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. |
KHAI LE-DUC et. al. | acl | 2025-07-21 |
| 154 | GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. |
YIFAN YANG et. al. | acl | 2025-07-21 |
| 155 | Dialectal Coverage And Generalization in Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. |
Amirbek Djanibekov; Hawau Olamide Toyin; Raghad Alshalan; Abdullah Alatir; Hanan Aldarmaki; | acl | 2025-07-21 |
| 156 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. |
JENALEA RAJAB et. al. | acl | 2025-07-21 |
| 157 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. |
JIAMING ZHOU et. al. | acl | 2025-07-21 |
| 158 | DNCASR: End-to-End Training for Speaker-Attributed ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. |
Xianrui Zheng; Chao Zhang; Phil Woodland; | acl | 2025-07-21 |
| 159 | That Doesn’t Sound Right: Evaluating Speech Transcription Quality in Field Linguistics Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore methods for identifying speech transcriptions in fieldwork data that may be unsuitable for training ASR models. |
Eric Le Ferrand; Bo Jiang; Joshua Hartshorne; Emily Prud’hommeaux; | acl | 2025-07-21 |
| 160 | Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. |
MINGFEI LAU et. al. | acl | 2025-07-21 |
| 161 | Weak Supervision Techniques Towards Enhanced ASR Models in Industry-level CRM Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this process faces the challenge of discerning customer voices and intentions, and general pre-trained automatic speech recognition (ASR) models make it difficult to effectively address industry-specific speech recognition tasks. To address this issue, we innovatively proposed a solution for fine-tuning industry-specific ASR models, which significantly improved the performance of the fine-tuned ASR models in industry applications. |
ZHONGSHENG WANG et. al. | arxiv-cs.SD | 2025-07-19 |
| 162 | Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. |
CARLOS MENA et. al. | arxiv-cs.CL | 2025-07-18 |
| 163 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. |
Lilit Grigoryan; Nikolay Karpov; Enas Albasiri; Vitaly Lavrukhin; Boris Ginsburg; | arxiv-cs.CL | 2025-07-18 |
| 164 | Reading Between The Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). |
FENG CHEN et. al. | arxiv-cs.CL | 2025-07-17 |
| 165 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. |
Maksim Borisov; Egor Spirin; Daria Diatlova; | arxiv-cs.LG | 2025-07-17 |
| 166 | Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). |
Mengzhe Geng; Patrick Littell; Aidan Pine; Marc Tessier; Roland Kuhn; | arxiv-cs.SD | 2025-07-14 |
| 167 | DQLoRA: A Lightweight Domain-Aware Denoising ASR Via Adapter-guided Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a demo of DQLoRA, an Adapter-Guided Distillation framework for robust speech recognition under low-resource and noisy conditions. |
Yiru Yang; | arxiv-cs.SD | 2025-07-14 |
| 168 | ILT-Iterative LoRA Training Through Focus-Feedback-Fix for Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. |
Qingliang Meng; Hao Wu; Wei Liang; Wei Xu; Qing Zhao; | arxiv-cs.CL | 2025-07-11 |
| 169 | The Impact of Automatic Speech Transcription on Speaker Attribution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. |
Cristina Aggazzotti; Matthew Wiesner; Elizabeth Allyn Smith; Nicholas Andrews; | arxiv-cs.CL | 2025-07-11 |
| 170 | Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by a growing research interest in automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. |
Maha Tufail Agro; Atharva Kulkarni; Karima Kadaoui; Zeerak Talat; Hanan Aldarmaki; | arxiv-cs.CL | 2025-07-10 |
| 171 | Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. |
CHEN FENG et. al. | arxiv-cs.SD | 2025-07-10 |
| 172 | How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we compare different performance and bias measures, both from the literature and newly proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. |
Tanvina Patel; Wiebke Hutiri; Aaron Yi Ding; Odette Scharenborg; | arxiv-cs.CL | 2025-07-08 |
| 173 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. |
Zijin Gu; Tatiana Likhomanenko; Navdeep Jaitly; | arxiv-cs.CL | 2025-07-08 |
| 174 | Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets. |
MARK ATTA MENSAH et. al. | arxiv-cs.CL | 2025-07-03 |
| 175 | A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly low-resource languages. |
SUMAYA AHMED SALIHS et. al. | arxiv-cs.CL | 2025-07-03 |
| 176 | Mind The Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. |
Duygu Altinok; | arxiv-cs.CL | 2025-06-28 |
| 177 | A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic … |
Phurich Saengthong; Boonnithi Jiaramaneepinit; Sheng Li; Manabu Okumura; Takahiro Shinozaki; | ArXiv | 2025-06-25 |
| 178 | OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. |
WILLIAM CHEN et. al. | icml | 2025-06-25 |
| 179 | MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. |
Sungnyun Kim; Kangwook Jang; Sangmin Bae; Sungwoo Cho; Se-Young Yun; | icml | 2025-06-25 |
| 180 | Our Collective Voices: The Social and Technical Values of A Grassroots Chinese Stuttered Speech Dataset Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The lack of authentic stuttered speech data has significantly limited the development of stuttering friendly automatic speech recognition (ASR) models. In previous work, we … |
Jingjin Li; Qisheng Li; Rong Gong; Lezhi Wang; Shaomei Wu; | Proceedings of the 2025 ACM Conference on Fairness, … | 2025-06-23 |
| 181 | Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition (ASR) systems. Despite recent … |
Niclas Pokel; Pehuen Moure; Roman Böhringer; Yingqiang Gao; | ArXiv | 2025-06-23 |
| 182 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. |
Daejin Jo; Jeeyoung Yun; Byungseok Roh; Sungwoong Kim; | arxiv-cs.CL | 2025-06-20 |
| 183 | Breaking The Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages with control of training data duration. |
Siyu Liang; Gina-Anne Levow; | arxiv-cs.CL | 2025-06-20 |
| 184 | Automatic Speech Recognition Biases in Newcastle English: An Error Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. |
Dana Serditova; Kevin Tang; Jochen Steffens; | arxiv-cs.CL | 2025-06-19 |
| 185 | Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces the integration of language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). |
Yizhou Peng; Hexin Liu; Eng Siong Chng; | arxiv-cs.CL | 2025-06-16 |
| 186 | NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. |
YIZHOU PENG et. al. | arxiv-cs.CL | 2025-06-16 |
| 187 | Seewo’s Submission to MLC-SLM: Lessons Learned from Speech Reasoning Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. |
Bo Li; Chengben Xu; Wufeng Zhang; | arxiv-cs.CL | 2025-06-16 |
| 188 | SC-SOT: Conditioning The Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Speaker-Conditioned Serialized Output Training (SC-SOT), an enhanced SOT-based training for E2E multi-talker ASR. |
Yuta Hirano; Sakriani Sakti; | arxiv-cs.SD | 2025-06-14 |
| 189 | Enabling Automatic Transcription of Child-centered Audio Recordings from Real-world Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present an approach to automatically detect those utterances in longform audio that can be reliably transcribed with modern ASR systems, allowing automatic and relatively accurate transcription of a notable proportion of all speech in typical longform data. |
Daniil Kocharov; Okko Räsänen; | arxiv-cs.SD | 2025-06-13 |
| 190 | (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of A Phonetically Balanced Speech Test Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce the Simulated Phoneme Speech Test (SimPhon Speech Test) methodology, a novel, multi-stage computational pipeline for the in silico design and validation of a phonetically balanced minimal-pair speech test. |
Stefan Bleeck; | arxiv-cs.SD | 2025-06-13 |
| 191 | Adapting Whisper for Streaming Speech Recognition Via Two-Pass Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. |
HAORAN ZHOU et. al. | arxiv-cs.SD | 2025-06-13 |
| 192 | Improving Named Entity Transcription with Contextual LLM-based Revision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. |
Viet Anh Trinh; Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-06-12 |
| 193 | Efficient Multilingual ASR Finetuning Via LoRA Language Experts Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and … |
JIAHONG LI et. al. | ArXiv | 2025-06-11 |
| 194 | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. |
CHENG-KANG CHOU et. al. | arxiv-cs.CL | 2025-06-10 |
| 195 | Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems’ performance for aphasia speakers. |
Katelyn Xiaoying Mei; Anna Seo Gyeong Choi; Hilke Schellmann; Mona Sloane; Allison Koenecke; | arxiv-cs.CY | 2025-06-10 |
| 196 | SimClass: A Classroom Speech Dataset Generated Via Game Engine Simulation For Automatic Speech Recognition Research Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a scalable methodology for synthesizing classroom noise using game engines, a framework that extends to other domains. |
Ahmed Adel Attia; Jing Liu; Carl Espy-Wilson; | arxiv-cs.SD | 2025-06-10 |
| 197 | Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This mechanism presents several issues, including unnecessary energy consumption, communication latency, and the risk of a single point of failure. In this position paper, we propose a smart home concept based on offline speech recognition and IoT technology: 1) integrating offline keyword spotting (KWS) technologies into household appliances with limited resource hardware to enable them to understand user voice commands; 2) designing a local IoT network with decentralized architecture to manage and connect various devices, enhancing the robustness and scalability of the system. |
PENG HUANG et. al. | arxiv-cs.SD | 2025-06-09 |
| 198 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. |
Rui Hu; Xiaolong Lin; Jiawang Liu; Shixi Huang; Zhenpeng Zhan; | arxiv-cs.CL | 2025-06-09 |
| 199 | Technical Report: A Practical Guide to Kaldi ASR Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This technical report introduces innovative optimizations for Kaldi-based Automatic Speech Recognition (ASR) systems, focusing on acoustic model enhancement, hyperparameter tuning, and language model efficiency. |
Mengze Hong; Di Jiang; | arxiv-cs.SD | 2025-06-08 |
| 200 | Speech Recognition on TV Series with Video-guided Post-ASR Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches fail to explicitly leverage the rich temporal and contextual information available in the video. To address this limitation, we propose a Video-Guided Post-ASR Correction (VPC) framework that uses a Video-Large Multimodal Model (VLMM) to capture video context and refine ASR outputs. |
Haoyuan Yang; Yue Zhang; Liqiang Jing; John H. L. Hansen; | arxiv-cs.SD | 2025-06-08 |
| 201 | Automatic Speech Recognition of African American English: Lexical and Contextual Effects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. |
Hamid Mojarad; Kevin Tang; | arxiv-cs.CL | 2025-06-07 |
| 202 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and selects a final system output. |
CHRISTOPHE VAN GYSEL et. al. | arxiv-cs.CL | 2025-06-06 |
| 203 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. |
Thai-Binh Nguyen; Thi Van Nguyen; Quoc Truong Do; Chi Mai Luong; | arxiv-cs.CL | 2025-06-05 |
| 204 | Customizing Speech Recognition Model with Large Language Model Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly the named entities affected by domain mismatch, through feedback from an LLM. |
Shaoshi Ling; Guoli Ye; | arxiv-cs.CL | 2025-06-05 |
| 205 | LLM-based Phoneme-to-grapheme for Phoneme-based Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). |
Te Ma; Min Bi; Saierdaer Yusuyin; Hao Huang; Zhijian Ou; | arxiv-cs.SD | 2025-06-05 |
| 206 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR Via Whisper Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. |
Ömer Tarik Özyilmaz; Matt Coler; Matias Valdenegro-Toro; | arxiv-cs.CL | 2025-06-03 |
| 207 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. |
Verena Blaschke; Miriam Winkler; Constantin Förster; Gabriele Wenger-Glemser; Barbara Plank; | arxiv-cs.CL | 2025-06-03 |
| 208 | Whale: Large-Scale Multilingual ASR Model with W2v-BERT and E-Branchformer with Large Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports on the development of a large-scale speech recognition model, Whale. |
Yosuke Kashiwagi; Hayato Futami; Emiru Tsunoo; Satoshi Asakawa; | arxiv-cs.CL | 2025-06-02 |
| 209 | HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. |
AMIR HUSSEIN et. al. | arxiv-cs.CL | 2025-06-02 |
| 210 | WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data’s vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. |
Yu Nakagome; Michael Hentschel; | arxiv-cs.CL | 2025-06-01 |
| 211 | Causal Structure Discovery for Error Diagnostics of Children’s ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies, such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. |
Vishwanath Pratap Singh; Md. Sahidullah; Tomi Kinnunen; | arxiv-cs.CL | 2025-05-31 |
| 212 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, directly using LLMs can introduce hallucinations, which may lead to the modification of already-correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. |
YANGUI FANG et. al. | arxiv-cs.CL | 2025-05-30 |
| 213 | Pseudo Labels-based Neural Speech Enhancement for The AVSR Task in The MISP-Meeting Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents our system for the MISP-Meeting Challenge Track 2. |
Longjie Luo; Shenghui Lu; Lin Li; Qingyang Hong; | arxiv-cs.SD | 2025-05-30 |
| 214 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). |
Dimitrios Damianos; Georgios Paraskevopoulos; Alexandros Potamianos; | arxiv-cs.CL | 2025-05-30 |
| 215 | Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. |
SUJEET KUMAR et. al. | arxiv-cs.CL | 2025-05-30 |
| 216 | Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. |
Qingzheng Wang; Jiancheng Sun; Yifan Peng; Shinji Watanabe; | arxiv-cs.SD | 2025-05-30 |
| 217 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. |
Griffin Dietz Smith; Dianna Yee; Jennifer King Chen; Leah Findlater; | arxiv-cs.LG | 2025-05-29 |
| 218 | Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. |
YOUJUN CHEN et. al. | arxiv-cs.SD | 2025-05-29 |
| 219 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. |
Zhennan Lin; Kaixun Huang; Wei Ren; Linju Yang; Lei Xie; | arxiv-cs.SD | 2025-05-29 |
| 220 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes an LLM-driven ASR-SED multi-task learning framework that jointly optimizes the ASR and Stuttering Event Detection (SED) tasks. |
Shangkun Huang; Jing Deng; Jintao Kang; Rong Zheng; | arxiv-cs.SD | 2025-05-28 |
| 221 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For the automatic speech recognition (ASR) system, we proposed an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. |
SHANGKUN HUANG et. al. | arxiv-cs.SD | 2025-05-28 |
| 222 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes EThai-ASR, the first to apply large language models (LLMs) to Thai ASR and create an efficient LLM-based ASR system. |
MINGCHEN SHAO et. al. | arxiv-cs.SD | 2025-05-28 |
| 223 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We firmly believe the AISHELL-5 dataset will significantly advance the research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark. |
YUHANG DAI et. al. | arxiv-cs.SD | 2025-05-28 |
| 224 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. |
Stefan Bleeck; | arxiv-cs.SD | 2025-05-28 |
| 225 | Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. |
Titouan Parcollet; Yuan Tseng; Shucong Zhang; Rogier van Dalen; | arxiv-cs.CL | 2025-05-27 |
| 226 | GMU Systems for The IWSLT 2025 Low-Resource Speech Translation Shared Task Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. |
Chutong Meng; Antonios Anastasopoulos; | arxiv-cs.CL | 2025-05-27 |
| 227 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. |
ZHAOQING LI et. al. | arxiv-cs.SD | 2025-05-27 |
| 228 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. |
TIANYI XU et. al. | arxiv-cs.CL | 2025-05-27 |
| 229 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. |
Dancheng Liu; Amir Nassereldine; Chenhui Xu; Jinjun Xiong; | arxiv-cs.CL | 2025-05-26 |
| 230 | In-context Language Learning for Endangered Languages in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). |
Zhaolin Li; Jan Niehues; | arxiv-cs.CL | 2025-05-26 |
| 231 | KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. |
ZHAOLIN LI et. al. | arxiv-cs.CL | 2025-05-26 |
| 232 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. |
Raphaël Bagat; Irina Illina; Emmanuel Vincent; | arxiv-cs.CL | 2025-05-26 |
| 233 | Exploring Generative Error Correction for Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we proposed a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). |
Moreno La Quatra; Alkis Koudounas; Valerio Mario Salerno; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-05-26 |
| 234 | BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion which negatively impacts the final performance. |
Xun Gong; Anqi Lv; Zhiming Wang; Huijia Zhu; Yanmin Qian; | arxiv-cs.SD | 2025-05-25 |
| 235 | Large Language Models Based ASR Error Correction for Child Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. |
ANFENG XU et. al. | arxiv-cs.CL | 2025-05-22 |
| 236 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches (pruning, distillation, layer skip, etc.) transform the large models into smaller ones at the cost of significant performance degradation, or require prolonged training of smaller models for better performance. To address these issues, we introduce an efficacious two-step representation learning based approach capable of producing several small-sized models from a single large model, ensuring considerably better performance in a limited number of epochs. |
Abdul Hannan; Alessio Brutti; Shah Nawaz; Mubashir Noman; | arxiv-cs.CV | 2025-05-22 |
| 237 | Differentiable K-means for Fully-optimized Discrete Token-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. |
Kentaro Onda; Yosuke Kashiwagi; Emiru Tsunoo; Hayato Futami; Shinji Watanabe; | arxiv-cs.SD | 2025-05-22 |
| 238 | Prosodically Enhanced Foreign Accent Simulation By Discrete Token-based Resynthesis Only with Native Speech Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we integrate duration modification to the previous method to simulate foreign accents more accurately. |
Kentaro Onda; Keisuke Imoto; Satoru Fukayama; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2025-05-21 |
| 239 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To overcome these, we propose SIMA, a selective invocation for multilingual ASR that adapts to the difficulty level of the input speech. |
Hongfei Xue; Yufeng Tang; Jun Zhang; Xuelong Geng; Lei Xie; | arxiv-cs.SD | 2025-05-21 |
| 240 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice … |
CHIN-JOU LI et. al. | arxiv-cs.CL | 2025-05-20 |
| 241 | Helium Speech Recognition Method Based on Spectrogram with Deep Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is … |
Yonghong Chen; Shibing Zhang; Dongmei Li; | Big Data Cogn. Comput. | 2025-05-20 |
| 242 | PersonaTAB: Predicting Personality Traits Using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. |
Sho Inoue; Shai Wang; Haizhou Li; | arxiv-cs.SD | 2025-05-20 |
| 243 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. |
TIANTIAN FENG et. al. | arxiv-cs.SD | 2025-05-20 |
| 244 | Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. |
HAOYANG ZHANG et. al. | arxiv-cs.CL | 2025-05-20 |
| 245 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present our submission to the Speech Accessibility Project challenge for dysarthric speech recognition. |
DOMINIK WAGNER et. al. | arxiv-cs.SD | 2025-05-19 |
| 246 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. |
Xinlu He; Jacob Whitehill; | arxiv-cs.CL | 2025-05-16 |
| 247 | ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce the ASR-FAIRBENCH leaderboard which is designed to assess both the accuracy and equity of ASR models in real-time. |
Anand Rai; Satyam Rahangdale; Utkarsh Anand; Animesh Mukherjee; | arxiv-cs.SD | 2025-05-16 |
| 248 | Multi-Stage Speaker Diarization for Noisy Classrooms Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This study investigates the effectiveness of multi-stage diarization models using Nvidia’s NeMo diarization pipeline. |
Ali Sartaz Khan; Tolulope Ogunremi; Ahmed Adel Attia; Dorottya Demszky; | arxiv-cs.SD | 2025-05-16 |
| 249 | LegoSLM: Connecting LLM with Speech Encoder Using CTC Posteriors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. |
Rao Ma; Tongzhou Chen; Kartik Audhkhasi; Bhuvana Ramabhadran; | arxiv-cs.CL | 2025-05-16 |
| 250 | Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, which include data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. |
SUKAIRAJ HAFIZ IMAM et. al. | arxiv-cs.CL | 2025-05-16 |
| 251 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports the construction of the Teochew-Wild, a speech corpus of the Teochew dialect. |
Linrong Pan; Chenglong Jiang; Gaoze Hou; Ying Gao; | arxiv-cs.CL | 2025-05-08 |
| 252 | Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. |
YANG LI et. al. | naacl | 2025-05-04 |
| 253 | Pisets: A Robust Speech Recognition System for Lectures and Interviews Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work presents a speech-to-text system “Pisets” for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. |
IVAN BONDARENKO et. al. | naacl | 2025-05-04 |
| 254 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et. al. | naacl | 2025-05-04 |
| 255 | AMPS: ASR with Multimodal Paraphrase Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a new technique AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. |
Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | naacl | 2025-05-04 |
| 256 | Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Wav2Prompt uses a straightforward training process with only the same data used to train an automatic speech recognition (ASR) model. |
Keqi Deng; Guangzhi Sun; Phil Woodland; | naacl | 2025-05-04 |
| 257 | BERSting at The Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. |
PAIGE TUTTÖSÍ et. al. | arxiv-cs.CL | 2025-04-30 |
| 258 | “It Feels Like We’re Not Meeting The Criteria”: Examining and Mitigating The Cascading Effects of Bias in Automatic Speech Recognition in Spoken Language Interfaces Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Researchers have demonstrated that Automatic Speech Recognition (ASR) systems perform differently across demographic groups (i.e. show bias), yet their downstream impact on spoken … |
Kelechi Ezema; Chelsea Chandler; Rosy Southwell; Niranjan Cholendiran; Sidney D’Mello; | Proceedings of the 2025 CHI Conference on Human Factors in … | 2025-04-25 |
| 259 | Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset with Lip-reading and Presentation Slides Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. |
JINGHUA ZHAO et. al. | arxiv-cs.MM | 2025-04-21 |
| 260 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. |
MAHMOUD SALHAB et. al. | arxiv-cs.AI | 2025-04-16 |
| 261 | Dysarthric Speech Conformer: Adaptation for Sequence-to-Sequence Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a two-phase adaptation pipeline based on the Conformer architecture that leverages typical speech to transfer to individualized ASR models for dysarthric speakers. |
Q. Wang; | icassp | 2025-04-15 |
| 262 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a RAG-like technique for correcting speech recognition entity name errors. |
E. Pusateri; | icassp | 2025-04-15 |
| 263 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
H. Wang; | icassp | 2025-04-15 |
| 264 | ASR Benchmarking: Need for A More Representative Conversational Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |
G. Maheshwari; D. Ivanov; T. Johannet; K. El Haddad; | icassp | 2025-04-15 |
| 265 | Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, a Fast estimator for WER (Fe-WER) is introduced, utilizing average pooling over self-supervised learning representations for speech and text. |
C. Park; C. Lu; M. Chen; T. Hain; | icassp | 2025-04-15 |
| 266 | Advancing Streaming ASR with Chunk-wise Attention and Trans-chunk Selective State Spaces Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores enhancing streaming speech recognition through the integration of chunk-wise attention and selective state space models (SSMs). |
M. Mimura; T. Moriya; K. Matsuura; | icassp | 2025-04-15 |
| 267 | Revise, Reason, and Recognize: LLM-Based Emotion Recognition Via Emotion-Specific Prompts and ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. |
Y. Li; Y. Gong; C. -H. H. Yang; P. Bell; C. Lai; | icassp | 2025-04-15 |
| 268 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
C. -C. WANG et. al. | icassp | 2025-04-15 |
| 269 | Mamba for Streaming ASR Combined with Unimodal Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. |
Y. Fang; X. Li; | icassp | 2025-04-15 |
| 270 | Contextual ASR with Retrieval Augmented Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose leveraging large language models (LLMs) and retrieval-augmented generation (RAG) to enhance the contextual capabilities of ASR systems. |
C. Xiao; Z. Hou; D. Garcia-Romero; K. J. Han; | icassp | 2025-04-15 |
| 271 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose StableQuant, a novel adaptive post-training quantization (PTQ) algorithm for widely used speech foundation models (SFMs). |
Y. Hong; H. Han; W. -J. Chung; H. -G. Kang; | icassp | 2025-04-15 |
| 272 | ATP-TTS: Adaptive Thresholding Pseudo-Labeling for Low-Resource Multi-Speaker Text-to-Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the challenge of high annotation costs in text-to-speech (TTS) generation, this paper introduces a semi-supervised learning framework specifically designed for low-resource TTS scenarios. |
F. Li; S. Chen; H. Yang; S. Yuan; | icassp | 2025-04-15 |
| 273 | Speech Recognition for Automatically Assessing Afrikaans and IsiXhosa Preschool Oral Narratives Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. |
C. JACOBS et. al. | icassp | 2025-04-15 |
| 274 | Improving Dialect Identification in Indian Languages Using Multimodal Features from Dialect Informed ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a novel multimodal architecture that leverages speech and text features to enhance DID performance. |
icassp | 2025-04-15 | |
| 275 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition Using Agnostic Contrastive Mixup Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose a new domain adaptation method for low-resource ASR focused on contrastive mixup for joint-embedding architectures named AC-Mix (agnostic contrastive mixup). |
C. Carvalho; A. Abad; | icassp | 2025-04-15 |
| 276 | Self-Information Guided Speech Segmentation for Efficient Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel method that leverages self-information, a measure of the information contained within an utterance, as a supervisory signal for speech segmentation. |
W. S. Teo; Y. Minami; | icassp | 2025-04-15 |
| 277 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
Z. Zhuang; | icassp | 2025-04-15 |
| 278 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
J. Zhou; | icassp | 2025-04-15 |
| 279 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
N. B. Shankar; Z. Wang; E. Eren; A. Alwan; | icassp | 2025-04-15 |
| 280 | Improved Recognition of The Speech of People with Parkinson’s Who Stutter Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a novel stuttered speech data augmentation approach to improve dysarthric speech recognition. |
J. Na; X. Zheng; B. Lee; M. Hasegawa-Johnson; | icassp | 2025-04-15 |
| 281 | EFL-PEFT: A Communication Efficient Federated Learning Framework Using PEFT Sparsification for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we consolidate the use of PEFT for ASR with pre-trained models, demonstrating that it enables efficient FL by reducing the number of parameters to share relative to full fine-tuning. |
M. N. Ali; D. Falavigna; A. Brutti; | icassp | 2025-04-15 |
| 282 | Generating Targeted Universal Adversarial Perturbation Against Automatic Speech Recognition Via Phoneme Tailoring Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, to improve attack ability, we propose a Diverse Audio Composition Enrichment method, which enhances the utilization of audio features through phoneme-level slicing and recombination. |
Y. ZHANG et. al. | icassp | 2025-04-15 |
| 283 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
T. Parcollet; R. van Dalen; S. Zhang; S. Bhattacharya; | icassp | 2025-04-15 |
| 284 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). |
N. Moritz; | icassp | 2025-04-15 |
| 285 | META-CAT: Speaker-Informed Speech Embeddings Via Meta Information Concatenation for Multi-talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. |
J. Wang; | icassp | 2025-04-15 |
| 286 | Chain-of-Thought Prompting for Speech Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
K. Hu; | icassp | 2025-04-15 |
| 287 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
L. Meng; | icassp | 2025-04-15 |
| 288 | LLM Based Text Generation for Improved Low-resource Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prompting a large language model (LLM) to paraphrase input text can generate novel text data that is constrained to be semantically similar to the source data. We leverage this capability of LLMs to improve the performance of low-resource ASR systems by increasing the limited text training data while keeping the same spoken style. |
T. Nagano; | icassp | 2025-04-15 |
| 289 | A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions and long latency. |
Y. Jiang; B. Tian; | icassp | 2025-04-15 |
| 290 | PersoDA: Personalized Data Augmentation for Personalized ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user’s data to personalize ASR [1]–[3]. |
P. P. Parada; | icassp | 2025-04-15 |
| 291 | Harnessing The Zero-Shot Power of Instruction-Tuned Large Language Model for Guiding End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). |
Y. Higuchi; T. Ogawa; T. Kobayashi; | icassp | 2025-04-15 |
| 292 | Joint Training Framework for Accent and Speech Recognition Based on Conformer Low-Rank Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study introduces the Conformer Low-rank Adaptation for Joint Accent and Speech Recognition (CLAnSR), employing LoRA to augment both ASR and AR capabilities using a shared pre-trained base encoder. |
X. Zhuang; Y. Qian; S. Xu; M. Wang; | icassp | 2025-04-15 |
| 293 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition Via Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. |
N. L. Kühne; | icassp | 2025-04-15 |
| 294 | Alignment-Free Training for Transducer-based Multi-Talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. |
T. Moriya; | icassp | 2025-04-15 |
| 295 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) scenario in Automatic Speech Recognition (ASR). |
F. ZHANG et. al. | icassp | 2025-04-15 |
| 296 | Speech Recognition with LLMs Adapted to Disordered Speech Using Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. |
C. NAGPAL et. al. | icassp | 2025-04-15 |
| 297 | ValSub: Subsampling Validation Data to Mitigate Forgetting During ASR Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, such validation sets are large and impractical for mobile devices. To address this, we propose a novel method to subsample a substantially large validation set into a smaller one while maintaining the ability to estimate forgetting. |
H. Mehmood; | icassp | 2025-04-15 |
| 298 | Towards A Single ASR Model That Generalizes to Disordered Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates the impact of integrating a dataset of disordered speech recordings (~1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. |
J. Tobin; K. Tomanek; S. Venugopalan; | icassp | 2025-04-15 |
| 299 | Continuously Learning New Words in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Typical errors include acronyms, named entities, and domain-specific special words for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach: Given the audio of a lecture talk with the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. |
C. Huber; A. Waibel; | icassp | 2025-04-15 |
| 300 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. |
J. HU et. al. | icassp | 2025-04-15 |
| 301 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
J. Zhou; | icassp | 2025-04-15 |
| 302 | Zero-resource Speech Translation and Recognition with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
K. Mundnich; | icassp | 2025-04-15 |
| 303 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
T. Feng; | icassp | 2025-04-15 |
| 304 | Speech Emotion Recognition Based on Large-Scale Automatic Speech Recognizer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel speech emotion recognition (SER) method that fully leverages the architecture of Whisper, a large-scale automatic speech recognition (ASR) model. |
R. Fukuda; T. Kano; A. Ando; A. Ogawa; | icassp | 2025-04-15 |
| 305 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. |
A. A. Attia; D. Demszky; T. Ògúnremí; J. Liu; C. Espy-Wilson; | icassp | 2025-04-15 |
| 306 | MmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. |
S. Basak; A. Padarthi; M. Gowda; | icassp | 2025-04-15 |
| 307 | Elevating Robust ASR By Decoupling Multi-Channel Speaker Separation and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to decouple the training of the multi-channel speaker separation frontend and the ASR backend, with the latter trained only on clean speech. |
Y. Yang; H. Taherian; V. A. Kalkhorani; D. Wang; | icassp | 2025-04-15 |
| 308 | Using Corrected ASR Projection to Improve AD Recognition Performance from Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, Automatic Speech Recognition transcription errors, stemming from language impairments in AD and Mild Cognitive Impairment patients, can lead to information loss during feature extraction. To mitigate this, we introduce the Corrected ASR Projection (CAP) model. |
Y. ZHANG et. al. | icassp | 2025-04-15 |
| 309 | Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel contextual biasing approach that employs the Multi-label Synchronous Output CTC (MCTC) algorithm to enhance the synchronization between ASR and bias task outputs. |
M. Fang; | icassp | 2025-04-15 |
| 310 | Extending Whisper for Emotion Prediction Using Word-level Pseudo Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper extends Whisper’s automatic speech recognition (ASR) capabilities to perform speech-based emotion recognition (SER) by incorporating word-level emotion classification alongside ASR output. |
C. Y. KWOK et. al. | icassp | 2025-04-15 |
| 311 | Speech Recognition Rescoring with Large Speech-Text Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose novel techniques to use multi-modal LLM for ASR rescoring. |
P. G. Shivakumar; | icassp | 2025-04-15 |
| 312 | Speech Enhancement with MAP-based Training for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Hence, this study proposes a maximum a posteriori (MAP) algorithm for training SE models by incorporating the posterior probability of clean speech, given the enhanced speech, into the loss function. |
Y. -J. Li; R. Chao; B. Su; Y. Tsao; | icassp | 2025-04-15 |
| 313 | Audio Diffusion with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore an alternate approach to the popular method of using large language models (LLMs) as a second decoder for Automated Speech Recognition (ASR) and speech understanding tasks. |
Y. Huang; K. Kastner; K. Audhkhasi; B. Ramabhadran; A. Rosenberg; | icassp | 2025-04-15 |
| 314 | Speech Re-Painting for Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce speech re-painting, a method for in-context augmented synthesis, using target training datasets to generate new utterances guided by speech and text on the fly in a zero-shot manner. |
K. Kastner; | icassp | 2025-04-15 |
| 315 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
T. Abu; Y. Shi; T. F. Zheng; D. Wang; | icassp | 2025-04-15 |
| 316 | Speech Few-Shot Learning for Language Learners’ Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports how speech recognition accuracy can be improved using the speech few-shot in-context learning capabilities of a multimodal foundation model when applied to the speech of language learners. |
J. Cheng; S. Nguyen; | icassp | 2025-04-15 |
| 317 | Learning Rich Speech Representations with Acoustic-Semantic Factorization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Second, the entanglement of acoustic and semantic information can undermine model robustness, particularly in varied acoustic environments. To address these issues, we propose a two-branch multitask finetuning strategy that integrates Automatic Speech Recognition and transcript-aligned audio reconstruction, designed to preserve and disentangle semantic and acoustic information in a final layer of a pretrained model. |
M. Niu; | icassp | 2025-04-15 |
| 318 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
T. Hori; M. Kocour; A. Haider; E. McDermott; X. Zhuang; | icassp | 2025-04-15 |
| 319 | LLM Supervised Pre-training for Multimodal Emotion Recognition in Conversations IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. |
S. Dutta; S. Ganapathy; | icassp | 2025-04-15 |
| 320 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
B. Yan; V. Pratap; S. Watanabe; M. Auli; | icassp | 2025-04-15 |
| 321 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
Z. Cui; | icassp | 2025-04-15 |
| 322 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. |
H. Yen; S. Ling; G. Ye; | icassp | 2025-04-15 |
| 323 | Bridging The Modality Gap for Speech-image Retrieval with Text Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to leverage text supervision to facilitate the alignment between speech and image feature spaces via an automatic speech recognition (ASR) auxiliary task. |
Y. Yang; L. Zhou; Y. Li; G. Ma; | icassp | 2025-04-15 |
| 324 | Advancing Non-intrusive Suppression on Enhancement Distortion for Noise Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While various methods have enhanced recognition accuracy in SE-ASR systems, they often require fine-tuning or re-training of SE or ASR models, which is impractical in many real-world applications. In this paper, we propose a lightweight distortion suppression (DS) network that addresses these artifacts without modifying the SE or ASR models, treating them as fixed black boxes. |
W. Wang; S. Zhao; Y. Qian; | icassp | 2025-04-15 |
| 325 | Chinese Speech Processing Via Chinese Character Feature Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper focuses on the basic structure of Chinese characters: semantic-phonetic compound characters. It takes advantage of this feature to propose a novel Chinese speech-processing method based on character shape. |
R. Jiang; Z. Yang; W. Xi; X. Fu; J. Zhao; | icassp | 2025-04-15 |
| 326 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). |
G. CHENG et. al. | icassp | 2025-04-15 |
| 327 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, training large ASR models from scratch remains costly. To address this issue, we introduce UME, a novel method that efficiently Upcycles pretrained dense ASR checkpoints into larger Mixture-of-Experts (MoE) architectures. |
L. FU et. al. | icassp | 2025-04-15 |
| 328 | Automatic Speech Recognition and Spoken Language Understanding of Maritime Radio Communications: A Case Study with Singapore Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents several contributions designed to improve ASR and SLU systems by releasing a dataset for ASR and SLU tasks in the maritime domain. |
P. Dat; J. M. Madhathil; T. Huy Dat; | icassp | 2025-04-15 |
| 329 | Enhancing Multilingual ASR for Unseen Languages Via Language Embedding Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, despite its success, Whisper struggles with unseen languages, which are not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose a method that exploits these relationships to improve ASR performance of Whisper in unseen languages. |
S. -S. Huang; K. -P. Huang; A. T. Liu; H. -Y. Lee; | icassp | 2025-04-15 |
| 330 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Z. Tang; D. Wang; S. Huang; S. Shang; | icassp | 2025-04-15 |
| 331 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. |
Y. Yang; | icassp | 2025-04-15 |
| 332 | Speech Retrieval-Augmented Generation Without Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. |
D. J. MIN et. al. | icassp | 2025-04-15 |
| 333 | From Characters to Subwords: Modeling Unit Conversion for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel low-resource ASR method that leverages the advantages of two different modeling units. |
Y. Wang; H. Zhang; H. Wang; L. Sun; M. Song; | icassp | 2025-04-15 |
| 334 | Injecting Visual Features Into Whisper for Parameter-Efficient Noise-Robust Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This gap highlights the need for more efficient methods to leverage visual and acoustic information in AVSR tasks. To address this challenge, we propose AVWhisper, a parameter-efficient model that integrates visual and acoustic representations by injecting visual features from the AV-HuBERT encoder into the pre-trained Whisper model. |
Z. Yang; | icassp | 2025-04-15 |
| 335 | Token-Level Contextual Network with Ladder-Shaped Attention for End-to-End ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: More importantly, we propose a creative approach to address the challenge of the increasing size of expanding token list compared to phrase list. |
M. Fang; | icassp | 2025-04-15 |
| 336 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. |
L. Grigoryan; N. Karpov; E. Albasiri; V. Lavrukhin; B. Ginsburg; | icassp | 2025-04-15 |
| 337 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
E. Sarkar; M. Magimai.-Doss; | icassp | 2025-04-15 |
| 338 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. |
J. Zhao; | icassp | 2025-04-15 |
| 339 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. |
R. Nasretdinov; R. Korostik; A. Jukić; | icassp | 2025-04-15 |
| 340 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
M. BARAŃSKI et. al. | icassp | 2025-04-15 |
| 341 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
A. S. Gyeong Choi; J. Park; M. Oh; | icassp | 2025-04-15 |
| 342 | Adopting Whisper for Confidence Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. |
V. Aggarwal; S. S. Nair; Y. Verma; Y. Jogi; | icassp | 2025-04-15 |
| 343 | Dynamic Language Group-based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
H. Huang; | icassp | 2025-04-15 |
| 344 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
U. Cappellazzo; | icassp | 2025-04-15 |
| 345 | Enhancing Low-Resource ASR Through Versatile TTS: Bridging The Data Gap IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. |
G. Yang; | icassp | 2025-04-15 |
| 346 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. |
T. -B. Nguyen; A. Waibel; | icassp | 2025-04-15 |
| 347 | Adaptive Decoding for Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an adaptive decoding method (ADD) to reduce the latency. |
X. Ma; | icassp | 2025-04-15 |
| 348 | Can Automated Speech Recognition Errors Provide Valuable Clues for Alzheimer’s Disease Detection? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Finally, we conduct an interpretability study, including linguistic and SHapley Additive exPlanations (SHAP) analyses. This study reveals that greater word distribution differences between AD and healthy control (HC) groups in ASR transcripts may be linked to these valuable clues. |
Y. -L. Liu; | icassp | 2025-04-15 |
| 349 | Visual-Aware Speech Recognition for Noisy Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. |
Lakshmipathi Balaji; Karan Singla; | arxiv-cs.CL | 2025-04-09 |
| 350 | IndicST: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The integration of speech modalities into large language models, known as Speech LLMs, is a promising area of research for applications like automatic speech recognition (ASR) and … |
Sanket Shah; Kavya Ranjan Saxena; Kancharana Manideep Bharadwaj; Sharath Adavanne; Nagaraj Adiga; | 2025 IEEE International Conference on Acoustics, Speech, … | 2025-04-06 |
| 351 | Speech Few-Shot Learning for Language Learners’ Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Jian Cheng; Sam Nguyen; | IEEE International Conference on Acoustics, Speech, and … | 2025-04-06 |
| 352 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect’s linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets — comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. |
Hedi Naouara; Jean-Pierre Lorré; Jérôme Louradour; | arxiv-cs.CL | 2025-04-03 |
| 353 | Chain of Correction for Full-text Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC’s performance. |
ZHIYUAN TANG et. al. | arxiv-cs.CL | 2025-04-02 |
| 354 | Whispering Under The Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we concentrate on using adversarial examples to mitigate unauthorized disclosure of speech privacy thwarted by potential eavesdroppers in speech communications. |
WEIFEI JIN et. al. | arxiv-cs.CR | 2025-04-01 |
| 355 | The Impact of Code-switched Synthetic Data Quality Is Task Dependent: Insights from MT and ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. |
Injy Hamed; Ngoc Thang Vu; Nizar Habash; | arxiv-cs.CL | 2025-03-30 |
| 356 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. |
YANGYANG MENG et. al. | arxiv-cs.CL | 2025-03-26 |
| 357 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research. |
DAWIT KETEMA GETE et. al. | arxiv-cs.CL | 2025-03-24 |
| 358 | Elevating Robust Multi-Talker ASR By Decoupling Speaker Separation and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. |
Yufeng Yang; Hassan Taherian; Vahid Ahmadi Kalkhorani; DeLiang Wang; | arxiv-cs.SD | 2025-03-22 |
| 359 | Your Voice Is Your Voice: Supporting Self-expression Through Speech Generation and LLMs in Augmented and Alternative Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users’ expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). |
YIWEN XU et. al. | arxiv-cs.HC | 2025-03-21 |
| 360 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
| 361 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We conducted a user study with 75 participants to evaluate the feasibility and efficiency of this workflow. |
Korbinian Kuhn; Verena Kersken; Gottfried Zimmermann; | arxiv-cs.HC | 2025-03-19 |
| 362 | Automatic Speech Recognition: Comparisons Between Convolutional Neural Networks, Hidden Markov Model And Hybrid Architecture Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems have been widely used as a practical method of interaction between humans and devices. They are typically employed to enhance the … |
Lyndainês Santos; Nícolas de Araújo Moreira; Robson Sampaio; Raizielle Lima; Francisco Carlos Mattos Brito Oliveira; | Expert Systems | 2025-03-19 |
| 363 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. |
AOHAN ZENG et. al. | iclr | 2025-03-17 |
| 364 | Speech Robust Bench: A Robustness Benchmark For Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. |
Muhammad A Shah; David Solans Noguero; Mikko A. Heikkilä; Bhiksha Raj; Nicolas Kourtellis; | iclr | 2025-03-17 |
| 365 | CR-CTC: Consistency Regularization on CTC for Improved Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. |
ZENGWEI YAO et. al. | iclr | 2025-03-17 |
| 366 | T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis Via Multitask Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce T2V2 (**T**ext to **V**oice and **V**oice to **T**ext), a unified non-autoregressive model capable of performing both automatic speech recognition (ASR) and text-to-speech (TTS) synthesis within the same framework. |
Nabarun Goswami; Hanqin Wang; Tatsuya Harada; | iclr | 2025-03-17 |
| 367 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches to understand their limitations on this dataset and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively. |
Soumya Shamarao Jahagirdar; Jayasree Saha; C V Jawahar; | arxiv-cs.CV | 2025-03-11 |
| 368 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By detailing the performance of several of the most recent, widely-available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases. |
Michael McGuire; | arxiv-cs.CL | 2025-03-10 |
| 369 | Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. |
Lucas Block Medin; Thomas Pellegrini; Lucile Gelin; | arxiv-cs.SD | 2025-03-06 |
| 370 | Large Language Models Cover for Speech Recognition Mistakes: Evaluating Conversational AI for Second Language Learners Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) technology has been reported to reach near-human performance in recent years, yet it continues to struggle with atypical speakers, particularly … |
Eva Verhelst; Tony Belpaeme; | 2025 20th ACM/IEEE International Conference on Human-Robot … | 2025-03-04 |
| 371 | Speaking with Robots in Noisy Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: A fundamental limitation for speech-enabled human-robot interaction (HRI) is automatic speech recognition (ASR), or the process of converting a raw speech signal into text. If … |
Shuubham Ojha; Felix Gervits; Carol Y. Espy-Wilson; | 2025 20th ACM/IEEE International Conference on Human-Robot … | 2025-03-04 |
| 372 | Direct Speech to Speech Translation: A Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our review examines the evolution of S2ST, comparing traditional cascade models, which rely on automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components, with newer end-to-end and direct speech translation (DST) models that bypass intermediate text representations. |
Mohammad Sarim; Saim Shakeel; Laeeba Javed; Mohammad Nadeem; | arxiv-cs.CL | 2025-03-03 |
| 373 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study explores fine-tuning OpenAI’s Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. |
Samuel S. Sohn; Sten Knutsen; Karin Stromswold; | arxiv-cs.SD | 2025-03-03 |
| 374 | Unveiling Biases While Embracing Sustainability: Assessing The Dual Challenges of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a bias and sustainability focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performances. |
Ajinkya Kulkarni; Atharva Kulkarni; Miguel Couceiro; Isabel Trancoso; | arxiv-cs.CL | 2025-03-02 |
| 375 | Hybrid RMDL-CNN for Speech Recognition from Unclear Speech Signal Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Raja Bhargava; N. Arivazhagan; Kunchala Suresh Babu; | International Journal of Speech Technology | 2025-03-01 |
| 376 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents the development of ASR models fine-tuned specifically for Southeast Asian accents using a newly created dataset. |
MARCUS YU ZHE WEE et. al. | arxiv-cs.LG | 2025-02-27 |
| 377 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2025-02-26 |
| 378 | Speed Master: Quick or Slow Play to Attack Speaker Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our paper presents a novel attack methodology named Speed Master, which undermines deep neural networks by manipulating the speed of speech samples. |
ZHE YE et. al. | aaai | 2025-02-25 |
| 379 | Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. |
YIHAN WU et. al. | aaai | 2025-02-25 |
| 380 | Exploring Gender Disparities in Automatic Speech Recognition Technology Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates factors influencing Automatic Speech Recognition (ASR) systems’ fairness and performance across genders, beyond the conventional examination of demographics. |
Hend ElGhazaly; Bahman Mirheidari; Nafise Sadat Moosavi; Heidi Christensen; | arxiv-cs.CL | 2025-02-25 |
| 381 | Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | aaai | 2025-02-25 |
| 382 | Uncertainty-Aware Self-Training for CTC-Based Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the current sequence-level uncertainty estimation method for connectionist temporal classification (CTC) based ASR models drops the output probability information and depends only on the textual distance of decoded predictions. In this study, we argue that this results in limited performance improvement and propose a novel output probability-based sequence-level uncertainty estimation method. |
Eungbeom Kim; Kyogu Lee; | aaai | 2025-02-25 |
| 383 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). |
KULUHAN BINICI et. al. | aaai | 2025-02-25 |
| 384 | Silent Speech Sentence Recognition with Six-Axis Accelerometers Using Conformer and CTC Algorithm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A novel silent speech sentence recognition method is proposed to convert the facial motion signals collected by six-axis accelerometers into transcribed words and sentences. |
YUDONG XIE et. al. | arxiv-cs.HC | 2025-02-24 |
| 385 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, specifically focusing on enhancing rare word recognition performance. |
Haoxuan Wang; | arxiv-cs.CL | 2025-02-22 |
| 386 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resource. |
JENALEA RAJAB et. al. | arxiv-cs.CL | 2025-02-21 |
| 387 | On The Robust Approximation of ASR Metrics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. |
Abdul Waheed; Hanin Atwany; Rita Singh; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
| 388 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. |
Hanin Atwany; Abdul Waheed; Rita Singh; Monojit Choudhury; Bhiksha Raj; | arxiv-cs.CL | 2025-02-17 |
| 389 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. |
XIANGYU LU et. al. | arxiv-cs.CL | 2025-02-16 |
| 390 | MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional manners through 3 training objectives: ULM, BMLM, and UMLM. |
Qingliang Meng; Pengju Ren; Tian Li; Changsong Dai; Huizhi Liang; | arxiv-cs.CL | 2025-02-14 |
| 391 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We analyze three main architectural approaches: Network Speech Recognition (NSR), Distributed Speech Recognition (DSR), and Embedded Speech Recognition (ESR). |
Inaam F. Qutaiba I. Ali; | arxiv-cs.NI | 2025-02-10 |
| 392 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. |
MARDHIYAH SANNI et. al. | arxiv-cs.CL | 2025-02-06 |
| 393 | Integrating Automatic Speech Recognition Into Remote Healthcare Interpreting: A Pilot Study of Its Impact on Interpreting Quality Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. |
Shiyi Tan; Constantin Orăsan; Sabine Braun; | arxiv-cs.CL | 2025-02-05 |
| 394 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC’s scaling issues. |
MARTIJN BARTELDS et. al. | arxiv-cs.LG | 2025-02-03 |
| 395 | Exploring Discrete Speech Units for Privacy-preserving and Efficient Speech Recognition for School-aged and Preschool Children Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Satwik Dutta; Dwight Irvin; J. Hansen; | Int. J. Hum. Comput. Stud. | 2025-02-01 |
| 396 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. |
Anna Seo Gyeong Choi; Jonghyeon Park; Myungwoo Oh; | arxiv-cs.CL | 2025-02-01 |
| 397 | When End-to-End Is Overkill: Rethinking Cascaded Speech-to-Text Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. |
Anna Min; Chenxu Hu; Yi Ren; Hang Zhao; | arxiv-cs.CL | 2025-02-01 |
| 398 | Sagalee: An Open Source Automatic Speech Recognition Dataset for Oromo Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. |
Turi Abu; Ying Shi; Thomas Fang Zheng; Dong Wang; | arxiv-cs.CL | 2025-02-01 |
| 399 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. |
Zhengdong Yang; Qianying Liu; Sheng Li; Fei Cheng; Chenhui Chu; | arxiv-cs.CL | 2025-01-29 |
| 400 | AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we introduce the AVE Speech, a comprehensive multi-modal dataset for speech recognition tasks. |
DONGLIANG ZHOU et. al. | arxiv-cs.SD | 2025-01-28 |
| 401 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates the prevalence and impact of ASR errors in medical transcription in Nigeria, the United Kingdom, and the United States. |
Ayo Adedeji; Mardhiyah Sanni; Emmanuel Ayodele; Sarita Joshi; Tobi Olatunji; | arxiv-cs.CL | 2025-01-25 |
| 402 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. |
XUELONG GENG et. al. | arxiv-cs.SD | 2025-01-22 |
| 403 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. |
Moreno La Quatra; Valerio Mario Salerno; Yu Tsao; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-01-22 |
| 404 | A Review on Speech Recognition Approaches and Challenges for Portuguese: Exploring The Feasibility of Fine-tuning Large-scale End-to-end Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: At present, automatic speech recognition has become an important bridge for human-computer interaction and is widely applied in multiple fields. The Portuguese speech recognition … |
Yan Li; Yapeng Wang; Lap-Man Hoi; Dingcheng Yang; Sio-Kei Im; | EURASIP J. Audio Speech Music. Process. | 2025-01-21 |
| 405 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. |
MATEUSZ BARAŃSKI et. al. | arxiv-cs.SD | 2025-01-20 |
| 406 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). |
Takaaki Hori; Martin Kocour; Adnan Haider; Erik McDermott; Xiaodan Zhuang; | arxiv-cs.CL | 2025-01-15 |
| 407 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. |
Natarajan Balaji Shankar; Zilai Wang; Eray Eren; Abeer Alwan; | arxiv-cs.CL | 2025-01-14 |
| 408 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. |
JILIANG HU et. al. | arxiv-cs.SD | 2025-01-13 |
| 409 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Nonetheless, the evaluation of multilingual SLU is limited to shallow tasks such as intent classification or language identification. This is why we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering via listening comprehension spanning 944 hours of speech across 92 languages. |
Fabian David Schmidt; Ivan Vulić; Goran Glavaš; David Ifeoluwa Adelani; | arxiv-cs.CL | 2025-01-10 |
| 410 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. |
Eklavya Sarkar; Mathew Magimai.-Doss; | arxiv-cs.LG | 2025-01-10 |
| 411 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work evaluates RoPE across diverse ASR tasks with training data ranging from 100 to 50,000 hours, covering various speech types (read, spontaneous, clean, noisy) and different accents in both streaming and non-streaming settings. |
Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya; | arxiv-cs.CL | 2025-01-10 |
| 412 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state space models (SSMs). |
Syed Abdul Gaffar Shakhadri; Kruthika KR; Kartik Basavaraj Angadi; | arxiv-cs.CL | 2025-01-06 |
| 413 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. |
ZHONGJIAN CUI et. al. | arxiv-cs.SD | 2025-01-05 |
| 414 | Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of “listening and seeing again”. |
Rui Liu; Hongyu Yuan; Haizhou Li; | arxiv-cs.MM | 2025-01-03 |
| 415 | Research on Digital Human Speech Recognition Method in High-Disturbance Industrial Environment Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The advent of industrial robotics and speech technology has precipitated a paradigm shift in the manner in which humans and machines collaborate. This article investigates the … |
Pengyu Zhu; Xiaobin Li; Haiyan Sun; Zhuoyi Chen; Jingsi Wang; | IEEE Transactions on Instrumentation and Measurement | 2025-01-01 |
| 416 | Transformer-Based Approach for Solving Mathematical Problems Using Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this paper, we introduce Vox Calculi, a system designed to solve mathematical problems using voice transcriptions. By leveraging state-of-the-art pretrained Automatic Speech … |
Ante Grgurević; Marina Bagić Babac; | IEEE Access | 2025-01-01 |
| 417 | Low-Resource Speech Recognition of Radiotelephony Communications Based on Continuous Learning of In-Domain and Out-of-Domain Knowledge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) in air traffic control (ATC) is a low-resource task with limited data and difficult annotation. Fine-tuning self-supervised pre-trained models … |
Guimin Jia; Dong He; Xilong Zhou; | IEEE Signal Processing Letters | 2025-01-01 |
| 418 | Code-Switching ASR for Low-Resource Indic Languages: A Hindi-Marathi Case Study Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This work examines the development of Automatic Speech Recognition (ASR) systems for low-resource languages, focusing on Hindi and Marathi, particularly in multilingual and … |
H. PALIVELA et. al. | IEEE Access | 2025-01-01 |
| 419 | A Review of Speech Recognition and Application to Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Eman Aboelela; Omar Mansour; | FICC | 2025-01-01 |
| 420 | Controllable Conformer for Speech Enhancement and Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We propose a novel approach to speech enhancement, termed Controllable ConforMer for Speech Enhancement (CCMSE), which leverages a Conformer-based architecture integrated with a … |
Zilu Guo; Jun Du; Sabato Marco Siniscalchi; Jia Pan; Qingfeng Liu; | IEEE Signal Processing Letters | 2025-01-01 |
| 421 | Relative Applicability of Diverse Automatic Speech Recognition Platforms for Transcription of Psychiatric Treatment Sessions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Service delivery in mental healthcare involves documentation of sensitive patient-clinician conversations that require serious caution. Conventionally, clinicians take handwritten … |
Rana Zeeshan; J. Bogue; Mamoona Naveed Asghar; | IEEE Access | 2025-01-01 |
| 422 | Leveraging Synthetic Data for Improved Manipuri-English Code-Switched ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Accurately recognizing code-switched speech presents a significant challenge in the field of Automatic speech recognition (ASR), particularly for low-resource regional languages. … |
Naorem Karline Singh; Wangkheimayum Madal; Chingakham Neeta Devi; Hoomexsun Pangsatabam; Y. J. Chanu; | IEEE Access | 2025-01-01 |
| 423 | Machine Learning and Deep Learning Approaches for Accent Recognition: A Review Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Accent recognition has attracted immense research interest owing to the advancements in automatic speech recognition (ASR) systems. Accent variations are an essential factor in … |
Muzaffar Ahmad Dar; Jagalingam Pushparaj; | IEEE Access | 2025-01-01 |
| 424 | Deep Learning-Based Coding Strategy for Improved Cochlear Implant Speech Perception in Noisy Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) and speech enhancement are essential tools in modern life, aiding not only in machine interaction but also in supporting individuals with … |
Billel Essaid; Hamza Kheddar; N. Batel; M. Chowdhury; | IEEE Access | 2025-01-01 |
| 425 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. |
Or Haim Anidjar; Revital Marbel; Roi Yozevitch; | arxiv-cs.CL | 2024-12-31 |
| 426 | Zero-resource Speech Translation and Recognition with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. |
KAREL MUNDNICH et. al. | arxiv-cs.CL | 2024-12-24 |
| 427 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. |
JIAHUI ZHAO et. al. | arxiv-cs.CL | 2024-12-21 |
| 428 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. |
HE WANG et. al. | arxiv-cs.SD | 2024-12-17 |
| 429 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. |
Yerin Choi; Jeehyun Lee; Myoung-Wan Koo; | arxiv-cs.SD | 2024-12-04 |
| 430 | Maximizing The Capabilities of Tiny Speech Foundation Models in A Privacy Preserving Manner Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Voice assistive technologies have given rise to extensive privacy and security concerns. In this paper we investigate whether robust automatic speech recognition (ASR) can be done … |
A. Benazir; Felix Xiaozhu Lin; | Proceedings of the 30th Annual International Conference on … | 2024-12-04 |
| 431 | On-Device Speech Filtering for Privacy-Preserving Acoustic Activity Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Acoustic sensing has become increasingly prevalent for mobile and ambient devices for applications such as human activity recognition, health monitoring, and environmental … |
Haozhe Zhou; Sudershan Boovaraghavan; Mayank Goel; Yuvraj Agarwal; | Proceedings of the 30th Annual International Conference on … | 2024-12-04 |
| 432 | Multi-Modal Video Summarization Based on Two-Stage Fusion of Audio, Visual, and Recognized Text Information Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Multi-modal video summarization task provides a text summary by combining information from different kinds of inputs. Previous methods usually generate the summary from video and … |
Zekun Yang; Jiajun He; T. Toda; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 433 | A Tiny Whisper-SER: Unifying Automatic Speech Recognition and Multi-label Speech Emotion Recognition Tasks Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech Emotion Recognition (SER) is critical in human-computer interaction (HCI). Previous SER works often utilize Automatic Speech Recognition (ASR) systems to improve SER … |
Huang-Cheng Chou; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 434 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot IF:4 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. |
AOHAN ZENG et. al. | arxiv-cs.CL | 2024-12-03 |
| 435 | LLM As Decoder: Investigating Lattice-based Speech Recognition Hypotheses Rescoring Using LLM Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and … |
Sheng Li; Y. Ko; Akinori Ito; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 436 | Domain Adaptation By Alternating Learning of Acoustic and Linguistic Information for Japanese Deaf and Hard-of-Hearing People Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: More than half of Japanese people with hearing impairments communicate using speech, however speech recognition systems trained using speech from individuals with normal hearing … |
Kaito Takahashi; Yukoh Wakabayashi; Kengo Ohta; Akio Kobayashi; N. Kitaoka; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 437 | Data Selection Using Spoken Language Identification for Low-Resource and Zero-Resource Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Large-scale pre-trained models have become common for Automatic Speech Recognition (ASR) tasks. They utilize large-scale, multilingual datasets to learn acoustic features and then … |
Jianan Chen; Chenhui Chu; Sheng Li; Tatsuya Kawahara; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 438 | A Comparative Study on The Biases of Age, Gender, Dialects, and L2 Speakers of Automatic Speech Recognition for Korean Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advancements in the field of Automatic Speech Recognition (ASR) have seen the emergence of large-scale models, contributing to a surge in research and development. The … |
Jonghwan Na; Yeseul Park; Bowon Lee; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 439 | Adversarial Augmentation and Adaptation for Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: It is crucial to conduct parameter efficient learning to adapt a large-scaled pre-trained backbone model to a downstream task where the desirable performance could be achieved for … |
Jen-Tzung Chien; Wei-Yu Sun; | 2024 Asia Pacific Signal and Information Processing … | 2024-12-03 |
| 440 | Lite ASR Transformer: A Light Weight Transformer Architecture For Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Transformers are popular sequence-to-sequence models but have large number of parameters and high compute requirements. As an initiative to reduce the energy demand by Transformer … |
Narla John Metilda Sagaya Mary; S. Umesh; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
| 441 | Advancing CTC Models for Better Speech Alignment: A Topological Approach Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems often face challenges in alignment quality, particularly with the Connectionist Temporal Classification (CTC) approach, which frequently … |
Zeyu Zhao; Peter Bell; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
| 442 | Floras 50: A Massively Multilingual Multitask Benchmark for Long-Form Conversational Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: A common criticism for current speech recognition benchmarks is the reliance on settings which do not generalize well to real-world conversational environments, such as read … |
William Chen; Brian Yan; Chih-Chen Chen; Shinji Watanabe; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
| 443 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Multi-Task Automatic Speech Recognition Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly … |
Vyas Raina; Mark J. F. Gales; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
| 444 | Enhanced ASR for Stuttering Speech: Combining Adversarial and Signal-Based Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper presents our submission to the SLT2024 StutteringSpeech Challenge, focusing on augmenting stuttering data using straightforward and effective techniques. We combined … |
Shangkun Huang; Dejun Zhang; Jing Deng; Rong Zheng; | 2024 IEEE Spoken Language Technology Workshop (SLT) | 2024-12-02 |
| 445 | Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer‐Based Approaches and Late Fusion Strategies Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This research endeavor aims to advance the field of Automatic Speech Recognition (ASR) by innovatively integrating multimodal data, specifically textual transcripts and Mel … |
Sunakshi Mehra; Virender Ranga; Ritu Agarwal; | Computational Intelligence | 2024-12-01 |
| 446 | SpeeDF – A Speech De-Identification Framework Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper proposes SpeeDF, a novel three-step framework for anonymizing speech data, particularly focusing on Singaporean English (Singlish). SpeeDF tackles the challenge of … |
C. S. Veerappan; Priyanshu Dhingra; Daniel Zhengkui Wang; Rong Tong; | TENCON 2024 – 2024 IEEE Region 10 Conference (TENCON) | 2024-12-01 |
| 447 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Hence, in this work, we aim to explore the capability of LLMs in low resource ASR and Mandarin-English code switching ASR. |
Zheshu Song; Ziyang Ma; Yifan Yang; Jianheng Zhuo; Xie Chen; | arxiv-cs.AI | 2024-12-01 |
| 448 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). |
GEOFFREY TYNDALL et. al. | arxiv-cs.CL | 2024-11-27 |
| 449 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. |
Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.CL | 2024-11-27 |
| 450 | How to Learn A New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. |
SHIH-HENG WANG et. al. | arxiv-cs.SD | 2024-11-27 |
| 451 | Aligning Pre-trained Models for Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). |
Šimon Sedláček; Santosh Kesiraju; Alexander Polok; Jan Černocký; | arxiv-cs.CL | 2024-11-27 |
| 452 | Comparative Analysis of ASR Methods for Speech Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. |
DAVIDE SALVI et. al. | arxiv-cs.SD | 2024-11-26 |
| 453 | High-precision Medical Speech Recognition Through Synthetic Data and Semantic Correction: UNITED-MEDASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems in the clinical domain face significant challenges, notably the need to recognise specialised medical vocabulary accurately and meet … |
Sourav Banerjee; Ayushi Agarwal; Promila Ghosh; | ArXiv | 2024-11-24 |
| 454 | Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on Edge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized … |
RUIYANG QIN et. al. | 2025 IEEE/ACM International Conference On Computer Aided … | 2024-11-21 |
| 455 | CAFE A Novel Code Switching Dataset for Algerian Dialect French and English Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The paper introduces and publicly releases (Data download link available after acceptance) CAFE — the first code-switching dataset between the Algerian dialect, French, and English languages. |
HOUSSAM EDDINE-OTHMAN LACHEMAT et. al. | arxiv-cs.SD | 2024-11-20 |
| 456 | From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) technology has witnessed significant advancements in recent years, revolutionizing human-computer interactions. While major languages have … |
Muhammad Sharif; Zeeshan Abbas; Jiangyan Yi; Chenglin Liu; | ArXiv | 2024-11-20 |
| 457 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR Using Zero-Shot TTS and LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. |
JIAWEI YU et. al. | arxiv-cs.CL | 2024-11-20 |
| 458 | Disfluency Detection and Removal in Speech Transcriptions Via Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The field of Automatic Speech Recognition (ASR) has significantly expanded within the technological landscape due to its extensive use in sectors such as education, healthcare, … |
Pedro L. S. de Lima; C. E. C. Campelo; | Brazilian Symposium in Information and Human Language … | 2024-11-17 |
| 459 | BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. |
MD. NAZMUS SADAT SAMIN et. al. | arxiv-cs.CL | 2024-11-16 |
| 460 | Interactive Cycle Model: The Linkage Combination Among Automatic Speech Recognition, Large Language Models and Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This research proposes the interaction loop model ASR-LLMs-Smart Glasses, which model combines automatic speech recognition, large language model and smart glasses to facilitate seamless human-computer interaction. |
Libo Wang; | arxiv-cs.HC | 2024-11-15 |
| 461 | Everyone Deserves Their Voice to Be Heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. |
Rik Raes; Saskia Lensink; Mykola Pechenizkiy; | arxiv-cs.CL | 2024-11-14 |
| 462 | Task Arithmetic Can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods suffer in performance when they fine-tune an automatic speech recognition (ASR) model on synthetic data as they suffer from the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task arithmetic is effective at mitigating this gap. |
Hsuan Su; Hua Farn; Fan-Yun Sun; Shang-Tse Chen; Hung-yi Lee; | emnlp | 2024-11-11 |
| 463 | BLSP-Emo: Towards Empathetic Large Speech-Language Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. |
CHEN WANG et. al. | emnlp | 2024-11-11 |
| 464 | Advancing Test-Time Adaptation in Wild Acoustic Test Settings Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. |
Hongfu Liu; Hengguan Huang; Ye Wang; | emnlp | 2024-11-11 |
| 465 | Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. |
Giuseppe Attanasio; Beatrice Savoldi; Dennis Fucci; Dirk Hovy; | emnlp | 2024-11-11 |
| 466 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | emnlp | 2024-11-11 |
| 467 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. |
YEONJOON JUNG et. al. | emnlp | 2024-11-11 |
| 468 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; | emnlp | 2024-11-11 |
| 469 | Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). |
Yoshiki Masuyama; Koichi Miyazaki; Masato Murata; | arxiv-cs.SD | 2024-11-11 |
| 470 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a novel approach effectively utilizing audio-related image information and set up VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. |
JILIANG HU et. al. | emnlp | 2024-11-11 |
| 471 | Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Following this framework, we introduce Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR. |
Guan-Ting Lin; Wei Ping Huang; Hung-yi Lee; | emnlp | 2024-11-11 |
| 472 | ConMamba: A Convolution-Augmented Mamba Encoder Model for Efficient End-to-End ASR Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: End-to-End Automatic Speech Recognition (ASR) models, such as Conformer, excel in accuracy but face limitations in computational complexity and positional awareness, hindering … |
Haoxiang Hou; Xun Gong; Yanmin Qian; | 2024 IEEE 14th International Symposium on Chinese Spoken … | 2024-11-07 |
| 473 | Adversarial Attack and Defense for Commercial Black-box Chinese-English Speech Recognition Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The attacker can generate adversarial examples (AEs) to stealthily mislead automatic speech recognition (ASR) models, raising significant concerns about the security of … |
XUEJING YUAN et. al. | ACM Transactions on Privacy and Security | 2024-11-07 |
| 474 | Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer’s Disease Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) plays an important role in speech-based automatic detection of Alzheimer’s disease (AD). However, recognition errors could propagate downstream, … |
Jiawen Kang; Junan Li; Jinchao Li; Xixin Wu; Helen M. Meng; | 2024 IEEE 14th International Symposium on Chinese Spoken … | 2024-11-07 |
| 475 | Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this approach we aim to build ASR model for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. |
Leena G Pillai; Kavya Manohar; Basil K Raju; Elizabeth Sherly; | arxiv-cs.CL | 2024-11-07 |
| 476 | Dialectal Coverage And Generalization in Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. |
Amirbek Djanibekov; Hawau Olamide Toyin; Raghad Alshalan; Abdullah Alitr; Hanan Aldarmaki; | arxiv-cs.CL | 2024-11-07 |
| 477 | Ensemble Knowledge Distillation from Speech SSL Models Considering Inter-Teacher Differences Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In the realm of speech processing, Self-Supervised Learning (SSL) models such as HuBERT are widely used in various speech tasks such as Automatic Speech Recognition (ASR) and … |
Pei-Jun Liao; Hung-Yi Lee; Hsin-Min Wang; | 2024 IEEE 14th International Symposium on Chinese Spoken … | 2024-11-07 |
| 478 | Dual-Strategy Fusion Method in Noise-Robust Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems have become integral to various aspects of people’s lives. However, the presence of noise in real-world environments often affects ASR … |
Jiahao Li; Cunhang Fan; Enrui Liu; Jian Zhou; Zhao Lv; | 2024 IEEE 14th International Symposium on Chinese Spoken … | 2024-11-07 |
| 479 | Enhancing AAC Software for Dysarthric Speakers in E-Health Settings: An Evaluation Using TORGO Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prompt-overlap is a well-known issue with this dataset where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. |
Macarious Hui; Jinda Zhang; Aanchan Mohan; | arxiv-cs.CL | 2024-11-01 |
| 480 | CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce CIEASR (Contextual Image-Enhanced Automatic Speech Recognition), a novel multimodal speech recognition model that incorporates a new cue fusion method, using scene images as soft prompts to correct homophone errors. |
ZIYI WANG et. al. | mm | 2024-10-30 |
| 481 | Simultaneous Speech and Eating Behavior Recognition Using Multitask Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The importance of talking and eating in preserving one’s well-being has been underscored. Automatic recognition of daily conversation and eating behavior has promising … |
Toshihiro Tsukagoshi; Masafumi Nishida; Masafumi Nishimura; | 2024 IEEE 13th Global Conference on Consumer Electronics … | 2024-10-29 |
| 482 | Evaluation of Speech Translation Subtitles Generated By ASR with Unnecessary Word Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This study addresses the problem of generating understandable speech translation subtitles for spontaneous speech, such as lectures and talks, which often contain disfluencies … |
Makoto Hotta; Chee Siang Leow; N. Kitaoka; Hiromitsu Nishizaki; | 2024 IEEE 13th Global Conference on Consumer Electronics … | 2024-10-29 |
| 483 | Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2024-10-29 |
| 484 | Speech Recognition for A Person With Cerebral Palsy Using Whisper Fine-Tuned on Japanese and English Dysarthric Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: People with cerebral palsy often have dysarthria, and this makes it hard for them to speak as they wish. In this paper, we present an automatic speech recognition (ASR) model for … |
Kirito Haze; R. Takashima; Tetsuya Takiguchi; | 2024 IEEE 13th Global Conference on Consumer Electronics … | 2024-10-29 |
| 485 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach that first refines all available transcriptions to ensure data reliability. |
Enshi Zhang; Christian Poellabauer; | arxiv-cs.CL | 2024-10-27 |
| 486 | Self-supervised Learning Using Unlabeled Speech with Multiple Types of Speech Disorder for Disordered Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper investigates a training method of an automatic speech recognition (ASR) model for people with speech disorders. Because the characteristics of their speech differ … |
R. Takashima; Takeru Otani; Ryo Aihara; Tetsuya Takiguchi; Shinya Taguchi; | Proceedings of the 26th International ACM SIGACCESS … | 2024-10-27 |
| 487 | Improving End-to-End Speech Recognition for Dysarthric Speech Through In-Domain Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses … |
Paban Sapkota; H. Kathania; Sudarsana Reddy Kadiri; Shrikanth S. Narayanan; | 2024 58th Asilomar Conference on Signals, Systems, and … | 2024-10-27 |
| 488 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. |
Hawau Olamide Toyin; Hao Li; Hanan Aldarmaki; | arxiv-cs.CL | 2024-10-24 |
| 489 | Evaluating Automatic Speech Recognition Systems for Korean Meteorological Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. We believe our work provides a foundation for future advancements in ASR for the Korean weather forecasting domain. |
ChaeHun Park; Hojun Cho; Jaegul Choo; | arxiv-cs.CL | 2024-10-24 |
| 490 | MmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. |
Suryoday Basak; Abhijeeth Padarthi; Mahanth Gowda; | arxiv-cs.SD | 2024-10-22 |
| 491 | DENOASR: Debiasing ASRs Through Selective Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel framework DENOASR, which is a selective denoising technique to reduce the disparity in the word error rates between the two gender groups, male and female. |
Anand Kumar Rai; Siddharth D Jaiswal; Shubham Prakash; Bendi Pragnya Sree; Animesh Mukherjee; | arxiv-cs.SD | 2024-10-22 |
| 492 | VoiceBench: Benchmarking LLM-Based Voice Assistants IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field. |
YIMING CHEN et. al. | arxiv-cs.CL | 2024-10-22 |
| 493 | Using Automatic Speech Recognition for Speech Comprehension Evaluation in The Cochlear Implant Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The cochlear implant (CI) is a sophisticated electronic device designed to partially restore hearing for individuals with severe-to-profound hearing loss. To assess speech … |
Hsin-Li Chang; Enoch Hsin-Ho Huang; Yi-Ching Wang; Yu Tsao; | 2024 27th Conference of the Oriental COCOSDA International … | 2024-10-17 |
| 494 | Multilingual Speech Translator for Medical Consultation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The speaking at cross purposes is one of the common problems in elderly care scenarios, especially the cross-language conversation in medical consultations. Constructing a … |
Zhe-Jia Xu; Yeou-Jiunn Chen; Qian-Bei Hong; | 2024 27th Conference of the Oriental COCOSDA International … | 2024-10-17 |
| 495 | Comprehensive Benchmarking and Analysis of Open Pretrained Thai Speech Recognition Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper presents a comprehensive benchmarking and analysis of open pretrained Thai Automatic Speech Recognition (ASR) models, addressing a critical gap in low-resource language … |
Pattara Tipakasorn; Oatsada Chatthong; Ren Yonehana; K. Thangthai; | 2024 27th Conference of the Oriental COCOSDA International … | 2024-10-17 |
| 496 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. |
Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | arxiv-cs.CL | 2024-10-17 |
| 497 | Speech Recognition Models in Assisting Medical History Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper addresses challenges highlighted by health professionals, where up to 50% of a medical consultation’s time is spent on history creation. To streamline this process, we … |
YANNA TORRES GONÇALVES et. al. | Brazilian Symposium on Databases | 2024-10-14 |
| 498 | Investigation of Speaker Representation for Target-Speaker Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network for embedding target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper aims to address a fundamental question: what is the preferred speaker embedding for TS tasks? |
TAKANORI ASHIHARA et. al. | arxiv-cs.SD | 2024-10-14 |
| 499 | Automatic Speech Recognition with BERT and CTC Transformers: A Review IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: All in all, this review provides valuable insights for researchers and practitioners who are interested in ASR with BERT and CTC transformers. |
Noussaiba Djeffal; Hamza Kheddar; Djamel Addou; Ahmed Cherif Mazari; Yassine Himeur; | arxiv-cs.CL | 2024-10-12 |
| 500 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, as well as compiling a dataset comprising Indonesian speech with variabilities to facilitate our study. |
AULIA ADILA et. al. | arxiv-cs.CL | 2024-10-11 |
| 501 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we hypothesize that incorporating speaker representations during speech recognition can enhance model robustness to noise. |
Sagarika Alavilli; Annesya Banerjee; Gasser Elbanna; Annika Magaro; | arxiv-cs.SD | 2024-10-07 |
| 502 | Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for The Polish Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. |
Michał Junczyk; | nips | 2024-10-07 |
| 503 | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et. al. | nips | 2024-10-07 |
| 504 | REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. |
LIANG-HSUAN TSENG et. al. | nips | 2024-10-07 |
| 505 | Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Large language models (LLMs) have started to play a vital role in modelling speech and text. |
Pavel Stepachev; Pinzhen Chen; Barry Haddow; | arxiv-cs.CL | 2024-10-04 |
| 506 | Reverb: Open-Source ASR and Diarization from Rev Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as … |
NISHCHAL BHANDARI et. al. | arxiv-cs.CL | 2024-10-04 |
| 507 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). |
Olga Iakovenko; Ivan Bondarenko; | arxiv-cs.SD | 2024-10-03 |
| 508 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. |
Olga Iakovenko; Ivan Bondarenko; Mariya Borovikova; Daniil Vodolazsky; | arxiv-cs.CL | 2024-10-03 |
| 509 | Neural End-To-End Speech Translation Leveraged By ASR Posterior Distribution Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: SUMMARY End-to-end speech translation (ST) directly renders source language speech to the target language without intermediate automatic speech recognition (ASR) output as in a … |
Yuka Ko; Katsuhito Sudoh; S. Sakti; Satoshi Nakamura; | IEICE Trans. Inf. Syst. | 2024-10-01 |
| 510 | Automatic Speech Recognition for The Ika Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a cost-effective approach for developing Automatic Speech Recognition (ASR) models for low-resource languages like Ika. |
Uchenna Nzenwata; Daniel Ogbuigwe; | arxiv-cs.CL | 2024-10-01 |
| 511 | The OCON Model: An Old But Green Solution for Distributable Supervised Classification for Acoustic Monitoring in Smart Cities Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper explores a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks, focusing on vowel phonemes … |
Stefano Giacomelli; Marco Giordano; C. Rinaldi; | 2024 IEEE 5th International Symposium on the Internet of … | 2024-09-30 |
| 512 | AfriHuBERT: A Self-supervised Speech Representation Model for African Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. |
Jesujoba O. Alabi; Xuechen Liu; Dietrich Klakow; Junichi Yamagishi; | arxiv-cs.CL | 2024-09-30 |
| 513 | Development of Speech Recognition in Wireless Mobile Networks for An Intelligent Learning System in Language Education Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Communication has been crucial to human existence, society, and globalization for millennia. Speech Recognition (SR) technologies include biometric evaluation, security, safety, … |
NARGIS KURBANAZAROVA et. al. | J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl. | 2024-09-30 |
| 514 | Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through this review, we advocate for a shift towards more sophisticated, adaptive clustering techniques that can significantly improve speech enhancement and pave the way for more resilient speech processing systems. |
Abdulhady Abas Abdullah; Aram Mahmood Ahmed; Tarik Rashid; Hadi Veisi; | arxiv-cs.SD | 2024-09-28 |
| 515 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
Brian Yan; Vineel Pratap; Shinji Watanabe; Michael Auli; | arxiv-cs.CL | 2024-09-26 |
| 516 | Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. |
Robin Shing-Hei Yuen; Timothy Tin-Long Tse; Jian Zhu; | arxiv-cs.CL | 2024-09-25 |
| 517 | MT2KD: Toward A General-Purpose Encoder for Speech, Speaker, and Audio Events Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: With advances in deep learning, the performance of end-to-end single-task models for speech and audio processing has been constantly improving. However, it is challenging to build … |
Xiaoyu Yang; Qiujia Li; C. Zhang; P. Woodland; | IEEE Transactions on Audio, Speech and Language Processing | 2024-09-25 |
| 518 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models within the context of continual multilingual learning. |
Andrés Piñeiro-Martín; Carmen García-Mateo; Laura Docío-Fernández; María del Carmen López-Pérez; Georg Rehm; | arxiv-cs.CL | 2024-09-25 |
| 519 | Spelling Correction Through Rewriting of Non-Autoregressive ASR Lattices Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. |
LEONID VELIKOVICH et. al. | arxiv-cs.CL | 2024-09-24 |
| 520 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). |
FENGRUN ZHANG et. al. | arxiv-cs.SD | 2024-09-24 |
| 521 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a novel training approach to enhance LLM performance in ASR tasks. |
Yang Yuhang; Peng Yizhou; Eng Siong Chng; Xionghu Zhong; | arxiv-cs.CL | 2024-09-24 |
| 522 | MultiMed: Multilingual Medical Speech Recognition Via Attention Encoder Decoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. |
KHAI LE-DUC et. al. | arxiv-cs.CL | 2024-09-21 |
| 523 | Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation with Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). |
IULIIA THORBECKE et. al. | arxiv-cs.CL | 2024-09-20 |
| 524 | A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. |
Georgios Sidiropoulos; Evangelos Kanoulas; | arxiv-cs.CL | 2024-09-20 |
| 525 | Unifying Global and Near-Context Biasing in A Single Trie Pass Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite the success of end-to-end automatic speech recognition (ASR) models, challenges persist in recognizing rare, out-of-vocabulary words – including named entities (NE) – and in adapting to new domains using only text data. This work presents a practical approach to address these challenges through an unexplored combination of an NE bias list and a word-level n-gram language model (LM). |
IULIIA THORBECKE et. al. | arxiv-cs.CL | 2024-09-20 |
| 526 | Examining Test-Time Adaptation for Personalized Child Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate the effectiveness of two widely used TTA methods, SUTA and SGEM, in adapting off-the-shelf ASR models and their fine-tuned versions for child speech recognition, with the goal of enabling continuous, unsupervised adaptation at test time. |
ZHONGHAO SHI et. al. | arxiv-cs.LG | 2024-09-19 |
| 527 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. |
Sebastião Quintas; Isabelle Ferrané; Thomas Pellegrini; | arxiv-cs.SD | 2024-09-19 |
| 528 | ASR Benchmarking: Need for A More Representative Conversational Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |
Gaurav Maheshwari; Dmitry Ivanov; Théo Johannet; Kevin El Haddad; | arxiv-cs.CL | 2024-09-18 |
| 529 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
UMBERTO CAPPELLAZZO et. al. | arxiv-cs.CV | 2024-09-18 |
| 530 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2024-09-18 |
| 531 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
CHIEN-CHUN WANG et. al. | arxiv-cs.SD | 2024-09-18 |
| 532 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. |
Haopeng Geng; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2024-09-18 |
| 533 | WER We Stand: Benchmarking Urdu ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. |
SAMEE ARIF et. al. | arxiv-cs.CL | 2024-09-17 |
| 534 | Chain-of-Thought Prompting for Speech Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
KE HU et. al. | arxiv-cs.CL | 2024-09-17 |
| 535 | Speech Recognition for Analysis of Police Radio Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. |
Tejes Srivastava; Ju-Chieh Chou; Priyank Shroff; Karen Livescu; Christopher Graziul; | arxiv-cs.SD | 2024-09-16 |
| 536 | Augmenting Automatic Speech Recognition Models with Disfluency Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. |
Robin Amann; Zhaolin Li; Barbara Bruno; Jan Niehues; | arxiv-cs.CL | 2024-09-16 |
| 537 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. |
CHAO-HAN HUCK YANG et. al. | arxiv-cs.CL | 2024-09-15 |
| 538 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods are often constrained by the capabilities of their speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. |
SHAOJUN LI et. al. | arxiv-cs.SD | 2024-09-13 |
| 539 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
LINGWEI MENG et. al. | arxiv-cs.CL | 2024-09-13 |
| 540 | Exploring SSL Discrete Tokens for Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. |
MINGYU CUI et. al. | arxiv-cs.CL | 2024-09-13 |
| 541 | CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. |
Ahmed Adel Attia; Dorottya Demszky; Tolulope Ogunremi; Jing Liu; Carol Espy-Wilson; | arxiv-cs.CL | 2024-09-13 |
| 542 | Exploring The Impact of Data Quantity on ASR in Extremely Low-resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. |
Yao-Fei Cheng; Li-Wei Chen; Hung-Shin Lee; Hsin-Min Wang; | arxiv-cs.CL | 2024-09-13 |
| 543 | M$^{3}$V: A Multi-modal Multi-view Approach for Device-Directed Speech Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task, introducing unimodal views and a text-audio alignment view in the network besides the multi-modal view. |
ANNA WANG et. al. | arxiv-cs.SD | 2024-09-13 |
| 544 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-09-12 |
| 545 | WhisperNER: Unified Open Named Entity and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. |
GIL AYACHE et. al. | arxiv-cs.CL | 2024-09-12 |
| 546 | The Faetar Benchmark: Speech Recognition in A Very Under-Resourced Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. |
MICHAEL ONG et. al. | arxiv-cs.CL | 2024-09-12 |
| 547 | Enhancing CTC-Based Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). |
Hendrik Laux; Anke Schmeink; | arxiv-cs.CV | 2024-09-11 |
| 548 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Batthacharya; | arxiv-cs.SD | 2024-09-11 |
| 549 | A Large Dataset of Spontaneous Speech with The Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the … |
R. Lima; Sidney Evaldo Leal; Arnaldo Cândido Júnior; S. Aluísio; | ArXiv | 2024-09-10 |
| 550 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. |
Jihyun Lee; Solee Im; Wonjun Lee; Gary Geunbae Lee; | arxiv-cs.CL | 2024-09-10 |
| 551 | Assessing Latency in ASR Systems: A Methodological Perspective for Real-Time Use Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The user-perceived latency of ASR systems differs from that of interpretation because it measures the time between speech and transcription delivery. To address this, we propose a new approach to measuring delay in ASR systems and validate if they are usable in live interpretation scenarios. |
Carlos Arriaga; Alejandro Pozo; Javier Conde; Alvaro Alonso; | arxiv-cs.SD | 2024-09-09 |
| 552 | Quantification of Stylistic Differences in Human- and ASR-produced Transcripts of African American English Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). |
ANNIKA HEUSER et. al. | arxiv-cs.CL | 2024-09-04 |
| 553 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; Elizabeth Sherly; | arxiv-cs.CL | 2024-09-04 |
| 554 | Contemplative Mechanism for Speech Recognition: Speech Encoders Can Think Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Encoders are crucial for speech recognition, and boosting their computation enhances feature quality. However, the common practice of scaling model size to achieve this is … |
Tien-Ju Yang; Andrew Rosenberg; B. Ramabhadran; | Interspeech 2024 | 2024-09-01 |
| 555 | Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating The Impact of Healthy, Disease-Specific, and Speaker-Specific Data Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Despite significant advancements in automatic speech recognition technology (ASR) the performance of such systems on dysarthric speech is still inadequate for widespread use. One … |
Spyretta Leivaditi; Tatsunari Matsushima; Matt Coler; Shekhar Nayak; V. Verkhodanova; | Interspeech 2024 | 2024-09-01 |
| 556 | Self-training ASR Guided By Unsupervised ASR Teacher Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Self-training has gained increasing attention due to its notable performance improvement in speech recognition. However, conventional self-training techniques have two key … |
HYUNG YONG KIM et. al. | Interspeech 2024 | 2024-09-01 |
| 557 | Comparing Discrete and Continuous Space LLMs for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. |
Yaoxun Xu; Shi-Xiong Zhang; Jianwei Yu; Zhiyong Wu; Dong Yu; | arxiv-cs.CL | 2024-09-01 |
| 558 | Reducing Speech Distortion and Artifacts for Speech Enhancement By Loss Function Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Deep learning-based speech enhancement has made significant strides. However, challenges such as speech distortion and artifacts persist. These issues can diminish perceived … |
HAIXING GUAN et. al. | Interspeech 2024 | 2024-09-01 |
| 559 | Speech and Language Recognition with Low-rank Adaptation of Pretrained Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Finetuning large pretrained models demands considerable computational resources, posing practical constraints. Majority of the total number of parameters in these models are used … |
Amrutha Prasad; S. Madikeri; Driss Khalil; P. Motlícek; Christof Schuepbach; | Interspeech 2024 | 2024-09-01 |
| 560 | Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictor Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In recent years, advancements in automatic speech recognition (ASR) systems have led to their widespread use in applications such as call center bots and virtual assistants. … |
Nguyen Manh Tien Anh; Thach Ho Sy; | Interspeech 2024 | 2024-09-01 |
| 561 | Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In recent years, there has been a rise in the popularity of large language model (LLM) based voice assistants. A practical question being raised in the evaluation of cascaded … |
Zhe Liu; Suyoun Kim; Ozlem Kalinli; | Interspeech 2024 | 2024-09-01 |
| 562 | All Ears: Building Self-Supervised Learning Based ASR Models for Indian Languages at Scale Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The abundance of unlabeled speech and its ease of collection calls for the development of self-supervised learning (SSL) based speech foundation models, which have been effective … |
V. Lodagala; Abhishek Biswas; Shoutrik Das; Jordan Fernandes; S. Umesh; | Interspeech 2024 | 2024-09-01 |
| 563 | No-Reference Speech Intelligibility Prediction Leveraging A Noisy-Speech ASR Pre-Trained Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advances in deep learning have improved the capabilities of data-driven speech intelligibility prediction (SIP) algorithms. Nevertheless, the scarcity of speech … |
Haolan Wang; Amin Edraki; Wai-Yip Chan; Iván López-Espejo; Jesper Jensen; | Interspeech 2024 | 2024-09-01 |
| 564 | Speech Recognition for Greek Dialects: A Challenging Benchmark Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Language technologies should be judged on their usefulness in real-world use cases. Despite recent impressive progress in automatic speech recognition (ASR), an often overlooked … |
SOCRATES VAKIRTZIAN et. al. | Interspeech 2024 | 2024-09-01 |
| 565 | Speaker Personalization for Automatic Speech Recognition Using Weight-Decomposed Low-Rank Adaptation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Personalizing automated speech recognition (ASR) for voice assistant systems is often considered the holy grail, requiring meticulous attention to detail in model optimization. … |
George Joseph; Arun Baby; | Interspeech 2024 | 2024-09-01 |
| 566 | SEQ-former: A Context-enhanced and Efficient Automatic Speech Recognition Framework Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Contextual information is crucial for automatic speech recognition (ASR). Effective utilization of contextual information can improve the accuracy of ASR systems. To improve the … |
QINGLIN MENG et. al. | Interspeech 2024 | 2024-09-01 |
| 567 | Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Today’s end-to-end (E2E) ASR models achieve strong performance when applied to adult speech, but deteriorate on children’s speech. Most E2E ASR models are pre-trained on adult … |
Thomas Graave; Zhengyang Li; Timo Lohrenz; Tim Fingscheidt; | Interspeech 2024 | 2024-09-01 |
| 568 | Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children’s Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) often faces challenges in processing children’s speech due to data scarcity. Training large ASR models becomes particularly challenging in such … |
Thomas Rolland; Alberto Abad; | Interspeech 2024 | 2024-09-01 |
| 569 | Introduction To Partial Fine-tuning: A Comprehensive Evaluation Of End-to-end Children’s Automatic Speech Recognition Adaptation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) encounters unique challenges when dealing with children’s speech, mainly due to the scarcity of available data. Training large ASR models with … |
Thomas Rolland; Alberto Abad; | Interspeech 2024 | 2024-09-01 |
| 570 | Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. |
Hao Shi; Yuan Gao; Zhaoheng Ni; Tatsuya Kawahara; | arxiv-cs.SD | 2024-09-01 |
| 571 | LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. |
ZENGRUI JIN et. al. | arxiv-cs.SD | 2024-09-01 |
| 572 | Context-Aware Speech Recognition Using Prompts for Language Learners Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We aim to enhance automatic speech recognition (ASR) systems with context-aware prompts, improving accuracy without needing complex domain-specific language models or fine-tuning. … |
Jian Cheng; | Interspeech 2024 | 2024-09-01 |
| 573 | Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In real-world applications, a preferred personalized automatic speech recognition (ASR) model should exhibit robust recognition capabilities that include both personalization and … |
Yue Gu; Zhihao Du; Shiliang Zhang; Jiqing Han; Yongjun He; | Interspeech 2024 | 2024-09-01 |
| 574 | Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper explores the challenge of recognising relevant but previously unheard named entities in spoken input. This scenario pertains to real-world applications where … |
Xizi Wei; Stephen McGregor; | Interspeech 2024 | 2024-09-01 |
| 575 | Revisiting Convolution-free Transformer for Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Convolution augmented Transformer architectures have dominated the field of automatic speech recognition by showing better WER results when the models are trained on relatively … |
ZEJIANG HOU et. al. | Interspeech 2024 | 2024-09-01 |
| 576 | Leveraging Phonemic Transcription and Whisper Toward Clinically Significant Indices for Automatic Child Speech Assessment Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Diagnosing speech sound disorders (SSD) in children requires professional assessment by speech-language pathologists. Detecting and diagnosing a medical condition takes time and … |
Yeh-Sheng Lin; Shu-Chuan Tseng; Jyh-Shing Roger Jang; | Interspeech 2024 | 2024-09-01 |
| 577 | DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement Network Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The advancements in speech recognition have led to significant progress in predicting clean speech. However, challenges persist in real-world noisy environments. Addressing issues … |
WENJUN WANG et. al. | Interspeech 2024 | 2024-09-01 |
| 578 | ProGRes: Prompted Generative Rescoring on ASR N-Best Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. |
Ada Defne Tur; Adel Moumen; Mirco Ravanelli; | arxiv-cs.CL | 2024-08-30 |
| 579 | Measuring The Accuracy of Automatic Speech Recognition Solutions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. |
Korbinian Kuhn; Verena Kersken; Benedikt Reuter; Niklas Egger; Gottfried Zimmermann; | arxiv-cs.CL | 2024-08-29 |
| 580 | Speech Recognition Transformers: Topological-lingualism Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The paper presents a comprehensive survey of transformer techniques oriented in speech modality. |
Shruti Singh; Muskaan Singh; Virender Kadyan; | arxiv-cs.CL | 2024-08-27 |
| 581 | Audio-Visual Speech Recognition for Human-Robot Interaction: A Feasibility Study Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent models for Visual Speech Recognition (VSR) have shown remarkable progress over the last few years. They have however been applied mainly to datasets such as Lip Reading … |
Sander Goetzee; Konstantin Mihhailov; Roel Van De Laar; Kim Baraka; Koen V. Hindriks; | 2024 33rd IEEE International Conference on Robot and Human … | 2024-08-26 |
| 582 | Self-supervised Speech Representations Still Struggle with African American Vernacular English Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. |
KALVIN CHANG et. al. | arxiv-cs.CL | 2024-08-26 |
| 583 | A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition Applied on French Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The performance of end-to-end automatic speech recognition (ASR) systems enables their increasing integration into numerous applications. While there are various benefits to such … |
Thibault Bañeras-Roux; Mickael Rouvier; Jane Wottawa; Richard Dufour; | 2024 32nd European Signal Processing Conference (EUSIPCO) | 2024-08-26 |
| 584 | Developing Vocal System Impaired Patient-aimed Voice Quality Assessment Approach Using ASR Representation-included Multiple Features Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. |
SHAOXIANG DANG et. al. | arxiv-cs.SD | 2024-08-22 |
| 585 | Towards Measuring Fairness in Speech Recognition: Fair-Speech Dataset IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. |
IRINA-ELENA VELICHE et. al. | arxiv-cs.AI | 2024-08-22 |
| 586 | The State of Commercial Automatic French Legal Speech Recognition Systems and Their Impact on Court Reporters Et Al Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We benchmark three ASR models, including commercial and open-source options, on their ability to recognize French legal speech using a curated dataset. Our study evaluates the performance of these systems using the Word Error Rate (WER) metric and introduces the Sonnex Distance to account for phonetic accuracy. |
Nicolas Garneau; Olivier Bolduc; | arxiv-cs.CL | 2024-08-21 |
| 587 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. |
Prashant Serai; Peidong Wang; Eric Fosler-Lussier; | arxiv-cs.AI | 2024-08-20 |
| 588 | Error-preserving Automatic Speech Recognition of Young English Learners' Language Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the mistakes made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their mistakes. |
JANICK MICHOT et. al. | acl | 2024-08-20 |
| 589 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | acl | 2024-08-20 |
| 590 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. |
SHAOLEI ZHANG et. al. | acl | 2024-08-20 |
| 591 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | acl | 2024-08-20 |
| 592 | XCB: An Effective Contextual Biasing Approach to Bias Cross-lingual Phrases in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing(XCB) module. |
Xucheng Wan; Naijun Zheng; Kai Liu; Huan Zhou; | arxiv-cs.CL | 2024-08-20 |
| 593 | Parameter-Efficient Transfer Learning Under Federated Learning for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This work explores the challenge of enhancing Automatic Speech Recognition (ASR) model performance across various user-specific domains while preserving user data privacy. We … |
Xuan Kan; Yonghui Xiao; Tien-Ju Yang; Nanxin Chen; Rajiv Mathews; | ArXiv | 2024-08-19 |
| 594 | Automatic Speech Recognition Techniques for Transcription of Thai Traditional Medicine Texts Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Currently, Automatic Speech Recognition (ASR) technology is widely used for communication convenience in converting speech to text. Initial surveys reveal that existing studies on … |
Jettasic Popun; Wilaiporn Lee; Akara Prayote; | 2024 21st International SoC Design Conference (ISOCC) | 2024-08-19 |
| 595 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Non-autoregressive (NAR) decoding approach to solve the above problems. |
YANGZE LI et. al. | arxiv-cs.SD | 2024-08-18 |
| 596 | Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and … |
Yinghao Aaron Li; Xilin Jiang; Jordan Darefsky; Ge Zhu; N. Mesgarani; | ArXiv | 2024-08-13 |
| 597 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness Via Noise Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. |
Wonjun Lee; San Kim; Gary Geunbae Lee; | arxiv-cs.CL | 2024-08-12 |
| 598 | Audio Enhancement for Computer Audition — An Iterative Training Paradigm Using Sample Importance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. |
Manuel Milling; Shuo Liu; Andreas Triantafyllopoulos; Ilhan Aslan; Björn W. Schuller; | arxiv-cs.SD | 2024-08-12 |
| 599 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. |
Eunseop Yoon; Hee Suk Yoon; John Harvill; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2024-08-11 |
| 600 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present MooER, a LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads. |
JUNHAO XU et. al. | arxiv-cs.CL | 2024-08-09 |
| 601 | ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). |
RUIXIANG ZHAO et. al. | arxiv-cs.MM | 2024-08-06 |
| 602 | Robust Speech Enhancement Using Dabauchies Wavelet Based Adaptive Wavelet Thresholding for The Development of Robust Automatic Speech Recognition: A Comprehensive Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Mahadevaswamy Shanthamallappa; | Wirel. Pers. Commun. | 2024-08-06 |
| 603 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. |
JAEYOUNG KIM et. al. | arxiv-cs.SD | 2024-08-05 |
| 604 | Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce a novel framework that diverges from typical second-pass rescoring methods. |
Yixuan Tang; Anthony K. H. Tung; | ijcai | 2024-08-03 |
| 605 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms Using Linguistic Features IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. |
PENG CHENG et. al. | arxiv-cs.CR | 2024-08-03 |
| 606 | MECOS: A Bilingual Manipuri-English Spontaneous Code-switching Speech Corpus for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Naorem Karline Singh; Y. J. Chanu; Hoomexsun Pangsatabam; | Comput. Speech Lang. | 2024-08-01 |
| 607 | On The Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We use the comparison of five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training. |
Nick Rossenbach; Ralf Schlüter; Sakriani Sakti; | arxiv-cs.CL | 2024-07-31 |
| 608 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. |
KOHEI MATSUURA et. al. | arxiv-cs.CL | 2024-07-31 |
| 609 | Improving Noisy Student Training for Low-resource Languages in End-to-End ASR Using CycleGAN and Inter-domain Losses Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial … |
C. Li; Ngoc Thang Vu; | ArXiv | 2024-07-26 |
| 610 | Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
HUKAI HUANG et. al. | arxiv-cs.CL | 2024-07-26 |
| 611 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite these advancements, they still struggle to accurately recognize domain specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. |
Jiwon Suh; Injae Na; Woohwan Jung; | arxiv-cs.CL | 2024-07-25 |
| 612 | On The Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). |
Benedikt Hilmes; Nick Rossenbach; Ralf Schlüter; | arxiv-cs.CL | 2024-07-25 |
| 613 | Coupling Speech Encoders with Downstream Text Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. |
Ciprian Chelba; Johan Schalkwyk; | arxiv-cs.CL | 2024-07-24 |
| 614 | Quantifying The Role of Textual Predictability in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic–phonetic modelling. |
Sean Robertson; Gerald Penn; Ewan Dunbar; | arxiv-cs.CL | 2024-07-23 |
| 615 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. |
Hazim Bukhari; Soham Deshmukh; Hira Dhamyal; Bhiksha Raj; Rita Singh; | arxiv-cs.SD | 2024-07-21 |
| 616 | Recognition of Target Domain Japanese Speech Using Language Model Replacement Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should … |
Daiki Mori; Kengo Ohta; Ryota Nishimura; Atsunori Ogawa; N. Kitaoka; | EURASIP J. Audio Speech Music. Process. | 2024-07-20 |
| 617 | Low-Resourced Speech Recognition for Iu Mien Language Via Weakly-Supervised Phoneme-based Multilingual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares the three approaches for Iu Mien speech recognition. |
LUKUAN DONG et. al. | arxiv-cs.SD | 2024-07-18 |
| 618 | Low-Resourced Speech Recognition for Iu Mien Language Via Weakly-Supervised Phoneme-Based Multilingual Pretraining Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are … |
LUKUAN DONG et. al. | 2024 IEEE 14th International Symposium on Chinese Spoken … | 2024-07-18 |
| 619 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding By Provenance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Automatic speech recognition (ASR) models trained on large amounts of audio data are now widely used to convert speech to written text in a variety of applications from video captioning to automated assistants used in healthcare and other domains. |
Changye Li; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-07-18 |
| 620 | Analyzing The Effects of Transcription Errors on Summary Generation of Bengali Spoken Documents Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) has become an indispensable part of the AI domain, with various speech technologies reliant on it. The quality of speech recognition depends on … |
Priyanjana Chowdhury; Nabanika Sarkar; Sanghamitra Nath; Utpal Sharma; | ACM Transactions on Asian and Low-Resource Language … | 2024-07-17 |
| 621 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present novel approaches that use a generative pretrained transformer (GPT) to identify paraphasias from transcripts as well as two end-to-end approaches that focus on modeling both automatic speech recognition (ASR) and paraphasia classification as multiple sequences vs. a single sequence. |
Matthew Perez; Aneesha Sampath; Minxue Niu; Emily Mower Provost; | arxiv-cs.CL | 2024-07-15 |
| 622 | Textless Dependency Parsing By Labeled Sequence Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Although their effectiveness is shown in capturing acoustic features, it is unclear in capturing lexical knowledge. This paper proposes a textless method for dependency parsing, examining its effectiveness and limitations. |
Shunsuke Kando; Yusuke Miyao; Jason Naradowsky; Shinnosuke Takamichi; | arxiv-cs.CL | 2024-07-14 |
| 623 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer Based Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present CUSIDE-T, which successfully adapts the CUSIDE method over the recurrent neural network transducer (RNN-T) ASR architecture, instead of being based on the CTC architecture. |
Wenbo Zhao; Ziwei Li; Chuan Yu; Zhijian Ou; | arxiv-cs.SD | 2024-07-14 |
| 624 | Empowering Whisper As A Joint Multi-Talker and Target-Talker Speech Recognition System IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2024-07-13 |
| 625 | HebDB: A Weakly Supervised Dataset for Hebrew Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. |
ARNON TURETZKY et. al. | arxiv-cs.CL | 2024-07-10 |
| 626 | Bilingual Speech Recognition for Taiwanese Hokkien and Mandarin: Computational Intelligence in Complex Acoustic Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper explores Taiwanese Hokkien, the second most spoken language in Taiwan, often intertwined with Mandarin. Existing Mandarin speech recognition struggles with the … |
Jeng-Shin Sheu; Aftab Ahmad; Jian-Min Li; | 2024 International Conference on Consumer Electronics – … | 2024-07-09 |
| 627 | Noise-Robust Automatic Speech Recognition: A Case Study for Communication Interference Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: An Automatic Speech Recognition (ASR) System is a software tool that converts a speech audio waveform into its corresponding text transcription. ASR systems are usually built … |
J. C. Duarte; Sérgio Colcher; | J. Interact. Syst. | 2024-07-09 |
| 628 | ¿Te Vienes? Sure! Joint Fine-tuning of Language Detection and Transcription Improves Automatic Recognition of Code-Switching Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Human communication in multilingual communities often leads to code-switching, where individuals seamlessly alternate between two or more languages in their daily interactions. … |
Leopold Hillah; Mateusz Dubiel; Luis A. Leiva; | Proceedings of the 6th ACM Conference on Conversational … | 2024-07-08 |
| 629 | Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation and Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and … |
MENGZHE GENG et. al. | IEEE Transactions on Audio, Speech and Language Processing | 2024-07-08 |
| 630 | Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The proposed model aims to develop a speech recognition technology for hearing, speech, or cognitively disabled people. All the available technology in the field of speech … |
Ritabrata Roy Choudhury; | ArXiv | 2024-07-07 |
| 631 | LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. |
HAECHAN KIM et. al. | arxiv-cs.CL | 2024-07-05 |
| 632 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. |
Vyas Raina; Mark Gales; | arxiv-cs.SD | 2024-07-05 |
| 633 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | arxiv-cs.CL | 2024-07-05 |
| 634 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This study yields numerous significant findings that we are discussing in this paper. |
Salima Mdhaffar; Haroun Elleuch; Fethi Bougares; Yannick Estève; | arxiv-cs.CL | 2024-07-05 |
| 635 | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs IF:4 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). |
KEYU AN et. al. | arxiv-cs.SD | 2024-07-04 |
| 636 | Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. |
Cong-Thanh Do; Shuhei Imai; Rama Doddipatla; Thomas Hain; | arxiv-cs.CL | 2024-07-04 |
| 637 | Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. |
Tiia Sildam; Andra Velve; Tanel Alumäe; | arxiv-cs.CL | 2024-07-04 |
| 638 | Improving Self-supervised Pre-training Using Accent-Specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. |
Darshan Prabhu; Abhishek Gupta; Omkar Nitsure; Preethi Jyothi; Sriram Ganapathy; | arxiv-cs.CL | 2024-07-04 |
| 639 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. |
Jinming Chen; Jingyi Fang; Yuanzhong Zheng; Yaoxuan Wang; Haojun Fei; | arxiv-cs.SD | 2024-07-03 |
| 640 | Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via … |
SHUJIE HU et. al. | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-07-03 |
| 641 | Natural Language Processing for Recognizing Bangla Speech with Regular and Regional Dialects: A Survey of Algorithms and Approaches Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Natural Language Processing (NLP) is one of the fundamental domains of Artificial Intelligence (AI). In this paper, we present a systematic review of NLP based research for … |
PARAMITA BASAK UPAMA et. al. | 2024 IEEE 48th Annual Computers, Software, and Applications … | 2024-07-02 |
| 642 | Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life … |
M. Milling; Shuo Liu; Andreas Triantafyllopoulos; Ilhan Aslan; Bjorn W. Schuller; | Journal of Computer Science and Technology | 2024-07-01 |
| 643 | Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech modification methods normalize children’s speech towards adults’ speech, enabling off-the-shelf generic automatic speech recognition (ASR) for this low-resource scenario. … |
Abhijit Sinha; Mittul Singh; Sudarsana Reddy Kadiri; Mikko Kurimo; H. Kathania; | 2024 International Conference on Signal Processing and … | 2024-07-01 |
| 644 | Voice Conversion Based Data Augmentation Using CycleGAN for Children’s ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Extensive use of voice assistants by children in their day-to-day life activities demand for better performance of Automatic Speech Recognition (ASR) for children’s speech. The … |
D. Singh; Preet P. Amin; Hemant A. Patil; Hardik B. Sailor; | 2024 International Conference on Signal Processing and … | 2024-07-01 |
| 645 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-07-01 |
| 646 | PaSCoNT – Parallel Speech Corpus of Northern-central Thai for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
SUPAWAT TAERUNGRUANG et. al. | Comput. Speech Lang. | 2024-07-01 |
| 647 | Error Correction By Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we … |
YUCHUN SHU et. al. | ArXiv | 2024-06-29 |
| 648 | Less Is More: Accurate Speech Recognition & Translation Without Web-Scale Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We argue that state-of-the art accuracy can be reached without relying on web-scale data. |
KRISHNA C. PUVVADA et. al. | arxiv-cs.CL | 2024-06-28 |
| 649 | Enhanced ASR Robustness to Packet Loss with A Front-End Adaptation Network Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose using a front-end adaptation network connected to a frozen ASR model. |
Yehoshua Dissen; Shiry Yonash; Israel Cohen; Joseph Keshet; | arxiv-cs.SD | 2024-06-27 |
| 650 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. |
ZHENG FANG et. al. | arxiv-cs.CR | 2024-06-27 |
| 651 | Command Recognition System Using Convolutional Neural Networks Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper addresses the problem of ASR (Automatic Speech Recognition), a crucial field for the advancement of HCI (Human-Computer Interaction) technologies and applications in … |
Sebastian-Alexandru Dragusin; N. Bizon; Robert-Nicolae Bostinaru; Denis Toma; | 2024 16th International Conference on Electronics, … | 2024-06-27 |
| 652 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. |
Ahmed Heakl; Youssef Zaghloul; Mennatullah Ali; Rania Hossam; Walid Gomaa; | arxiv-cs.CL | 2024-06-26 |
| 653 | Automatic Speech Recognition for Hindi Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations. |
Anish Saha; A. G. Ramakrishnan; | arxiv-cs.CL | 2024-06-26 |
| 654 | Sequence-to-Sequence Models in Italian Atypical Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In the domain of automatic speech recognition (ASR), we explore the usage of a state-of-the-art transformer-based sequence-to-sequence model to build a speaker-dependent isolated … |
Davide Mulfari; Lorenzo Carnevale; Massimo Villari; | 2024 IEEE Symposium on Computers and Communications (ISCC) | 2024-06-26 |
| 655 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. |
Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie; | arxiv-cs.SD | 2024-06-26 |
| 656 | Dynamic Data Pruning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. |
QIAO XIAO et. al. | arxiv-cs.CL | 2024-06-26 |
| 657 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. |
VAN TUNG PHAM et. al. | arxiv-cs.LG | 2024-06-25 |
| 658 | FASA: A Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children’s speech data from many of the existing noisy children’s speech data. |
Dancheng Liu; Jinjun Xiong; | arxiv-cs.CL | 2024-06-25 |
| 659 | Sequential Editing for Lifelong Training of Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. |
Devang Kulshreshtha; Saket Dingliwal; Brady Houston; Nikolaos Pappas; Srikanth Ronanki; | arxiv-cs.CL | 2024-06-25 |
| 660 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a Switch-Conformer-based MoE system named SC-MoE for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). We design a streaming MoE layer consisting of three language experts, corresponding to Mandarin, English, and blank, respectively, and equip the SC-MoE encoder with a language identification (LID) network trained with a Connectionist Temporal Classification (CTC) loss as a router, achieving a real-time streaming CS ASR system. |
Shuaishuai Ye; Shunfei Chen; Xinhui Hu; Xinkang Xu; | arxiv-cs.SD | 2024-06-25 |
| 661 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. |
ADRIANA FERNANDEZ-LOPEZ et. al. | arxiv-cs.CV | 2024-06-25 |
| 662 | Innovations in Real-Time Speech Translation: Leveraging Griffin-Lim Algorithm Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: At present, speech-to-speech translator systems often involve multiple intermediary steps, such as automatic speech recognition (ASR), text-to-text machine translation (MT), and … |
HIMANSHU MAITHANI et. al. | 2024 15th International Conference on Computing … | 2024-06-24 |
| 663 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present KIT’s offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. |
SAI KONERU et. al. | arxiv-cs.CL | 2024-06-24 |
| 664 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech … |
SAI KONERU et. al. | ArXiv | 2024-06-24 |
| 665 | Deep Learning Based Lip Reading for Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) is the process of using machine learning techniques to convert human speech into written text. This paper proposes a deep learning-based lip … |
M. Abishek; S. Harish; R. K. Prasath; D. Sudarshan; P. Supriya; | 2024 15th International Conference on Computing … | 2024-06-24 |
| 666 | Exploring The Capability of Mamba in Speech Applications IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. |
Koichi Miyazaki; Yoshiki Masuyama; Masato Murata; | arxiv-cs.SD | 2024-06-24 |
| 667 | Generative Adversarial Network-Based Voice Synthesis from Spectrograms for Low-Resource Speech Recognition in Mismatched Conditions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The use of Generative Adversarial Networks (GANs) has been increasing in speech recognition tasks, but there has been a significant hurdle due to limited availability. The use of GAN … |
Puneet Bawa; Virender Kadyan; Gunjan Chhabra; | 2024 15th International Conference on Computing … | 2024-06-24 |
| 668 | Enhancing Communication: Utilizing Transfer Learning for Improved Speech-to-Text Transcription Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) transforms spoken language into text facilitating interaction with technology and easing access to information for individuals facing challenges … |
S. D; S. Fazil; | 2024 15th International Conference on Computing … | 2024-06-24 |
| 669 | PI-Whisper: Designing An Adaptive and Incremental Automatic Speech Recognition System for Edge Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we show how the design of PI-Whisper allows for incremental adaptation of new characteristics without the need for repetitive retraining, enhances recognition capabilities, and improves equity and fairness across diverse speaker groups. |
AMIR NASSERELDINE et. al. | arxiv-cs.CL | 2024-06-21 |
| 670 | Perception of Phonological Assimilation By Neural Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). |
Charlotte Pouw; Marianne de Heer Kloots; Afra Alishahi; Willem Zuidema; | arxiv-cs.CL | 2024-06-21 |
| 671 | PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: As edge-based automatic speech recognition (ASR) technologies become increasingly prevalent for the development of intelligent and personalized assistants, three important … |
Amir Nassereldine; Dancheng Liu; Chenhui Xu; Jinjun Xiong; | ArXiv | 2024-06-21 |
| 672 | Massive End-to-end Speech Recognition Models with Time Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. |
WEIRAN WANG et. al. | naacl | 2024-06-20 |
| 673 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | naacl | 2024-06-20 |
| 674 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. |
Suyoung Kim; Jiyeon Hwang; Ho-Young Jung; | naacl | 2024-06-20 |
| 675 | Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. |
Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng; | arxiv-cs.AI | 2024-06-20 |
| 676 | Joint Vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. |
Alexander Blatt; Aravind Krishnan; Dietrich Klakow; | arxiv-cs.CL | 2024-06-19 |
| 677 | Children’s Speech Recognition Through Discrete Token Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we investigate the integration of discrete speech tokens into children’s speech recognition systems as input without significantly degrading the ASR performance. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2024-06-19 |
| 678 | Noise Robust Whisper Features for Dysarthric Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Dysarthria, a speech disorder stemming from difficulties in controlling the relevant muscles, presents formidable challenges to effective communication. This study proposes an … |
Japan Bhatt; Harsh Patel; Hemant A. Patil; | The Speaker and Language Recognition Workshop | 2024-06-18 |
| 679 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. |
Hayato Futami; Siddhant Arora; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe; | arxiv-cs.CL | 2024-06-18 |
| 680 | Bridging The Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. |
KUAN-CHEN WANG et. al. | arxiv-cs.SD | 2024-06-18 |
| 681 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | arxiv-cs.CL | 2024-06-18 |
| 682 | Automatic Speech Recognition for Biomedical Data in Bengali Language Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR … |
Shariar Kabir; Nazmun Nahar; Shyamasree Saha; Mamunur Rashid; | ArXiv | 2024-06-16 |
| 683 | CoSTA: Code-Switched Speech Translation Using Aligned Speech-Text Interleaving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. |
Bhavani Shankar; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2024-06-16 |
| 684 | Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. |
Wenhan Yao; Jiangkun Yang; Yongqiang He; Jia Liu; Weiping Wen; | arxiv-cs.SD | 2024-06-16 |
| 685 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. |
Haoyu Wang; Guoqiang Hu; Guodong Lin; Wei-Qiang Zhang; Jian Li; | arxiv-cs.SD | 2024-06-14 |
| 686 | An Efficient Text Augmentation Approach for Contextualized Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. |
Naijun Zheng; Xucheng Wan; Kai Liu; Ziqing Du; Zhou Huan; | arxiv-cs.SD | 2024-06-14 |
| 687 | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the … |
FRANK SEIDE et. al. | ArXiv | 2024-06-13 |
| 688 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a transcription-free method for joint training using only audio signals. |
WILLIAM RAVENSCROFT et. al. | arxiv-cs.SD | 2024-06-13 |
| 689 | LASER: Learning By Aligning Self-supervised Representations of Speech for Improving Content-related Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named LASER: Learning by Aligning Self-supervised Representations is presented. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-06-13 |
| 690 | Speech ReaLLM — Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. |
FRANK SEIDE et. al. | arxiv-cs.CL | 2024-06-13 |
| 691 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
ZIYANG ZHUANG et. al. | arxiv-cs.SD | 2024-06-13 |
| 692 | The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of … |
SHAREEF BABU KALLURI et. al. | ArXiv | 2024-06-13 |
| 693 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | arxiv-cs.CL | 2024-06-13 |
| 694 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents ML-SUPERB~2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. |
JIATONG SHI et. al. | arxiv-cs.SD | 2024-06-12 |
| 695 | Training Data Augmentation for Dysarthric Automatic Speech Recognition By Text-to-Dysarthric-Speech Synthesis IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. |
Wing-Zin Leung; Mattias Cross; Anton Ragni; Stefan Goetze; | arxiv-cs.SD | 2024-06-12 |
| 696 | Improving Child Speech Recognition with Augmented Child-like Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied … |
Yuanyuan Zhang; Zhengjun Yue; T. Patel; O. Scharenborg; | ArXiv | 2024-06-12 |
| 697 | Towards Unsupervised Speech Recognition Without Pronunciation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. |
JUNRUI NI et. al. | arxiv-cs.CL | 2024-06-12 |
| 698 | PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. |
TRANG LE et. al. | arxiv-cs.CL | 2024-06-11 |
| 699 | AS-70: A Mandarin Stuttered Speech Dataset for Automatic Speech Recognition and Stuttering Event Detection IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. |
RONG GONG et. al. | arxiv-cs.SD | 2024-06-11 |
| 700 | Reading Miscue Detection in Primary School Through Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We found that Hubert Large finetuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). |
Lingyun Gao; Cristian Tejedor-Garcia; Helmer Strik; Catia Cucchiarini; | arxiv-cs.CL | 2024-06-11 |
| 701 | MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling Methods for Learning Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. |
Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah; | arxiv-cs.CL | 2024-06-09 |
| 702 | LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of … |
ZHESHU SONG et. al. | ArXiv | 2024-06-07 |
| 703 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2024-06-06 |
| 704 | Hypernetworks for Personalizing ASR to Atypical Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. |
Max Müller-Eberstein; Dianna Yee; Karren Yang; Gautam Varma Mantena; Colin Lea; | arxiv-cs.LG | 2024-06-06 |
| 705 | Error-preserving Automatic Speech Recognition of Young English Learners’ Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. |
JANICK MICHOT et. al. | arxiv-cs.CL | 2024-06-05 |
| 706 | Text Injection for Neural Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work proposes contextual text injection (CTI) to enhance contextual ASR. |
ZHONG MENG et. al. | arxiv-cs.CL | 2024-06-05 |
| 707 | Discrete Multimodal Transformers with A Pretrained Large Language Model for Mixed-Supervision Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). |
VIET ANH TRINH et. al. | arxiv-cs.CL | 2024-06-04 |
| 708 | Efficiently Train ASR Models That Memorize Less and Perform Better with Per-core Clipping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clip-ping (PCC), across training a wide range of ASR models. |
LUN WANG et. al. | arxiv-cs.CR | 2024-06-04 |
| 709 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition Via Weakly Phonetic Supervision Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper explores the approach of pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. |
Saierdaer Yusuyin; Te Ma; Hao Huang; Wenbo Zhao; Zhijian Ou; | arxiv-cs.SD | 2024-06-04 |
| 710 | Speaking of Accent: A Content Analysis of Accent Misconceptions in ASR Research Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) researchers are working to address the differing transcription performance of ASR by accent or dialect. However, research often has a limited … |
Kerri Prinos; Neal Patwari; Cathleen A. Power; | Proceedings of the 2024 ACM Conference on Fairness, … | 2024-06-03 |
| 711 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. |
Ara Yeroyan; Nikolay Karpov; | arxiv-cs.CL | 2024-06-03 |
| 712 | Pass The Butter: A Study on Desktop-classic Multitasking Robotic Arm Based on Advanced YOLOv7 and BERT Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In order to meet the current societal demand for service robot technology, this study proposes using a miniaturized desktop-level robot (by ROS) as a carrier, locally deploying a natural language model (NLP-BERT), and integrating visual recognition (CV-YOLO) and speech recognition technology (ASR-Whisper) as inputs to achieve autonomous decision-making and rational action by the desktop robot. |
HAOHUA QUE et. al. | arxiv-cs.RO | 2024-05-27 |
| 713 | Denoising LM: Pushing The Limits of Error Correction Models for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Denoising LM (DLM), which is a *scaled* error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. |
ZIJIN GU et. al. | arxiv-cs.LG | 2024-05-24 |
| 714 | Let’s Fuse Step By Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Generative Fusion Decoding (GFD), a novel shallow fusion framework designed to integrate large language models (LLMs) into cross-modal text recognition systems for automatic speech recognition (ASR) and optical character recognition (OCR). |
CHAN-JAN HSU et. al. | arxiv-cs.CL | 2024-05-23 |
| 715 | You Don’t Understand Me!: Comparing ASR Results for L1 and L2 Speakers of Swedish Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. |
Ronald Cumbal; Birger Moell; Jose Lopes; Olof Engwall; | arxiv-cs.CL | 2024-05-22 |
| 716 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This prevents the human users to interrupt the robot, which limits speech-based human-robot interaction. To enable a more natural interaction which allows for such interruptions, we propose an audio processing pipeline for filtering out robot’s ego speech using only a single-channel microphone. |
Yue Li; Florian A. Kunneman; Koen V. Hindriks; | arxiv-cs.HC | 2024-05-22 |
| 717 | Linguistic Analysis of Human-computer Interaction IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This article reviews recent literature investigating speech variation in production and comprehension during spoken language communication between humans and devices. Human speech … |
Georgia Zellou; Nicole Holliday; | Frontiers Comput. Sci. | 2024-05-21 |
| 718 | Non-autoregressive Real-time Accent Conversion Model with Voice Cloning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We have developed the non-autoregressive model for real-time accent conversion with voice cloning. |
Vladimir Nechaev; Sergey Kosyakov; | arxiv-cs.SD | 2024-05-21 |
| 719 | A Study on Speech Recognition By A Neural Network Based on English Speech Feature Parameters Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this study, from the perspective of English speech feature parameters, two feature parameters, the mel-frequency cepstral coefficient (MFCC) and filter bank (Fbank), were … |
Congmin Mao; Sujing Liu; | J. Adv. Comput. Intell. Intell. Informatics | 2024-05-20 |
| 720 | Understanding and Benchmarking The Commonality of Adversarial Examples Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech recognition system converts audio into texts by utilizing deep learning algorithms. Numerous works have demonstrated various adversarial example (AE) attacks, i.e., adding … |
Ruiwen He; Yushi Cheng; Junning Ze; Xiaoyu Ji; Wenyuan Xu; | 2024 IEEE Symposium on Security and Privacy (SP) | 2024-05-19 |
| 721 | Listen Again and Choose The Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth … |
YUCHEN HU et. al. | Annual Meeting of the Association for Computational … | 2024-05-16 |
| 722 | Listen Again and Choose The Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. |
YUCHEN HU et. al. | arxiv-cs.CL | 2024-05-16 |
| 723 | Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. … |
Ahmed Adel Attia; Dorottya Demszky; Tolúlopé Ògúnrèmí; Jing Liu; Carol Y. Espy-Wilson; | ArXiv | 2024-05-15 |
| 724 | Towards Evaluating The Robustness of Automatic Speech Recognition Systems Via Audio Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an attack on ASR systems based on user-customized style transfer. |
WEIFEI JIN et. al. | arxiv-cs.SD | 2024-05-15 |
| 725 | I Know What You Mean: Context-Aware Recognition to Enhance Speech-Based Games Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advances in language processing and speech recognition open up a large opportunity for video game companies to embrace voice interaction as an intuitive feature and … |
Nima Zargham; Mohamed Lamine Fetni; Laura Spillner; Thomas Muender; Rainer Malaka; | Proceedings of the CHI Conference on Human Factors in … | 2024-05-11 |
| 726 | Enhancing Communication Equity: Evaluation of An Automated Speech Recognition Application in Ghana Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In Ghana people who struggle to articulate speech as a result of different conditions experience barriers in interacting with others due to difficulties in being understood. … |
Gifty Ayoka; Giulia Barbareschi; Richard Cave; Catherine Holloway; | Proceedings of the 2024 CHI Conference on Human Factors in … | 2024-05-11 |
| 727 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a simple yet effective method to learn a universal acoustic realization of Whisper's `<|endoftext|>` token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. |
Vyas Raina; Rao Ma; Charles McGhee; Kate Knill; Mark Gales; | arxiv-cs.CL | 2024-05-09 |
| 728 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | arxiv-cs.CL | 2024-05-09 |
| 729 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. |
JINGGUANG TIAN et. al. | arxiv-cs.SD | 2024-05-08 |
| 730 | Open Implementation and Study of BEST-RQ for Speech Processing IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. |
Ryan Whetten; Titouan Parcollet; Marco Dinarelli; Yannick Estève; | arxiv-cs.CL | 2024-05-07 |
| 731 | Mixat: A Data Set of Bilingual Emirati-English Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. |
Maryam Al Ali; Hanan Aldarmaki; | arxiv-cs.CL | 2024-05-04 |
| 732 | Unveiling The Potential of LLM-Based ASR on Chinese Open-Source Datasets IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. |
XUELONG GENG et. al. | arxiv-cs.SD | 2024-05-03 |
| 733 | Integrated End-to-End Automatic Speech Recognition for Languages for Agglutinative Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new … |
A. Bekarystankyzy; O. Mamyrbayev; Tolganay Anarbekova; | ACM Transactions on Asian and Low-Resource Language … | 2024-05-03 |
| 734 | Towards Fair and Inclusive Speech Recognition for Stuttering: Community-led Chinese Stuttered Speech Dataset Creation and Benchmarking Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Despite the widespread adoption of Automatic Speech Recognition (ASR) models in voice-operated products and conversational AI agents, current ASR models perform poorly for people … |
Qisheng Li; Shaomei Wu; | Extended Abstracts of the CHI Conference on Human Factors … | 2024-05-02 |
| 735 | Low-resource Speech Recognition and Dialect Identification of Irish in A Multi-task Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). |
Liam Lonergan; Mengjie Qian; Neasa Ní Chiaráin; Christer Gobl; Ailbhe Ní Chasaide; | arxiv-cs.CL | 2024-05-02 |
| 736 | Efficient Compression of Multitask Multilingual Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. |
Thomas Palmeira Ferraz; | arxiv-cs.CL | 2024-05-01 |
| 737 | Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Confidence scores of automatic speech recognition (ASR) outputs are often inadequately communicated, preventing its seamless integration into analytical workflows. In this paper, we introduce ConFides, a visual analytic system developed in collaboration with intelligence analysts to address this issue. |
Sunwoo Ha; Chaehun Lim; R. Jordan Crouser; Alvitta Ottley; | arxiv-cs.HC | 2024-04-30 |
| 738 | Convolutional Neural Networks to Facilitate The Continuous Recognition of Arabic Speech with Independent Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) is a field of research that focuses on the ability of computers to process and interpret speech feedback from humans and to provide the highest … |
Sally A. Sayed; R. A. A. Seoud; H. Y. A. Naby; | J. Electr. Comput. Eng. | 2024-04-29 |
| 739 | Toward Robust ASR System Against Audio Adversarial Examples Using Agitated Logit Abstract: Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples, which aim at deceiving ASR systems by adding perturbations to benign speech signals. These … |
N. Park; Jong Kim; | ACM Transactions on Privacy and Security | 2024-04-26 |
| 740 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? Highlight: We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. |
RUBEN JANSSENS et al. | arxiv-cs.CL | 2024-04-26 |
| 741 | Automatic Speech Recognition System-Independent Word Error Rate Estimation Highlight: In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. |
Chanho Park; Mingjie Chen; Thomas Hain; | arxiv-cs.CL | 2024-04-25 |
| 742 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information Highlight: This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. |
Chihiro Taguchi; Jefferson Saransig; Dayana Velásquez; David Chiang; | arxiv-cs.CL | 2024-04-23 |
| 743 | Analyzing The Impact of HF-Specific Signal Degradation on Automatic Speech Recognition Abstract: Analog radio still constitutes an important fraction of military communications. In the context of a fully digitized radio frequency (RF) reconnaissance chain, this leads to … |
Fabian Fritz; Alessia Cornaggia; Lukas Henneke; Frank Kurth; Kevin Wilkinghoff; | 2024 International Conference on Military Communication and … | 2024-04-23 |
| 744 | Enhancing ASR Performance Through Relative Word Frequency in OCR and Normal Word Frequency Analysis Abstract: With the growing interest in Conversational AI, a system that enables machines to engage in human-like dialogues, there has been an increased focus on Automatic Speech Recognition … |
KYUDAN JUNG et al. | 2024 IEEE 6th International Conference on AI Circuits and … | 2024-04-22 |
| 745 | Semantically Corrected Amharic Automatic Speech Recognition Highlight: In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. |
Samuael Adnew; Paul Pu Liang; | arxiv-cs.CL | 2024-04-20 |
| 746 | Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech Abstract: Voice-based applications are ruling over the era of automation because speech has a lot of factors that determine a speaker's information as well as speech. Modern Automatic Speech … |
Hasmot Ali; M. F. Hossain; Md. Mehedi Hasan; Sheikh Abujar; S. R. H. Noori; | ArXiv | 2024-04-18 |
| 747 | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Highlight: This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. |
Ye Bai; Chenxing Li; Hao Li; Yuanyuan Zhao; Xiaorui Wang; | arxiv-cs.SD | 2024-04-17 |
| 748 | Is The Same Performance Really The Same?: Understanding How Listeners Perceive ASR Results Differently According to The Speaker’s Accent Abstract: Research suggests that automatic speech recognition (ASR) systems, which automatically convert speech to text, show different performances according to various input classes … |
Seoyoung Kim; Yeon Su Park; Dakyeom Ahn; Jin Myung Kwak; Juho Kim; | Proceedings of the ACM on Human-Computer Interaction | 2024-04-17 |
| 749 | Keep Decoding Parallel With Effective Knowledge Distillation From Language Models To End-To-End Speech Recognisers Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
M. Hentschel; Y. Nishikawa; T. Komatsu; Y. Fujita; | icassp | 2024-04-15 |
| 750 | Automatic Speech Recognition Tuned for Child Speech in The Classroom IF:3 Highlight: By tuning OpenAI’s Whisper model we achieve a 38% relative reduction in word error rate (WER) to 9.2% on the public MyST dataset of child speech – the lowest yet reported – and a 7% relative reduction to reach 54% WER on a more challenging classroom speech dataset (ISAT). |
R. Southwell; | icassp | 2024-04-15 |
| 751 | Automatic Channel Selection and Spatial Feature Integration for Multi-Channel Speech Recognition Across Various Array Topologies Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
B. MU et al. | icassp | 2024-04-15 |
| 752 | How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts? Highlight: Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. |
K. Iwamoto; | icassp | 2024-04-15 |
| 753 | Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers Highlight: In this work, we present a multitask alternative to the joint training approach. |
S. Kumar; | icassp | 2024-04-15 |
| 754 | Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming Highlight: This paper proposes two novel variants of neural reprogramming to enhance wake word recognition in streaming end-to-end ASR models without updating model weights. |
P. -J. Ku; | icassp | 2024-04-15 |
| 755 | SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
C. -J. Lin; | icassp | 2024-04-15 |
| 756 | Stable Distillation: Regularizing Continued Pre-Training for Low-Resource Automatic Speech Recognition Highlight: This paper proposes Stable Distillation, a simple and novel approach for SSL-based continued pre-training that boosts ASR performance in the target domain where both labeled and unlabeled data are limited. |
A. Seth; S. Ghosh; S. Umesh; D. Manocha; | icassp | 2024-04-15 |
| 757 | Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters IF:3 Highlight: However, the learning targets of automatic speech recognition (ASR) and other attribute recognition are apparently in conflict. Therefore, we propose to employ different adaptation methods for different tasks in multiple finetuning stages. |
Y. Gao; H. Shi; C. Chu; T. Kawahara; | icassp | 2024-04-15 |
| 758 | Leveraging Data Collection and Unsupervised Learning for Code-Switched Tunisian Arabic Automatic Speech Recognition Highlight: In this paper, we address the aforementioned ASR challenge, focusing on the Tunisian dialect. |
A. A. B. Abdallah; A. Kabboudi; A. Kanoun; S. Zaiem; | icassp | 2024-04-15 |
| 759 | Multilingual DistilWhisper: Efficient Distillation of Multi-Task Speech Models Via Language-Specific Experts IF:3 Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. |
T. P. Ferraz; M. Zanon Boito; C. Brun; V. Nikoulina; | icassp | 2024-04-15 |
| 760 | Synthetic Conversations Improve Multi-Talker ASR Highlight: In this study, we propose a novel methodology called Systematic Synthetic Conversations (SSC), which leverages conventional ASR datasets to help an end-to-end (E2E) multi-talker ASR model establish new state-of-the-art results across synthetic and authentic multi-talker datasets. |
T. -B. Nguyen; A. Waibel; | icassp | 2024-04-15 |
| 761 | Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation Highlight: In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. |
S. Papi; | icassp | 2024-04-15 |
| 762 | Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search IF:3 Highlight: This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). |
Y. Sudo; M. Shakeel; Y. Fukumoto; Y. Peng; S. Watanabe; | icassp | 2024-04-15 |
| 763 | Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition IF:3 Highlight: In this paper, we explore an alternative approach by adapting a pretrained LLM to speech. |
S. Ling; | icassp | 2024-04-15 |
| 764 | Large Language Models As A Proxy For Human Evaluation In Assessing The Comprehensibility Of Disordered Speech Transcription Highlight: Human evaluation is the gold standard for this, but it can be laborious, slow, and expensive. In this work, we tune and evaluate large language models for this task and find them to be a much better proxy for human evaluators than other metrics commonly used. |
K. Tomanek; | icassp | 2024-04-15 |
| 765 | Text-Only Unsupervised Domain Adaptation for Neural Transducer-Based ASR Personalization Using Synthesized Data Highlight: In this study, we explore the problem of personalization from a domain adaptation perspective and highlight the potential risk of overfitting associated with synthesized speech. |
D. -H. Kim; J. -H. Lee; J. -H. Chang; | icassp | 2024-04-15 |
| 766 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition IF:3 Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
S. Shen; Y. Gao; F. Liu; H. Wang; A. Zhou; | icassp | 2024-04-15 |
| 767 | BRAVEn: Improving Self-supervised Pre-training for Visual and Auditory Speech Recognition IF:3 Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
A. Haliassos; A. Zinonos; R. Mira; S. Petridis; M. Pantic; | icassp | 2024-04-15 |
| 768 | Cross-Modal Parallel Training for Improving End-to-end Accented Speech Recognition Highlight: In this study, we propose a Cross-modal Parallel Training (CPT) approach for improving the accent robustness of a state-of-the-art Conformer-Transducer (Conformer-T) ASR system. |
R. Dong; Y. Li; D. Xu; Y. Long; | icassp | 2024-04-15 |
| 769 | Leveraging Large Language Models for Exploiting ASR Uncertainty IF:3 Highlight: To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. |
P. Dighe; | icassp | 2024-04-15 |
| 770 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Highlight: In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks that we term bi-level joint unsupervised and supervised training (BL-JUST). |
A. F. M. SAIF et al. | icassp | 2024-04-15 |
| 771 | Augmenting Conformers With Structured State-Space Sequence Models For Online Speech Recognition Highlight: In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. |
H. SHAN et al. | icassp | 2024-04-15 |
| 772 | Corpus Synthesis for Zero-Shot ASR Domain Adaptation Using Large Language Models Highlight: In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. |
H. Su; | icassp | 2024-04-15 |
| 773 | Sparsely Shared LoRA on Whisper for Child Speech Recognition IF:3 Highlight: However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter composition types of PEFT approaches such as LoRA and Bitfit are investigated as they do not bring extra inference costs. |
W. Liu; Y. Qin; Z. Peng; T. Lee; | icassp | 2024-04-15 |
| 774 | Speech Collage: Code-Switched Audio Generation By Collaging Monolingual Corpora Highlight: To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. |
A. Hussein; | icassp | 2024-04-15 |
| 775 | Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing Highlight: A common solution has been to formulate long-form speech processing into a streaming problem, only using limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. |
W. Chen; T. Kano; A. Ogawa; M. Delcroix; S. Watanabe; | icassp | 2024-04-15 |
| 776 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction IF:3 Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
J. He; X. Shi; X. Li; T. Toda; | icassp | 2024-04-15 |
| 777 | Boosting End-to-End Multilingual Phoneme Recognition Through Exploiting Universal Speech Attributes Constraints Highlight: We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about speech articulators. |
H. Yen; S. M. Siniscalchi; C. -H. Lee; | icassp | 2024-04-15 |
| 778 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition IF:3 Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
H. Wang; P. Guo; P. Zhou; L. Xie; | icassp | 2024-04-15 |
| 779 | FastInject: Injecting Unpaired Text Data Into CTC-Based ASR Training Highlight: However, E2E ASR models trained on paired speech-text data often suffer from domain shifts from training to testing. To alleviate this issue, this paper proposes a flat-start joint training method, named FastInject, which efficiently injects multi-domain unpaired text data into CTC-based ASR training. |
K. Deng; P. C. Woodland; | icassp | 2024-04-15 |
| 780 | Significant ASR Error Detection for Conversational Voice Assistants Highlight: In this work, we propose a system that can determine, to a high degree of accuracy, whether the semantics of a predicted and reference transcript are significantly different. |
J. Harvill; | icassp | 2024-04-15 |
| 781 | Exploring Adapters with Conformers for Children’s Automatic Speech Recognition Highlight: In this study, we explore an alternative approach known as Adapter transfer. |
T. Rolland; A. Abad; | icassp | 2024-04-15 |
| 782 | Task Vector Algebra for ASR Models Highlight: We propose two novel applications of task vectors to ASR. |
G. Ramesh; K. Audhkhasi; B. Ramabhadran; | icassp | 2024-04-15 |
| 783 | Dementia Assessment Using Mandarin Speech with An Attention-Based Speech Recognition Encoder Highlight: This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. |
Z. -J. LIN et al. | icassp | 2024-04-15 |
| 784 | LCB-Net: Long-Context Biasing for Audio-Visual Speech Recognition Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
F. Yu; H. Wang; X. Shi; S. Zhang; | icassp | 2024-04-15 |
| 785 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study IF:3 Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. Ronny Huang; | icassp | 2024-04-15 |
| 786 | Improving Attention-Based End-to-End Speech Recognition By Monotonic Alignment Attention Matrix Reconstruction Highlight: On the contrary, some studies have shown that for non-streaming attention-based models, monotonic alignment is beneficial to model performance. Based on this motivation, we propose the enhanced Gaussian Monotonic Alignment (e-GMA), which reduces the difficulty of learning monotonic alignment, and the reconstructed attention matrix leads to an improved accuracy in ASR tasks. |
Z. Zhuang; | icassp | 2024-04-15 |
| 787 | Improved Children’s Automatic Speech Recognition Combining Adapters and Synthetic Data Augmentation Highlight: In this work, we use Adapters to handle the domain mismatch when fine-tuning with TTS data. |
T. Rolland; A. Abad; | icassp | 2024-04-15 |
| 788 | Connecting Speech Encoder and Large Language Model for ASR IF:4 Highlight: This paper presents a comparative study of three commonly used structures as connectors, including fully connected layers, multi-head cross-attention, and Q-Former. |
W. Yu; | icassp | 2024-04-15 |
| 789 | AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition IF:3 Highlight: We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as suppression of cross-talk speech from non-target directions and noise. When ASR work is part of a broader system-development process, one may be faced with changes to microphone geometries as system development progresses. This paper aims to make multi-channel ASR insensitive to limited variations of microphone-array geometry. |
J. Lin; | icassp | 2024-04-15 |
| 790 | Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS IF:3 Highlight: This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. |
Y. Yang; | icassp | 2024-04-15 |
| 791 | Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation IF:3 Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
H. Wang; | icassp | 2024-04-15 |
| 792 | Loss Masking Is Not Needed In Decoder-Only Transformer For Discrete-Token-Based ASR IF:3 Highlight: In this paper, we propose to model speech tokens in an autoregressive way, similar to text. |
Q. Chen; | icassp | 2024-04-15 |
| 793 | Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model Highlight: In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). |
J. Xie; | icassp | 2024-04-15 |
| 794 | Task Oriented Dialogue As A Catalyst for Self-Supervised Automatic Speech Recognition Highlight: In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. |
D. M. Chan; S. Ghosh; H. Tulsiani; A. Rastrow; B. Hoffmeister; | icassp | 2024-04-15 |
| 795 | Prompting Large Language Models with Speech Recognition Abilities IF:4 Highlight: In this paper we extend the capabilities of an LLM by directly attaching a small audio encoder, allowing it to perform speech recognition. |
Y. Fathullah; | icassp | 2024-04-15 |
| 796 | VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks IF:3 Highlight: We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. |
S. MAITI et al. | icassp | 2024-04-15 |
| 797 | Towards Automatic Data Augmentation for Disordered Speech Recognition Highlight: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. |
Z. Jin; | icassp | 2024-04-15 |
| 798 | Multi-Modality Speech Recognition Driven By Background Visual Scenes Highlight: In this work, we combined the AVNS dataset (providing background sound) with the largest benchmark LRS3 dataset (providing target speech) to create adverse noise conditions for the AVSR model. |
C. Luo; Y. Liu; W. Sun; Z. Sun; | icassp | 2024-04-15 |
| 799 | Build A 50+ Hours Chinese Mandarin Corpus for Children’s Speech Recognition Highlight: The purpose of our research is to establish a children’s speech corpus for children’s speech recognition. |
H. Xu; J. Yang; J. Wang; W. Hu; | icassp | 2024-04-15 |
| 800 | Enhancing Code-Switching Speech Recognition With Interactive Language Biases IF:3 Highlight: The interaction between various resolutions of language biases is subsequently explored in this work. |
H. Liu; L. P. Garcia; X. Zhang; A. W. H. Khong; S. Khudanpur; | icassp | 2024-04-15 |
| 801 | Concealing Medical Condition By Node Toggling in ASR for Dementia Patients Highlight: In this work, we focus on learning ASR for dementia patients without revealing their medical condition. |
W. -T. Hsu; C. -P. Chen; C. -C. Lee; | icassp | 2024-04-15 |
| 802 | Extending Whisper with Prompt Tuning to Target-Speaker ASR IF:3 Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
H. Ma; Z. Peng; M. Shao; J. Li; J. Liu; | icassp | 2024-04-15 |
| 803 | Personalization of CTC-Based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization IF:3 Highlight: In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. |
Z. Lei; | icassp | 2024-04-15 |
| 804 | Inappropriate Pause Detection in Dysarthric Speech Using Large-Scale Speech Recognition Highlight: To this end, we propose a task design, a labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
J. Lee; Y. Choi; T. -J. Song; M. -W. Koo; | icassp | 2024-04-15 |
| 805 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models IF:3 Highlight: We propose an approach that builds on a pre-trained ASR model and extends it with an adaptive upstream module that fuses audio and visual information. |
C. Simic; T. Bocklet; | icassp | 2024-04-15 |
| 806 | ViLaS: Exploring The Effects of Vision and Language Context in Automatic Speech Recognition Highlight: We explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition. |
Z. Ni; | icassp | 2024-04-15 |
| 807 | Correction Focused Language Model Training For Speech Recognition Highlight: In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. |
Y. Ma; Z. Liu; O. Kalinli; | icassp | 2024-04-15 |
| 808 | Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization Highlight: However, when the target speaker is not accurately separated, ASR models face limitations in reaching their peak performance. To address this issue, we propose a speaker-adaptive ASR framework that possesses more implicit target speaker enhancement capability by efficiently jointly optimizing speaker recognition (SR) and ASR models. |
M. Wu; | icassp | 2024-04-15 |
| 809 | PromptASR for Contextualized ASR with Controllable Style IF:3 Highlight: Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. |
X. Yang; | icassp | 2024-04-15 |
| 810 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning with Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
K. Everson; | icassp | 2024-04-15 |
| 811 | USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. |
S. Ding; | icassp | 2024-04-15 |
| 812 | Improving Kinyarwanda Speech Recognition Via Semi-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we empirically show that using self-supervised pretraining, following a curriculum schedule during fine-tuning and using semi-supervised learning improve speech recognition for Kinyarwanda. |
A. Nzeyimana; | icassp | 2024-04-15 |
| 813 | A Study on The Adverse Impact of Synthetic Speech on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, synthetic speech is prone to being mixed with real human speech as part of the background noise recorded by the microphone, which leads to a performance decrease in speech recognition. To address this issue, we propose different methods to study the adverse impact of synthetic speech on speech recognition, thereby enhancing its robustness. |
J. Huang; Y. Bai; Y. Cai; W. Bian; | icassp | 2024-04-15 |
| 814 | Small-Footprint Automatic Speech Recognition System Using Two-Stage Transfer Learning Based Symmetrized Ternary Weight Network Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional automatic speech recognition (ASR) models face challenges when deployed on edge devices due to their high computational requirements and storage demands. To address this issue, we present a novel ASR system specifically designed for edge applications, encompassing both keyword spotting (KWS) and speaker verification (SV) functionalities with on-chip learning for speaker registration. |
X. Zhang; H. Kou; C. Xia; H. Cai; B. Liu; | icassp | 2024-04-15 |
| 815 | SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. |
H. WANG et. al. | icassp | 2024-04-15 |
| 816 | AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Adaptive Maximum Entropy Regularization (AdaMER), a technique that can modulate the impact of entropy regularization throughout the training process. |
S. EOM et. al. | icassp | 2024-04-15 |
| 817 | Extending Large Language Models for Speech and Audio Captioning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Meanwhile, automatic speech recognition (ASR) and automatic audio captioning (AAC) are often achieved with separate systems, resulting in incomplete auditory perception abilities. To fill in these gaps, in this paper, we present the first study that achieves both ASR and AAC by connecting an LLM with auditory encoders. |
C. Tang; | icassp | 2024-04-15 |
| 818 | SALM: Speech-Augmented Language Model with In-Context Learning for Speech Recognition and Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. |
Z. Chen; | icassp | 2024-04-15 |
| 819 | Noise Masking Attacks and Defenses for Pretrained Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
M. Jagielski; O. Thakkar; L. Wang; | icassp | 2024-04-15 |
| 820 | Can Whisper Perform Speech-Based In-Context Learning? IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. |
S. Wang; C. -H. Yang; J. Wu; C. Zhang; | icassp | 2024-04-15 |
| 821 | One Model to Rule Them All ? Towards End-to-End Joint Speaker Diarization and Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). |
S. Cornell; J. -W. Jung; S. Watanabe; S. Squartini; | icassp | 2024-04-15 |
| 822 | Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, there is still substantial room for improvement, owing to the complex computation of visual modules and the ineffective fusion of audio-visual modalities. To eliminate these drawbacks, we propose a down-up sampling-based AVSR model (Hourglass-AVSR) that enjoys high efficiency and performance, whose time length is scaled during the intermediate processing, resembling an hourglass. |
F. Yu; H. Wang; Z. Ma; S. Zhang; | icassp | 2024-04-15 |
| 823 | End-to-End Speech Translation with Mutual Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we find that triple-task MTL (ST+MT+ASR) suffers from a knowledge transfer limitation that leads to performance stagnation compared with dual-task MTL (ST+MT or ST+ASR). |
H. Wang; Z. Xue; Y. Lei; D. Xiong; | icassp | 2024-04-15 |
| 824 | LITEVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
H. LAUX et. al. | icassp | 2024-04-15 |
| 825 | Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the E2E architecture does not explicitly model overlapping speech regions, potentially limiting the model’s ability to generalize. To tackle this issue, we introduce two approaches: an overlap-aware encoding method and a monotonic attention loss. |
T. LI et. al. | icassp | 2024-04-15 |
| 826 | Contextual Biasing Methods for Improving Rare Word Detection in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Typically, such words have limited occurrences in training data, making it impractical to retrain the ASR system. This paper explores innovative word-boosting techniques to improve the detection rate of such rare words in the ASR hypotheses for the ATC domain. |
M. Bhattacharjee; | icassp | 2024-04-15 |
| 827 | Attention-Guided Adaptation for Code-Switching Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. |
B. Aditya; M. Rohmatillah; L. -H. Tai; J. -T. Chien; | icassp | 2024-04-15 |
| 828 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
T. -T. Yang; H. -W. Wang; Y. -C. Wang; C. -H. Lin; B. Chen; | icassp | 2024-04-15 |
| 829 | Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we attempt to build an SLU system by integrating information from two modalities, i.e., speech and text, and concurrently optimizing the associated tasks. |
E. Kim; Y. Tang; T. Ki; D. Neelagiri; V. R. Apsingek; | icassp | 2024-04-15 |
| 830 | Are Soft Prompts Good Zero-Shot Learners for Speech Recognition? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, how and why this works is not well understood. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). |
D. Ng; | icassp | 2024-04-15 |
| 831 | CSNet: Contrastive Siamese Network for Robust SLU Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a siamese network with contrastive learning to enhance SLU effects. |
H. Yang; M. Zhang; D. Wei; J. Guo; | icassp | 2024-04-15 |
| 832 | SCORE: Self-Supervised Correspondence Fine-Tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
A. Meghanani; T. Hain; | icassp | 2024-04-15 |
| 833 | Unsupervised Speech Recognition with N-skipgram and Positional Unigram Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. |
L. Wang; M. Hasegawa-Johnson; C. D. Yoo; | icassp | 2024-04-15 |
| 834 | Learning Speech Representation from Contrastive Token-Acoustic Pretraining IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named Contrastive Token-Acoustic Pretraining (CTAP), which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. |
C. Qiang; | icassp | 2024-04-15 |
| 835 | Folding Attention: Memory and Power Optimization for On-Device Transformer-Based Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. |
Y. Li; | icassp | 2024-04-15 |
| 836 | Robust Speaker Personalisation Using Generalized Low-Rank Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, generalized LoRA is used to refine the state-of-the-art cascaded conformer transducer model. |
A. Baby; G. Joseph; S. Singh; | icassp | 2024-04-15 |
| 837 | Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper proposes two novel variants of neural reprogramming to enhance wake word recognition in streaming end-to-end ASR models without updating model weights. The first, … |
PIN-JUI KU et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 839 | Synthetic Conversations Improve Multi-Talker ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In recent times, automatic speech recognition (ASR) has seen remarkable progress, particularly in recognizing dominant speakers. Nevertheless, the challenge of multi-talker … |
Thai-Binh Nguyen; Alexander Waibel; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 841 | Task Vector Algebra for ASR Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Vector representations of text and speech signals such as word2vec and wav2vec are used commonly in automatic speech recognition (ASR) and spoken language understanding systems. … |
Gowtham Ramesh; Kartik Audhkhasi; B. Ramabhadran; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 842 | Large Language Models As A Proxy For Human Evaluation In Assessing The Comprehensibility Of Disordered Speech Transcription Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement particularly in the recognition of disordered … |
KATRIN TOMANEK et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 844 | Automatic Speech Recognition Tuned for Child Speech in The Classroom IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: K-12 school classrooms have proven to be a challenging environment for Automatic Speech Recognition (ASR) systems, both due to background noise and conversation, and differences … |
ROSY SOUTHWELL et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 846 | The Fosafer System for The ICASSP2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper presents Fosafer’s submissions to the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), which includes both the Automatic Speech … |
Shangkun Huang; Yuxuan Du; Yankai Wang; Jing Deng; Rong Zheng; | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
| 847 | Improved Children’s Automatic Speech Recognition Combining Adapters and Synthetic Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Children’s automatic speech recognition (ASR) poses a significant challenge due to the high variability nature of children’s speech. The limited availability of training datasets … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 849 | Exploring Adapters with Conformers for Children’s Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The high variability in acoustic, pronunciation, and linguistic characteristics of children’s speech makes children’s automatic speech recognition (ASR) a complex task. … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 850 | Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been independently trained to generate comprehensive transcripts accompanied by … |
SHASHI KUMAR et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 851 | Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This study investigates the effective finetuning of a pretrained model using adapters for speech emotion recognition (SER). Since emotion is related with linguistic and prosodic … |
Yuan Gao; Hao Shi; Chenhui Chu; Tatsuya Kawahara; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 852 | Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging … |
William Chen; Takatomo Kano; A. Ogawa; Marc Delcroix; Shinji Watanabe; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
| 853 | SIR-Progressive Audio-Visual TF-Gridnet with ASR-Aware Selector for Target Speaker Extraction in MISP 2023 Challenge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: TF-GridNet has demonstrated its effectiveness in speech separation and enhancement. In this paper, we extend its capabilities for progressive audio-visual speech enhancement by … |
ZHONGSHU HOU et. al. | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
| 854 | Automatic Speech Recognition Advancements for Indigenous Languages of The Americas Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. |
Monica Romero; Sandra Gomez; Ivan G. Torre; | arxiv-cs.CL | 2024-04-12 |
| 855 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. |
Tien-Hong Lo; Fu-An Chao; Tzu-I Wu; Yao-Ting Sung; Berlin Chen; | arxiv-cs.SD | 2024-04-11 |
| 856 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in The Medical Domain IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present VietMed – a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. |
Khai Le-Duc; | arxiv-cs.CL | 2024-04-08 |
| 857 | Mai Ho’omāuna I Ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. |
Kaavya Chaparala; Guido Zarrella; Bruce Torres Fischer; Larry Kimura; Oiwi Parker Jones; | arxiv-cs.CL | 2024-04-03 |
| 858 | Noise Masking Attacks and Defenses for Pretrained Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
Matthew Jagielski; Om Thakkar; Lun Wang; | arxiv-cs.LG | 2024-04-02 |
| 859 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
Alexandros Haliassos; Andreas Zinonos; Rodrigo Mira; Stavros Petridis; Maja Pantic; | arxiv-cs.CV | 2024-04-02 |
| 860 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
Siyuan Shen; Yu Gao; Feng Liu; Hanyang Wang; Aimin Zhou; | arxiv-cs.SD | 2024-03-28 |
| 861 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. |
YASH JAIN et. al. | arxiv-cs.CL | 2024-03-28 |
| 862 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion for NEC on ASR transcriptions. |
Yi-Cheng Wang; Hsin-Wei Wang; Bi-Cheng Yan; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-03-26 |
| 863 | Allot? Is A Lot! Towards Developing More Generalized Speech Recognition System for Accessible Communication Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The proliferation of Automatic Speech Recognition (ASR) systems has revolutionized translation and transcription. However, challenges persist in ensuring inclusive communication … |
Grisha Bandodkar; Shyam Agarwal; Athul Krishna Sughosh; Sahilbir Singh; Taeyeong Choi; | AAAI Conference on Artificial Intelligence | 2024-03-24 |
| 864 | More Than Words: Advancements and Challenges in Speech Recognition for Singing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. |
Anna Kruspe; | arxiv-cs.SD | 2024-03-14 |
| 865 | Multimodal Information Fusion Method in Emotion Recognition in The Background of Artificial Intelligence Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advances in Semantic IoT data integration have highlighted the importance of multimodal fusion in emotion recognition systems. Human emotions, formed through innate … |
Zhen Dai; Hongxiao Fei; Chunyan Lian; | Internet Technology Letters | 2024-03-12 |
| 866 | A Review on Gujarati Language Based Automatic Speech Recognition (ASR) Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Mohit Dua; Bhavesh Bhagat; Shelza Dua; N. Chakravarty; | Int. J. Speech Technol. | 2024-03-12 |
| 867 | Automatic Speech Recognition (ASR) for The Diagnosis of Pronunciation of Speech Sound Disorders in Korean Children Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. |
TAEKYUNG AHN et. al. | arxiv-cs.CL | 2024-03-12 |
| 868 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. |
Jiayu Du; Jinpeng Li; Guoguo Chen; Wei-Qiang Zhang; | arxiv-cs.CL | 2024-03-12 |
| 869 | Dataset and Evaluation of Automatic Speech Recognition for Multi-lingual Intent Recognition on Social Robots Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: While Automatic Speech Recognition (ASR) systems excel in controlled environments, challenges arise in robot-specific setups due to unique microphone requirements and added noise … |
Antonio Andriella; Raquel Ros; Yoav Ellinson; Sharon Gannot; S. Lemaignan; | 2024 19th ACM/IEEE International Conference on Human-Robot … | 2024-03-11 |
| 870 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-03-10 |
| 871 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. |
Yufeng Yang; Ashutosh Pandey; DeLiang Wang; | arxiv-cs.SD | 2024-03-10 |
| 872 | A New Benchmark for Evaluating Automatic Speech Recognition in The Arabic Call Domain Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. |
QUSAI ABO OBAIDAH et. al. | arxiv-cs.AI | 2024-03-07 |
| 873 | Kirigami IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) … |
Sudershan Boovaraghavan; Haozhe Zhou; Mayank Goel; Yuvraj Agarwal; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2024-03-06 |
| 874 | JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent … |
Chang Sun; Hong Yang; Bo Qin; | ArXiv | 2024-03-04 |
| 875 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) … |
Joonas Kalda; Clément Pagés; R. Marxer; Tanel Alumäe; Hervé Bredin; | The Speaker and Language Recognition Workshop | 2024-03-04 |
| 876 | Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey IF:4 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This survey offers a comprehensive review of DTL, FL, and RL-based ASR frameworks, aiming to provide insights into the latest developments and aid researchers and professionals in understanding the current challenges. |
Hamza Kheddar; Mustapha Hemis; Yassine Himeur; | arxiv-cs.SD | 2024-03-02 |
| 877 | A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio … |
Tyler Benster; G. Wilson; Reshef Elisha; Francis R. Willett; S. Druckmann; | ArXiv | 2024-03-02 |
| 878 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. |
Heyang Liu; Yu Wang; Yanfeng Wang; | arxiv-cs.CL | 2024-03-01 |
| 879 | Towards Inclusive Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Siyuan Feng; B. Halpern; O. Kudina; O. Scharenborg; | Comput. Speech Lang. | 2024-03-01 |
| 880 | Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
Jeehyun Lee; Yerin Choi; Tae-Jin Song; Myoung-Wan Koo; | arxiv-cs.CL | 2024-02-29 |
| 881 | Probing The Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Following many researches in neural networks interpretability, we propose in this article a protocol that aims to determine which and where information is located in an ASR acoustic model (AM). |
Quentin Raymondaud; Mickael Rouvier; Richard Dufour; | arxiv-cs.SD | 2024-02-29 |
| 882 | Exploration of Adapter for Noise Robust Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study thoroughly investigates adapter-based ASR adaptation in noisy environments. |
Hao Shi; Tatsuya Kawahara; | arxiv-cs.SD | 2024-02-28 |
| 883 | Robust Speech Recognition Using Meta-Learning for Low-Resource Accents Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Robust accented speech recognition is a challenging task in the field of automatic speech recognition (ASR). Accurate recognition of low-resource accents can significantly improve … |
Dhanva Eledath; Arun Baby; Shatrughan Singh; | 2024 National Conference on Communications (NCC) | 2024-02-28 |
| 884 | A Multitask Co-training Framework for Improving Speech Translation By Leveraging Speech Recognition and Machine Translation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Yue Zhou; Yuxuan Yuan; Xiaodong Shi; | Neural Comput. Appl. | 2024-02-27 |
| 885 | Large Language Models Are Efficient Learners of Noise-Robust Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recent work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcriptions via efficient LLM finetuning, which shows great effectiveness but lacks specificity for noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER, as robust ASR models do, where one solution is to introduce noise information as a conditioner into the LLM. |
YUCHEN HU et. al. | iclr | 2024-02-26 |
| 886 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
Tzu-Ting Yang; Hsin-Wei Wang; Yi-Cheng Wang; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-02-26 |
| 887 | LipVoicer: Generating Speech from Silent Videos Guided By Lip Reading IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. |
Yochai Yemini; Aviv Shamsian; Lior Bracha; Sharon Gannot; Ethan Fetaya; | iclr | 2024-02-26 |
| 888 | It’s Never Too Late: Fusing Acoustic Information Into Large Language Models for Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). |
CHEN CHEN et. al. | iclr | 2024-02-26 |
| 889 | Efficient Data Selection Employing Semantic Similarity-based Graph Structures for Model Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent developments in natural language processing (NLP) have highlighted the need for substantial amounts of data for models to capture textual information accurately. This … |
Roxana Petcu; Subhadeep Maji; | ArXiv | 2024-02-22 |
| 890 | Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. |
Qiushi Zhu; Jie Zhang; Yu Gu; Yuchen Hu; Lirong Dai; | aaai | 2024-02-20 |
| 891 | Noise-Robust Multilingual Speech Recognition and The Tatar Speech Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: After focusing on individual languages for a long time, multilingual automatic speech recognition has recently become an active area of research. For instance, Whisper by OpenAI … |
SAIDA MUSSAKHOJAYEVA et. al. | 2024 International Conference on Artificial Intelligence in … | 2024-02-19 |
| 892 | OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). |
Yifan Peng; Yui Sudo; Muhammad Shakeel; Shinji Watanabe; | arxiv-cs.CL | 2024-02-19 |
| 893 | Phantom in The Opera: Adversarial Music Attack for Robot Dialogue System Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This study explores the vulnerability of robot dialogue systems’ automatic speech recognition (ASR) module to adversarial music attacks. Specifically, we explore music as a … |
Sheng Li; Jiyi Li; Yang Cao; | Frontiers Comput. Sci. | 2024-02-15 |
| 894 | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity IF:4 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | arxiv-cs.CL | 2024-02-13 |
| 895 | The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems’ performance in multilingual settings. |
Ajinkya Kulkarni; Anna Tokareva; Rameez Qureshi; Miguel Couceiro; | arxiv-cs.CL | 2024-02-12 |
| 896 | Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for … |
HEESEUNG KIM et. al. | Advances in Neural Information Processing Systems 37 | 2024-02-08 |
| 897 | Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et. al. | arxiv-cs.CL | 2024-02-08 |
| 898 | A Comprehensive Study of The Current State-of-the-Art in Nepali Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). |
Rupak Raj Ghimire; Bal Krishna Bal; Prakash Poudyal; | arxiv-cs.SD | 2024-02-05 |
| 899 | Digits Micro-model for Accurate and Secure Transactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. |
Chirag Chhablani; Nikhita Sharma; Jordan Hosier; Vijay K. Gurbani; | arxiv-cs.LG | 2024-02-02 |
| 900 | Streaming Sequence Transduction Through Dynamic Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. |
WEITING TAN et. al. | arxiv-cs.CL | 2024-02-02 |
| 901 | AccentFold: A Journey Through African Accents for Zero-Shot ASR Adaptation to Target Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). |
Abraham Toluwase Owodunni; Aditya Yadavalli; Chris Chinenye Emezue; Tobi Olatunji; Clinton C Mbataku; | arxiv-cs.CL | 2024-02-02 |
| 902 | Exploring The Limits of Decoder-only Models Trained on Public Speech Recognition Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. |
Ankit Gupta; George Saon; Brian Kingsbury; | arxiv-cs.CL | 2024-01-31 |
| 903 | Improving ASR Performance with OCR Through Using Word Frequency Difference IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recently, there has been a growing interest in conversational artificial intelligence (AI). As a result, research is actively being conducted on automatic speech recognition (ASR) … |
Kyudan Jung; Seungmin Bae; N. Kim; Hyun Gon Ryu; Hyuk-Jae Lee; | 2024 International Conference on Electronics, Information, … | 2024-01-28 |
| 904 | Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent research highlights the dependency of BPE subword tokenization’s efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. |
Ahnaf Mozib Samin; | arxiv-cs.CL | 2024-01-27 |
| 905 | Toward Practical Automatic Speech Recognition and Post-Processing: A Call for Explainable Error Benchmark Guideline Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. |
SEONMIN KOO et. al. | arxiv-cs.CL | 2024-01-25 |
| 906 | SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
CHYI-JIUNN LIN et. al. | arxiv-cs.CL | 2024-01-24 |
| 907 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
Jiajun He; Xiaohan Shi; Xingfeng Li; Tomoki Toda; | arxiv-cs.CL | 2024-01-24 |
| 908 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. RONNY HUANG et. al. | arxiv-cs.CL | 2024-01-23 |
| 909 | Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
Michael Hentschel; Yuta Nishikawa; Tatsuya Komatsu; Yusuke Fujita; | arxiv-cs.CL | 2024-01-22 |
| 910 | Using Large Language Model for End-to-End Chinese ASR and NER IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. |
YUANG LI et. al. | arxiv-cs.CL | 2024-01-20 |
| 911 | Speech Recognition Model Inspired on Large Language Model for Smart Grid Dispatching Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In recent years, large language models have gained popularity across various domains, with particular attention given to the impressive performance of their core component, the … |
QIONG-LAN NA et. al. | Proceedings of the 2024 International Conference on Power … | 2024-01-19 |
| 912 | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. |
Hao Wang; Shuhei Kurita; Shuichiro Shimizu; Daisuke Kawahara; | arxiv-cs.CV | 2024-01-18 |
| 913 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks, which we term bi-level joint unsupervised and supervised training (BL-JUST). |
A F M SAIF et. al. | arxiv-cs.CL | 2024-01-13 |
| 914 | LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
Fan Yu; Haoxu Wang; Xian Shi; Shiliang Zhang; | arxiv-cs.SD | 2024-01-12 |
| 915 | XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This research paper focuses on the development and evaluation of Automatic Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims to improve ASR … |
Panji Arisaputra; Alif Tri Handoyo; Amalia Zahra; | ArXiv | 2024-01-12 |
| 916 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. |
JIAXIN GUO et. al. | arxiv-cs.CL | 2024-01-11 |
| 917 | End to End Hindi to English Speech Conversion Using Bark, MBART and A Finetuned XLSR Wav2Vec2 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a … |
Aniket Tathe; Anand Kamble; Suyash Kumbharkar; Atharva Bhandare; Anirban C. Mitra; | ArXiv | 2024-01-11 |
| 918 | Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Objectives: We aimed to investigate how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy, specifically in the “Cookie Theft” picture description task. |
Changye Li; Weizhe Xu; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-01-10 |
| 919 | High-precision Voice Search Query Correction Via Retrievable Speech-text Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, ASR-hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the transcript truth. In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of an utterance’s audio and the embedding of its corresponding textual transcript close together. |
CHRISTOPHER LI et. al. | arxiv-cs.CL | 2024-01-08 |
| 920 | A New MmWave-Speech Multimodal Speech System for Voice User Interface Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Voice user interface (VUI) plays an essential role in intelligent scenes, e.g., smart homes. It provides a hands- and eyes-free human-machine interaction between humans and … |
Tiantian Liu; Feng Lin; | GetMobile: Mobile Computing and Communications | 2024-01-08 |
| 921 | An Audio-quality-based Multi-strategy Approach for Target Speaker Extraction in The MISP 2023 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2024-01-08 |
| 922 | Cross-Speaker Encoding Network for Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. |
JIAWEN KANG et. al. | arxiv-cs.SD | 2024-01-08 |
| 923 | ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. |
HE WANG et. al. | arxiv-cs.SD | 2024-01-07 |
| 924 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
He Wang; Pengcheng Guo; Pan Zhou; Lei Xie; | arxiv-cs.SD | 2024-01-07 |
| 925 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
KEVIN EVERSON et. al. | arxiv-cs.CL | 2024-01-05 |
| 926 | Research on The Application of Speech Database Based on Emotional Feature Extraction in International Chinese Education and Teaching Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The advanced analysis of the relationship between acoustic and emotional characteristics of speech signals can effectively improve the interactivity and intelligence of computers. … |
Xiangli Zhang; | Scalable Comput. Pract. Exp. | 2024-01-04 |
| 927 | An Approach for Speech Enhancement in Low SNR Environments Using Granular Speaker Embedding Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The proliferation of speech technology applications has led to an unprecedented demand for effective speech enhancement techniques, particularly in low Signal-to-Noise Ratio (SNR) … |
Jayasree Saha; Rudrabha Mukhopadhyay; A. Agrawal; Surabhi Jain; C. V. Jawahar; | Proceedings of the 7th Joint International Conference on … | 2024-01-04 |
| 928 | Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. |
Rita Frieske; Bertram E. Shi; | arxiv-cs.CL | 2024-01-03 |
| 929 | Improving Speech Recognition with Jargon Injection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper introduces a new method that improves the performance of Automatic speech recognition (ASR) engines, e.g., Whisper in practical cases. Different from prior methods that … |
MINH-TIEN NGUYEN et. al. | SIGDIAL Conferences | 2024-01-01 |
| 930 | Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
SAMI DHAHBI et. al. | Int. J. Interact. Multim. Artif. Intell. | 2024-01-01 |
| 931 | Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: An attention-based automatic speech recognition (ASR) model generates a probability distribution of the tokens set at each time step. Recent studies have shown that calibration … |
Mun-Hak Lee; Joon-hyuk Chang; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 932 | Effect of Modeling Glottal Activity Parameters on Zero-Shot Children’s ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The primary objective of this study is to enhance the recognition performance of zero-shot children’s automatic speech recognition (ASR) task. In such a setup, statistical models … |
Ankita; Shambhavi; S. Shahnawazuddin; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 933 | E³TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Text-based speech editing aims at manipulating part of real audio by modifying the corresponding transcribed text, without being discernible by human auditory system. With the … |
Zheng Liang; Ziyang Ma; Chenpeng Du; Kai Yu; Xie Chen; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 934 | MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
Xuankai Chang; Pengcheng Guo; Yuya Fujita; Takashi Maekaku; Shinji Watanabe; | IEEE Signal Process. Lett. | 2024-01-01 |
| 935 | Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched … |
Sei Ueno; Akinobu Lee; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 936 | BanSpeech: A Multi-Domain Bangla Speech Recognition Benchmark Toward Robust Performance in Challenging Conditions Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Despite huge improvements in automatic speech recognition (ASR) employing neural networks, ASR systems still suffer from a lack of robustness and generalizability issues due to … |
AHNAF MOZIB SAMIN et. al. | IEEE Access | 2024-01-01 |
| 937 | MoE-SLU: Towards ASR-Robust Spoken Language Understanding Via Mixture-of-Experts Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: As a crucial task in the task-oriented dialogue systems, spoken language understanding (SLU) has garnered increasing attention. However, errors from automatic speech recognition … |
XUXIN CHENG et. al. | Annual Meeting of the Association for Computational … | 2024-01-01 |
| 938 | Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper proposes a novel meta-transfer learning method to improve automatic speech recognition (ASR) performance in low-resource languages. Nowadays, we are witnessing high … |
Rui Zhou; Takaki Koshikawa; Akinori Ito; Takashi Nose; Chia-Ping Chen; | IEEE Access | 2024-01-01 |
| 939 | Keyword Guided Target Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This letter presents a new target speech recognition problem, where the target speech is defined by a keyword. For instance, when a person speaks “Hey Google” or “Help Me”, we … |
Yinghao Shi; Lantian Li; Dong Wang; Jiqing Han; | IEEE Signal Processing Letters | 2024-01-01 |
| 940 | ASQ: An Ultra-Low Bit Rate ASR-Oriented Speech Quantization Method Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: For efficient transmission of speech signals, speech compression methodologies have attracted significant research attention for decades and are widely used in automatic speech … |
Lingxuan Ye; Changfeng Gao; Gaofeng Cheng; Liuping Luo; Qingwei Zhao; | IEEE Signal Processing Letters | 2024-01-01 |
| 941 | Enhancing Automatic Speech Recognition With Personalized Models: Improving Accuracy Through Individualized Fine-Tuning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) systems have become increasingly popular in recent years due to their ability to convert spoken language into text. Nonetheless, despite their … |
V. BRYDINSKYI et. al. | IEEE Access | 2024-01-01 |
| 942 | Chinese Spoken Named Entity Recognition in Real-world Scenarios: Dataset and Approaches Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Spoken Named Entity Recognition (NER) aims to extract entities from speech. The extracted entities can help voice assistants better understand user’s questions and … |
SHILIN ZHOU et. al. | Annual Meeting of the Association for Computational … | 2024-01-01 |
| 943 | Arabic Speech Recognition: Advancement and Challenges IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Speech recognition is a captivating process that revolutionizes human-computer interactions, allowing us to interact and control machines through spoken commands. The foundation … |
ASHIFUR RAHMAN et. al. | IEEE Access | 2024-01-01 |
| 944 | Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the accuracy and robustness of speech recognition systems with the assistance of visual cues in … |
Jiahong Li; Chenda Li; Yifei Wu; Yanmin Qian; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 945 | Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We investigate state-of-the-art automatic speech recognition (ASR) systems and provide thorough investigations on training methods to adapt them to low-resourced electrolaryngeal … |
Lester Phillip Violeta; D. Ma; Wen-Chin Huang; T. Toda; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 946 | Exploring Native and Non-Native English Child Speech Recognition With Whisper Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Modern end-to-end Automatic Speech Recognition (ASR) systems struggle to recognise children’s speech. This challenge is due to the high acoustic variability in children’s voices … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | IEEE Access | 2024-01-01 |
| 947 | Fine-Tuning ASR Models for Very Low-Resource Languages: A Study on Mvskoke Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recent advancements in multilingual models for automatic speech recognition (ASR) have been able to achieve a high accuracy for languages with extremely limited resources. This … |
Julia Mainzinger; Gina-Anne Levow; | Annual Meeting of the Association for Computational … | 2024-01-01 |
| 948 | Dynamic Sampling-Based Meta-Learning Using Multilingual Acoustic Data for Under-Resourced Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Under-resourced automatic speech recognition (ASR) has become an active field of research and has experienced significant progress during the past decade. However, the performance … |
I-Ting Hsieh; Chung-Hsien Wu; Zhe-Hong Zhao; | IEEE Access | 2024-01-01 |
| 949 | Dysarthric Speech Recognition Using Pseudo-Labeling, Self-Supervised Feature Learning, and A Joint Multi-Task Learning Approach Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this paper, we investigate the use of the spontaneous speech of dysarthric people for training an automatic speech recognition (ASR) model for them. Although the spontaneous … |
R. Takashima; Yuya Sawa; Ryo Aihara; Tetsuya Takiguchi; Yoshie Imai; | IEEE Access | 2024-01-01 |
| 950 | ESAformer: Enhanced Self-Attention for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this letter, an Enhanced Self-Attention (ESA) module has been put forward for feature extraction. The proposed ESA is integrated with the recursive gated convolution and … |
Junhua Li; Zhikui Duan; Shiren Li; Xinmei Yu; Guangguang Yang; | IEEE Signal Processing Letters | 2024-01-01 |
| 951 | Explainability of Speech Recognition Transformers Via Gradient-Based Attention Visualization Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Abstract: In vision Transformers, attention visualization methods are used to generate heatmaps highlighting the class-corresponding areas in input images, which offers explanations on how … |
Tianli Sun; Haonan Chen; Guosheng Hu; Lianghua He; Cairong Zhao; | IEEE Transactions on Multimedia | 2024-01-01 |
| 952 | Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance in many datasets, spectrogram-based SE … |
Hao Shi; M. Mimura; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
| 953 | Semantic Role Labeling from Chinese Speech Via End-to-End Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Semantic Role Labeling (SRL), crucial for understanding semantic relationships in sentences, has traditionally focused on text-based input. However, the increasing use of voice … |
Huiyao Chen; Xinxin Li; Meishan Zhang; Min Zhang; | Annual Meeting of the Association for Computational … | 2024-01-01 |
| 954 | Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. … |
Yukun Ma; Chong Zhang; Qian Chen; Wen Wang; Bin Ma; | IEEE Signal Processing Letters | 2024-01-01 |
| 955 | Adversarial Attacks on Automatic Speech Recognition (ASR): A Survey IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic Speech Recognition (ASR) systems have improved and eased how humans interact with devices. ASR system converts an acoustic waveform into the relevant text form. Modern … |
Amisha Bhanushali; Hyunjun Mun; J. Yun; | IEEE Access | 2024-01-01 |
| 956 | NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: With the advancement of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) technologies, high-quality speech conversion can now be achieved by extracting source speech … |
Danyang Cao; Zeyi Zhang; Jinyuan Zhang; | IEEE Signal Processing Letters | 2024-01-01 |
| 957 | Group Fairness in Multilingual Speech Recognition Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: We evaluate the performance disparity of the Whisper and MMS families of ASR models across the VoxPopuli and Common Voice multilingual datasets, with an eye toward … |
Anna Zee; Marc Zee; Anders Søgaard; | NAACL-HLT | 2024-01-01 |
| 958 | Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address … |
Yuriy Perezhohin; Tiago Santos; Victor Costa; Fernando Peres; Mauro Castelli; | IEEE Access | 2024-01-01 |
| 959 | Joint Speech-Text Embeddings for Multitask Speech Processing Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Devices that use speech as the communication medium between human and computer have been emerging for the past few years. The technologies behind this interface are called … |
Michael Gian Gonzales; Peter Corcoran; Naomi Harte; Michael Schukat; | IEEE Access | 2024-01-01 |
| 960 | Mel-Scale Frequency Extraction and Classification of Dialect-Speech Signals With 1D CNN Based Classifier for Gender and Region Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Humans communicate and interact through natural languages, such as American English (AE), Taiwanese, Italian, and numerous variants of Spanish. Through automatic speech analysis … |
HSIANG-YUEH LAI et. al. | IEEE Access | 2024-01-01 |
| 961 | Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training … |
Geon Woo Lee; Hong Kook Kim; Duk-Jo Kong; | IEEE Access | 2024-01-01 |
| 962 | Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
HUIMENG WANG et. al. | arxiv-cs.SD | 2023-12-31 |
| 963 | KEBAP: Korean Error Explainable Benchmark Dataset for ASR and Post-processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conventional evaluation metrics for ASR systems produce a singular aggregate score, which is insufficient for understanding specific system vulnerabilities. Therefore, we aim to address the limitations of the previous ASR evaluation methods by introducing the Korean Error Explainable Benchmark Dataset for ASR and Post-processing (KEBAP). |
SEONMIN KOO et. al. | emnlp | 2023-12-22 |
| 964 | Accented Speech Recognition With Accent-specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. |
Darshan Prabhu; Preethi Jyothi; Sriram Ganapathy; Vinit Unni; | emnlp | 2023-12-22 |
| 965 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | emnlp | 2023-12-22 |
| 966 | CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unfortunately, the availability of datasets for this is limited. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. |
Zishan Guo; Linhao Yu; Minghui Xu; Renren Jin; Deyi Xiong; | emnlp | 2023-12-22 |
| 967 | Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we attempt to resolve structurally ambiguous utterances into unambiguous texts in Indonesian using prosodic information. |
RUHIYAH WIDIAPUTRI et. al. | emnlp | 2023-12-22 |
| 968 | CLAD-ST: Contrastive Learning with Adversarial Data for Robust Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We address this robustness problem in downstream MT models by forcing the MT encoder to bring the representations of a noisy input closer to its clean version in the semantic space. This is achieved by introducing a contrastive learning method that leverages adversarial examples in the form of ASR outputs paired with their corresponding human transcripts to optimize the network parameters. |
Sathish Indurthi; Shamil Chollampatt; Ravi Agrawal; Marco Turchi; | emnlp | 2023-12-22 |
| 969 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose an approach, that builds on a pre-trained ASR model and extends it with an adaptive upstream module, that fuses audio and visual information. |
Christopher Simic; Tobias Bocklet; | arxiv-cs.SD | 2023-12-21 |
| 970 | Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning. |
ANIRUDH S. SUNDAR et. al. | arxiv-cs.LG | 2023-12-21 |
| 971 | SpokesBiz — An Open Corpus of Conversational Polish Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We outline the general structure and content of the corpus, showcasing selected applications in linguistic research, evaluation and improvement of automatic speech recognition (ASR) systems. |
PIOTR PĘZIK et. al. | arxiv-cs.CL | 2023-12-19 |
| 972 | SpokesBiz – An Open Corpus of Conversational Polish Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This paper announces the early release of SpokesBiz, a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and comprising over 650 hours of … |
PIOTR PEZIK et. al. | ArXiv | 2023-12-19 |
| 973 | Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this … |
Peng Shen; Xugang Lu; Hisashi Kawai; | ArXiv | 2023-12-18 |
| 974 | Towards Robust Packet Loss Concealment System With ASR-Guided Representations Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Despite the significant advancements and promising performance of deep learning-based packet loss concealment (PLC) systems in transmission systems, their focus on modeling … |
Dali Yang; Joon-Hyuk Chang; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 975 | Gated Multi Encoders and Multitask Objectives for Dialectal Speech Recognition in Indian Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In this work, several methods have been proposed towards improving the performance of dialectal automatic speech recognition (ASR). A novel encoder architecture has been … |
SATHVIK UDUPA et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 976 | Ending The Blind Flight: Analyzing The Impact of Acoustic and Lexical Factors on WAV2VEC 2.0 in Air-Traffic Control Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Transformer neural networks have shown remarkable success on standard automatic speech recognition (ASR) benchmarks. However, they are known to be less robust against domain … |
Alexander Blatt; Badr M. Abdullah; D. Klakow; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 977 | Hierarchical Attention-Based Contextual Biasing For Personalized Speech Recognition Using Neural Transducers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Although end-to-end (E2E) automatic speech recognition (ASR) systems excel in general tasks, they frequently struggle with accurately recognizing personal rare words. Leveraging … |
Sibo Tong; Philip Harding; Simon Wiesler; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 978 | Parameter-Efficient Cross-Language Transfer Learning for A Language-Modular Audiovisual Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In audiovisual speech recognition (AV-ASR), for many languages only few audiovisual data is available. Building upon an English model, in this work, we first apply and analyze … |
ZHENGYANG LI et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 979 | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Transfer learning from large multilingual pretrained models, like XLSR, has become the new paradigm for Automatic Speech Recognition (ASR). Considering their ever-increasing size, … |
GEOFFROY VANDERREYDT et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 980 | Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Much of the recent progress in automatic speech recognition (ASR) lies in developing an acoustic encoder, such as enlarging its capacity and designing a refined architecture for … |
Yosuke Higuchi; Andrew Rosenberg; Yuan Wang; M. Baskar; B. Ramabhadran; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 981 | Conformer-Based Speech Recognition On Extreme Edge-Computing Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. |
MINGBIN XU et. al. | arxiv-cs.LG | 2023-12-16 |
| 982 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset In Self-Supervised Models for Low-Resource ASR Task IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Indonesia is home to roughly 700 languages, which amounts to about ten percent of the global total, positioning it as the second-most linguistically diverse country after Papua … |
S. Sakti; Benita Angela Titalim; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 983 | Not All Errors Are Created Equal: Evaluating The Impact of Model and Speaker Factors on ASR Outcomes in Clinical Populations Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Pathological speech analysis with Automatic Speech Recognition (ASR) is a long-standing research domain based on the expectation that there is a close relationship between … |
D. WIEPERT et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 984 | Can Unpaired Textual Data Replace Synthetic Speech in ASR Model Adaptation? Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: To boost training and adaptation of end to end (E2E) automatic speech recognition (ASR) models, several approaches to use paired speech-text input together with unpaired text … |
Pasquale D’Alterio; Christian Hensel; Bashar Awwad Shiekh Hasan; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
| 985 | IFF-WAV2VEC: Noise Robust Low-Resource Speech Recognition Based on Self-supervised Learning and Interactive Feature Fusion Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: In recent years, self-supervised learning representation (SSLR) has shown remarkable performance in low-resource speech recognition. However, it lacks consideration for the … |
Jing Cao; Zhaopeng Qian; Chongchong Yu; Tao Xie; | Proceedings of the 2023 6th Artificial Intelligence and … | 2023-12-16 |
| 986 | Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most past studies have simplified the learning complexity of the model by splitting the code-switching task into multiple tasks dealing with a single language and then learning the domain-specific knowledge of each language separately. Therefore, in this paper, we attempt to introduce language identification information into the middle layer of the ASR model’s encoder. |
Tzu-Ting Yang; Hsin-Wei Wang; Berlin Chen; | arxiv-cs.CL | 2023-12-15 |
| 987 | Automatic Channel Selection and Spatial Feature Integration for Multi-channel Speech Recognition Across Various Array Topologies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
BINGSHEN MU et. al. | arxiv-cs.SD | 2023-12-15 |
| 988 | LiteVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
HENDRIK LAUX et. al. | arxiv-cs.CV | 2023-12-15 |
| 989 | Knowledge Prompt for Whisper: An ASR Entity Correction Approach with Knowledge Base Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Entity correction is crucial in Automatic Speech Recognition (ASR), since erroneous entities seriously affect our understanding of ASR results. In this paper, in order to … |
MIN ZHANG et. al. | 2023 IEEE International Conference on Big Data (BigData) | 2023-12-15 |
| 990 | Improvement of Automatic Speech Recognition Systems Utilizing 2D Adaptive Wavelet Transformation Applied to Recurrence Plot of Speech Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Save |
S. Firooz; F. Almasganj; Yasser Shekofteh; | Signal, Image and Video Processing | 2023-12-15 |
| 991 | On The Compression of Shallow Non-causal ASR Models Using Knowledge Distillation and Tied-and-reduced Decoder for Low-latency On-device Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. |
NAGARAJ ADIGA et. al. | arxiv-cs.SD | 2023-12-15 |
| 992 | Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recently audio-visual speech recognition (AVSR), which better leverages video modality as additional information to extend automatic speech recognition (ASR), has shown promising … |
Fan Yu; Haoxu Wang; Ziyang Ma; Shiliang Zhang; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
| 993 | FastInject: Injecting Unpaired Text Data Into CTC-Based ASR Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the … |
Keqi Deng; Phil Woodland; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
| 994 | Towards Automatic Data Augmentation for Disordered Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data … |
ZENGRUI JIN et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
| 995 | Extending Whisper with Prompt Tuning to Target-speaker ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
Hao Ma; Zhiyuan Peng; Mingjie Shao; Jing Li; Ju Liu; | arxiv-cs.CL | 2023-12-13 |
| 996 | ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, a time-domain recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and also advance ASR accuracy based on convolutional encoder-decoder-based U-Net framework, which serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model. |
Xincheng Yu; Dongyue Guo; Jianwei Zhang; Yi Lin; | arxiv-cs.SD | 2023-12-10 |
| 997 | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. |
Wonjun Lee; Gary Geunbae Lee; Yunsu Kim; | arxiv-cs.CL | 2023-12-06 |
| 998 | Leveraging Cross Lingual Speech Representations To Build ASR For Under-resourced Languages Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Automatic speech recognition (ASR) technology can help document and preserve under-resourced tribal languages by converting spoken words into text. But building robust Automatic … |
Sougata Mukherjee; Prashant Bannulmath; D. K T; S. R. M. Prasanna; | 2023 26th Conference of the Oriental COCOSDA International … | 2023-12-04 |
| 999 | Taiwanese Hakka Across Taiwan Corpus and Formosa Speech Recognition Challenge 2023 – Hakka ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: To revive the endangered Taiwanese Hakka language, the first large-scale Taiwanese Hakka speech corpus across Taiwan (HAT) was developed, representing modern Taiwanese Hakka … |
YUAN-FU LIAO et. al. | 2023 26th Conference of the Oriental COCOSDA International … | 2023-12-04 |
| 1000 | Speech Recognition Applications in Enhancing Safety for Women in Built Environment Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: This research delves into the potential of speech recognition technology to enhance the safety of women in urban environments. Given the increasing concerns about safety for women … |
Mani Gupta; Rashmi Ashtt; Monali Wankar; Ajay Monga; | 2023 26th Conference of the Oriental COCOSDA International … | 2023-12-04 |