Hot Topics in Automatic Speech Recognition

  • Course Program

Module 1: End-to-End Automatic Speech Recognition

The field of automatic speech recognition (ASR) is currently led by end-to-end (E2E) models that directly convert spoken input into textual output. This course offers a comprehensive overview of E2E ASR models and highlights recent advancements in the domain. To achieve both high accuracy and low latency, the presenter will describe the application of masking strategies to Transformer Transducer architectures. The presentation will also cover technologies leveraging text-only data for general model training as well as methods for adapting models to new domains via augmentation and factorization. Further, the course will address E2E modeling for complex multi-speaker ASR through serialized output training. The extension of learning in E2E ASR to areas beyond speech recognition, such as speech translation and the development of robust speech foundation models, will also be explored. The course will conclude with an examination of the latest progress in large-language-model-based speech systems.

Module 2: Automatic Meeting Transcription

Automatic meeting transcription is concerned with transcribing conversations, enriched with information about who spoke when. This is a challenging task because the speech signal captured by distant microphones is noisy and reverberant and, depending on the nature of the meeting, can contain a high degree of overlapped speech, where more than one speaker is active at a time. The interaction dynamics, where speakers talk intermittently, also pose problems for conventional enhancement and recognition systems. Multi-talker meeting transcription thus calls for solving several tasks: source separation, diarization, and speech recognition. We will discuss approaches that address those tasks either separately or jointly. We will also touch upon "ad-hoc" configurations, where several initially unsynchronized microphones at unknown positions are used for signal capture.

Module 3: Inclusive Speech Technology

Automatic speech recognition (ASR) is increasingly used, e.g., in emergency response centers, domestic voice assistants, and search engines. Because of the paramount role spoken language plays in our lives, it is critical that ASR systems are able to deal with the variability in the way people speak (e.g., due to speaker differences, demographics, different speaking styles, and differently abled users). ASR systems promise to deliver objective interpretation of human speech. Practice and recent evidence, however, suggest that state-of-the-art ASR systems struggle with the large variation in speech due to, e.g., gender, age, speech impairment, race, and accents. The overarching goal of our research is to uncover bias in ASR systems and to work towards proactive bias mitigation in ASR. In this talk, I will present systematic experiments aimed at quantifying, identifying the origin of, and mitigating the bias of state-of-the-art ASR systems on speech from different, typically low-resource, groups of speakers, with a focus on bias related to gender, age, regional accents, and non-native accents.

Instructors

Jinyu Li

Jinyu Li received the B.E. and M.E. degrees in electrical engineering and information system from University of Science and Technology of China, Hefei, China, in 1997 and 2000, respectively. He received the Ph.D. degree in electrical and computer engineering from Georgia Institute of Technology, Atlanta, GA, USA in 2008.

Dr. Li is currently the Partner Applied Science Manager at Microsoft in Redmond, WA, USA, where he oversees a scientific team focused on developing and advancing speech modeling algorithms and technologies. Dr. Li is an IEEE Fellow, for contributions to deep-learning-based speech technology innovation and commercialization. Dr. Li served as a member of the IEEE Speech and Language Processing Technical Committee from 2018 to 2023 and as Vice Chair starting in 2026. He was an Associate Editor for the IEEE/ACM Transactions on Audio, Speech and Language Processing from 2015 to 2020 and acted as Technical Program Chair for IEEE SLT in 2021 and 2026, as well as IEEE ASRU in 2023 and 2025. He is a Distinguished Industry Speaker of the IEEE Signal Processing Society in 2025. He also received the IEEE SPS Best Paper Award in 2025. Additionally, Dr. Li was honored as the Industrial Distinguished Leader at the Asia-Pacific Signal and Information Processing Association (APSIPA) in 2021 and received the APSIPA Sadaoki Furui Prize Paper Award in 2023.

Reinhold Haeb-Umbach

Reinhold Haeb-Umbach is a professor of Communications Engineering at Paderborn University, Germany. He holds a Ph.D. from RWTH Aachen University, Germany, and worked in industrial research labs for more than ten years before joining Paderborn University in 2001. From 2015 to 2020 he was a member of the IEEE Speech and Language Technical Committee, and since 2022 he has been a member of the IEEE Audio and Acoustics Signal Processing Technical Committee. He is a fellow of the International Speech Communication Association (ISCA) and of the IEEE. His main research interests are in the fields of statistical signal processing and machine learning, with applications to speech enhancement, automatic speech recognition, and unsupervised learning from speech and audio.

Odette Scharenborg

Odette Scharenborg is a Full Professor of Inclusive Speech Communication and head of the Delft Inclusive Speech Communication (DISC) lab, which is part of the Multimedia Computing Group at Delft University of Technology, the Netherlands. Her research aims to develop inclusive speech technology, i.e. making speech technology available for everyone irrespective of how they speak or what language they speak. In her research, Odette considers technical aspects as well as ethical and societal aspects. She is interested in anything and everything speech, ranging from human to automatic speech processing.

From 2017 to 2025, Odette was on the Board of the International Speech Communication Association (ISCA), the largest international society on speech science and technology. She served as ISCA Vice-President from 2021 to 2023 and as President from 2023 to 2025. In 2025, she was the General Chair of Interspeech 2025. From 2018 to 2022, Odette was a member of the IEEE Speech and Language Processing Technical Committee (subarea Speech Production and Perception). From 2019 to 2023, she served as a (Senior) Associate Editor of IEEE Signal Processing Letters.

Publication Year: 2026


  • Course Provider: Signal Processing Society
  • Course Number: SPSILN004
  • Credits: 0.3 CEU / 3 PDH