Abstract
Over the past decade, the field of speech generation has made significant progress in speech quality and naturalness. Despite these advances, challenges persist, including speech noise, limited availability of high-quality data, and a lack of robustness in speech generation systems. Evaluating generated speech comprehensively and at scale also remains a significant obstacle. Concurrently, recent breakthroughs in Large Language Models (LLMs) have revolutionized text generation and natural language processing. However, the complexity of spoken language introduces unique hurdles, including managing long speech waveform sequences. In this presentation, I will explore recent innovations in speech synthesis with spoken language modeling, evaluation of generative speech systems, and high-fidelity speech enhancement. Finally, I will discuss prospective avenues for future research aimed at addressing these challenges.
Bio
Soumi Maiti is a postdoctoral researcher at the Language Technologies Institute, Carnegie Mellon University, where she works on speech and language processing. Her research broadly focuses on building intelligent systems that can communicate with humans naturally. She earned a Ph.D. from the Graduate Center, City University of New York (CUNY), with the Graduate Center Fellowship, advised by Prof. Michael Mandel. She earned her B.Tech. in Computer Science from the Indian Institute of Engineering Science and Technology, Shibpur. Previously, she worked on the Text-To-Speech team at Apple. She has also worked at Google and Interactions LLC as a student researcher and research intern. She worked as an adjunct lecturer at Brooklyn College, CUNY, for three years and served as a Math Fellow at Hunter College. She has served as session chair at ICASSP 2024, ICASSP 2023, SLT 2023, and others, and as area chair at EMNLP 2023.
Hackerman Hall B17 @ 3400 N. Charles Street, Baltimore, MD 21218