Abstract
Over the past three decades, the fields of automatic speech recognition (ASR) and machine translation (MT) have witnessed remarkable advancements, leading to exciting research directions such as speech-to-text translation (ST). This talk will delve into the domain of conversational ST, an essential facet of daily communication, which presents unique challenges including spontaneous informal language, disfluencies, high context dependence, and a scarcity of paired ST data.
Conversational speech is notably characterized by its reliance on short segments, requiring the integration of broader context to maintain consistency and improve the translation's fluency and quality. Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. Previous approaches have used simple concatenation of audio inputs as context, leading to memory bottlenecks, especially in self-attention networks, due to the encoding of lengthy audio segments.
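To make the memory issue concrete, the following is a minimal back-of-the-envelope sketch (not taken from the talk; the frame rate, utterance length, and context size are assumed purely for illustration) of how the self-attention score matrix grows when previous utterances are concatenated as raw audio frames.

```python
# Illustrative sketch: rough memory estimate for the self-attention score
# matrix when context utterances are concatenated as audio frames.
# The frame rate and context size below are hypothetical.

def attention_score_bytes(num_frames: int, bytes_per_value: int = 4) -> int:
    """Self-attention builds an (L x L) score matrix, so memory grows
    quadratically with the number of encoder frames L."""
    return num_frames * num_frames * bytes_per_value

# A 10 s utterance at a typical 100 frames/s filterbank rate ~ 1,000 frames.
single = attention_score_bytes(1_000)

# Concatenating four previous utterances as context ~ 5,000 frames.
with_context = attention_score_bytes(5_000)

print(f"single utterance:         {single / 1e6:,.0f} MB per head")
print(f"with 4-utterance context: {with_context / 1e6:,.0f} MB per head")
# The score matrix alone grows ~25x, which is why naive audio concatenation
# quickly becomes a memory bottleneck in self-attention encoders.
```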
First, I will describe how to integrate context into E2E-ST with minimal additional memory cost. Then, I will discuss the challenges of incorporating context in an E2E-ST system with limited data during training and inference, and propose solutions to overcome them. Afterward, I will illustrate the impact of context size and the inclusion of speaker information on performance. Lastly, I will demonstrate the benefits of context in conversational settings, focusing on aspects such as anaphora resolution and the identification of named entities.
Hackerman Hall B17 @ 3400 N. Charles Street, Baltimore, MD 21218