Aligning Video and Textual Sequences

NeuMATCH is the first end-to-end neural network for matching multimodal sequences (e.g., text and video). The industrial movie production pipeline produces a script and a video, but no correspondence between the two modalities. Aligning the video sequence with the text sequence establishes this correspondence, which lays the groundwork for computational understanding of movie content.

End-to-end training is great, but is it a panacea? In this paper, published at WACV 2021, we show that naive end-to-end training of a complex network like NeuMATCH is inefficient. We propose aligning the pace of training and the feature distributions across network components to improve training.

Image and Video Captioning

