🎉

[CVPR’23] One paper has been accepted

One paper has been accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023. CVPR is the premier annual computer vision event.

This paper has been invited for a paper talk at the Workshop on Sound and Sight, held in conjunction with CVPR 2023. It is also presented at the AI for Content Creation (AI4CC) Workshop, held in conjunction with CVPR 2023.

Title: Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Authors: Kim Sung-Bin (POSTECH), Arda Senocak (KAIST), Hyunwoo Ha (POSTECH), Andrew Owens (University of Michigan), Tae-Hyun Oh (POSTECH)

How does audio describe the world around us? In this paper, we explore the task of generating an image of the visual scenery that a sound comes from. This task has inherent challenges, such as a significant modality gap between audio and visual signals, and incongruent audio-visual pairs. We propose a self-supervised model that schedules the learning procedure of each model component to associate these heterogeneous modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to the visual latent space. We thereby translate input audio into a visual feature, which a powerful pre-trained generator then turns into an image. We further incorporate a method for selecting highly correlated audio-visual pairs to stabilize training. As a result, our method demonstrates substantially better quality across a large number of categories on the VEGAS and VGGSound datasets, compared to prior work on sound-to-image generation. In addition, we show the spontaneously learned output controllability of our method by applying simple manipulations to the input in the waveform space or latent space.
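The alignment idea in the abstract can be illustrated with a small sketch. The announcement does not say which loss the paper uses, so the InfoNCE-style contrastive objective below is an assumption, chosen because it is a common way to pull paired embeddings together. The function names, the `temperature` value, and the toy 2-D features are all hypothetical and are not taken from the authors' code.

```python
import math

def normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity between two vectors.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def alignment_loss(audio_feats, visual_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    audio_feats[i] and visual_feats[i] are treated as a matching pair;
    every other visual feature in the batch acts as a negative.
    """
    total = 0.0
    for i, a in enumerate(audio_feats):
        logits = [cosine(a, v) / temperature for v in visual_feats]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true pair.
        total += log_denom - logits[i]
    return total / len(audio_feats)

# Toy example: two visual latents and two candidate sets of audio features.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # audio features close to their pairs
swapped = [[0.1, 0.9], [0.9, 0.1]]   # audio features close to the wrong pairs

loss_aligned = alignment_loss(aligned, visual)
loss_swapped = alignment_loss(swapped, visual)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is lower when each audio feature lies near its paired visual latent. That is the property an audio encoder would be trained toward before its output is passed to a pre-trained image generator.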