🎉

[ICCV’23] 3 main and 4 workshop papers have been accepted

Tags: Academic
Time: 2023/07/17

Three papers have been accepted to ICCV 2023

The International Conference on Computer Vision (ICCV) is the premier international computer vision event, comprising the main conference and several co-located workshops and tutorials.
Title: Scratching Visual Transformer’s Back with Uniform Attention
Authors: Nam Hyeon-Woo (POSTECH), Kim Yu-Ji (POSTECH), Byeongho Heo (NAVER AI LAB), Dongyoon Han (NAVER AI Lab), Seong Joon Oh (University of Tübingen), Tae-Hyun Oh (POSTECH)
The favorable performance of Vision Transformers (ViTs) is often attributed to multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works attribute the effectiveness of MSA to its long-range dependency. In this work, we study the role of MSA along a different axis: density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are closer to dense interactions than to sparse ones. This is a curious phenomenon, because dense attention maps are harder for the model to learn due to the softmax. We interpret this behavior against the softmax as a strong preference of ViT models for dense interactions. We thus manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions, and call this method Context Broadcasting (CB). Our study demonstrates that including CB takes over the role of dense attention, thereby reducing the degree of density in the original attention maps so that they comply with the softmax in MSA. We also show that, at the negligible cost of CB (one line in your model code and no additional parameters), both the capacity and generalizability of ViT models are increased.
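As a rough illustration of how lightweight such a dense interaction can be, the following is a minimal sketch of a uniform-attention (context broadcasting) step applied to a token sequence. The function name, tensor shapes, and insertion point are assumptions for illustration, not the paper's reference implementation.

```python
import torch

def context_broadcasting(x: torch.Tensor) -> torch.Tensor:
    """Add a uniform-attention (mean-pooled) context to every token.

    x: token embeddings of shape (batch, num_tokens, dim).
    A hypothetical sketch of a dense, parameter-free interaction; the paper's
    exact placement and normalization may differ.
    """
    return x + x.mean(dim=1, keepdim=True)

# Possible usage inside a (hypothetical) ViT block's forward pass:
# x = x + self.attn(self.norm1(x))
# x = context_broadcasting(x)          # the "one line" dense interaction
# x = x + self.mlp(self.norm2(x))
```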
Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
Authors: Moon Ye-bin (POSTECH), Jisoo Kim (Columbia University), Hongyeob Kim (Sungkyunkwan University), Kilho Son (Microsoft Azure), Tae-Hyun Oh (POSTECH)
Recent label mix-based augmentation methods have shown their effectiveness in generalization despite their simplicity, and their favorable effects are often attributed to semantic-level augmentation. However, we found that they are vulnerable to highly skewed class distributions, because scarce-data classes are rarely sampled for inter-class perturbation. We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces regardless of data distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand, visually mimetic words, i.e., attributes. To this end, we bridge the text representation and a target visual feature space, and propose an efficient vector augmentation. To empirically support the validity of our design, we devise two visualization-based analyses and show the plausibility of the bridge between the two different modality spaces. Our experiments demonstrate that TextManiA is powerful for scarce samples under class imbalance as well as under even distributions. We also show compatibility with label mix-based approaches on evenly distributed scarce data.
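For intuition only, here is a hedged sketch of the kind of text-driven vector augmentation described above, assuming a text encoder that maps prompts into the same space as the visual features (the `encode_text` callable, the prompt templates, and the scaling factor are hypothetical, not TextManiA's exact bridging procedure).

```python
def textmania_augment(visual_feat, class_name, attribute, encode_text, alpha=0.5):
    """Perturb a visual feature along a text-derived attribute direction.

    visual_feat : feature vector in the target visual feature space.
    encode_text : hypothetical text encoder mapping a prompt into that space.
    alpha       : perturbation strength (assumed hyperparameter).
    """
    base = encode_text(f"a photo of a {class_name}")
    enriched = encode_text(f"a photo of a {attribute} {class_name}")
    direction = enriched - base            # intra-class semantic perturbation
    return visual_feat + alpha * direction
```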
Title: Sound Source Localization is All about Cross-Modal Alignment
Authors: Arda Senocak (KAIST), Hyeonggon Ryu (KAIST), Junsik Kim (Harvard Univ.), Tae-Hyun Oh (POSTECH), Hanspeter Pfister (Harvard Univ.), Joon Son Chung (KAIST)
Humans can easily perceive the direction of sound sources in a visual scene, a capability termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior work and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding is important for handling semantically mismatched audio-visual events, e.g., silent objects or off-screen sounds. To account for this, we propose a cross-modal alignment task as a joint task with sound source localization to better learn the interaction between audio and visual modalities. Thereby, we achieve high localization performance with strong cross-modal semantic understanding. Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary to conquer genuine sound source localization.
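For readers unfamiliar with cross-modal alignment, a generic audio-visual contrastive objective of the kind the abstract refers to is sketched below. This is a standard symmetric InfoNCE formulation, not necessarily the paper's exact loss or training setup.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE between paired audio and visual embeddings.

    audio_emb, visual_emb: (batch, dim) projections into a shared space.
    A generic sketch; the paper's joint localization/alignment objective may differ.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```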

And four papers have been accepted at ICCV Workshop (ICCVw)!

AI3DCC: AI for 3D Content Creation

Title: Text-driven Human Avatar Generation by Neural Re-parameterized Texture Optimization
Authors: Kim Youwang (POSTECH), Tae-Hyun Oh (POSTECH)
We present TexAvatar, a text-driven human texture generation system for creative human avatar synthesis. Despite the huge progress in text-driven human avatar generation methods, efficiently modeling high-quality human appearance remains challenging. With our proposed neural re-parameterized texture optimization, TexAvatar generates a high-quality UV texture in 30 minutes, given only a text description. The generated UV texture can be easily superimposed on animatable human meshes without further processing. This is distinctive in that prior works generate volumetric textured avatars that require cumbersome rigging processes to animate. We demonstrate that TexAvatar produces human avatars of favorable quality at faster speed compared to recent competing methods.
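For intuition, a heavily simplified sketch of what a neural re-parameterized texture optimization loop could look like: the UV texture is produced by a small network over UV coordinates, and the network's weights are optimized against a text-driven guidance loss. The `render` and `text_guidance_loss` callables are hypothetical placeholders, not TexAvatar's actual pipeline.

```python
import torch
import torch.nn as nn

class TextureField(nn.Module):
    """Re-parameterize a UV texture as a small MLP over UV coordinates (a sketch)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),    # RGB in [0, 1]
        )

    def forward(self, uv):                          # uv: (N, 2) in [0, 1]^2
        return self.net(uv)

def optimize_texture(texture_field, render, text_guidance_loss, prompt, steps=1000):
    """Hypothetical loop: render the textured mesh and follow a text-driven loss."""
    opt = torch.optim.Adam(texture_field.parameters(), lr=1e-3)
    for _ in range(steps):
        image = render(texture_field)               # differentiable rendering (placeholder)
        loss = text_guidance_loss(image, prompt)    # e.g., CLIP/diffusion guidance (assumed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture_field
```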
Title: Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective
Authors: Moon Ye-Bin*, Nam Hyeon-Woo*, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
We live in a vast ocean of data, and deep neural networks are no exception to this. However, this data exhibits an inherent phenomenon of imbalance, which poses a risk of deep neural networks producing biased predictions, leading to potentially severe ethical and social consequences. Given the remarkable advances of recent diffusion models in generating high-quality images, we believe that using generative models is a promising approach to address these challenges. In this work, we propose a strong baseline, SYNAuG, that utilizes synthetic data as a preliminary step before employing task-specific algorithms to address data imbalance problems. This simple approach yields impressive performance improvements on data imbalance benchmarks such as CIFAR100-LT and ImageNet100-LT. While we do not claim that our approach serves as a complete solution to the problem of data imbalance, we argue that supplementing the existing data with synthetic data is a crucial preliminary step in addressing data imbalance concerns. Note that this research is a work in progress.
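As a rough sketch of the "synthetic data as a preliminary step" idea, the snippet below tops up under-represented classes with generated samples before task-specific training. The `generate` callable, prompt template, and uniform target count are illustrative assumptions, not the paper's actual pipeline.

```python
def balance_with_synthetic(real_counts, target_count, generate):
    """Top up each under-represented class with synthetic samples.

    real_counts : dict mapping class name -> number of real images.
    generate    : hypothetical callable (e.g., a diffusion-model wrapper)
                  returning n synthetic images for a class prompt.
    """
    synthetic = {}
    for cls, n_real in real_counts.items():
        n_missing = max(0, target_count - n_real)
        if n_missing > 0:
            synthetic[cls] = generate(prompt=f"a photo of a {cls}", n=n_missing)
    return synthetic  # merged with the real data before task-specific training
```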
Title: Multimodal Laughter Reasoning with Language Models
Authors: Lee Hyun*, Kim Sung-Bin*, Seungju Han, Youngjae Yu, Tae-Hyun Oh
Laughter is a substantial expression that occurs during social interactions between people. While building social intelligence in machines is essential, it is challenging for machines to understand the rationale behind laughter. In this work, we introduce Laugh Reasoning, a new task that ascertains why a particular video induces laughter, accompanied by a new dataset and benchmark designed for this task. Our proposed dataset comprises video clips, their multimodal attributes including visual, semantic, and acoustic features from the video, and language descriptions of why people laugh. We build our dataset by utilizing the general knowledge of large language models and incorporating it into human consensus. Our benchmark provides a baseline for the laugh reasoning task with language models, and by investigating the effect of multimodal information, we substantiate the significance of our dataset.
Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
Authors: Moon Ye-Bin (POSTECH), Jisoo Kim (Columbia University), Hongyeob Kim (Sungkyunkwan University), Kilho Son (Microsoft Azure), Tae-Hyun Oh (POSTECH)
We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces regardless of class distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand, visually mimetic words, i.e., attributes. This work is built on the interesting hypothesis that general language models, e.g., BERT and GPT, encompass visual information to some extent, even without being trained on visual data. Given this hypothesis, TextManiA transfers pre-trained text representations obtained from a well-established large language encoder to a target visual feature space being learned. Our extensive analysis hints that the language encoder indeed encompasses visual information that is at least useful for augmenting visual representations. Our experiments demonstrate that TextManiA is particularly powerful for scarce samples under class imbalance as well as under even distributions. We also show compatibility with label mix-based approaches on evenly distributed scarce data.