[ICCV’23] Three papers have been accepted

2023/07/17 11:28


The International Conference on Computer Vision (ICCV) is the premier international computer vision event, comprising the main conference and several co-located workshops and tutorials.
Title: Scratching Visual Transformer’s Back with Uniform Attention
Authors: Nam Hyeon-Woo (POSTECH), Kim Yu-Ji (POSTECH), Byeongho Heo (NAVER AI Lab), Dongyoon Han (NAVER AI Lab), Seong Joon Oh (University of Tübingen), Tae-Hyun Oh (POSTECH)
The favorable performance of Vision Transformers (ViTs) is often attributed to multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works attribute the effectiveness of MSA to its long-range dependency. In this work, we study the role of MSA along a different axis: density. Our preliminary analyses suggest that the spatial interactions in learned attention maps are closer to dense interactions than to sparse ones. This is a curious phenomenon, because dense attention maps are harder for the model to learn due to the softmax in MSA. We interpret this behavior, which runs against the softmax bias, as a strong preference of ViT models for dense interaction. We thus manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB). Our study demonstrates that including CB takes over the role of dense attention, thereby reducing the density of the original attention maps while complying with the softmax in MSA. We also show that, at negligible cost (one line in your model code and no additional parameters), CB increases both the capacity and generalizability of ViT models.
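The abstract describes CB as adding uniform attention, roughly one line of model code. A minimal NumPy sketch of that idea is below; adding the average token (the output of uniform attention) back to every token. The exact per-layer placement and any normalization are assumptions, not the paper's full recipe.

```python
import numpy as np

def context_broadcasting(x):
    """Add the uniform-attention output (the mean over all tokens)
    back to every token. x has shape (num_tokens, dim)."""
    return x + x.mean(axis=0, keepdims=True)

# Two 2-D tokens; the mean token [2.0, 3.0] is broadcast to both.
tokens = np.array([[1.0, 2.0], [3.0, 4.0]])
out = context_broadcasting(tokens)
```

Because the mean is a parameter-free reduction, this adds no learnable weights, consistent with the "no additional parameters" claim.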
Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
Authors: Moon Ye-bin (POSTECH), Jisoo Kim (Columbia University), Hongyeob Kim (Sungkyunkwan University), Kilho Son (Microsoft Azure), Tae-Hyun Oh (POSTECH)
Recent label mix-based augmentation methods have shown their effectiveness in generalization despite their simplicity, and their favorable effects are often attributed to semantic-level augmentation. However, we found that they are vulnerable to highly skewed class distributions, because scarce data classes are rarely sampled for inter-class perturbation. We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces regardless of the data distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand, visually mimetic words, i.e., attributes. To this end, we bridge the text representation and a target visual feature space, and propose an efficient vector augmentation. To empirically support the validity of our design, we devise two visualization-based analyses and show the plausibility of the bridge between the two modality spaces. Our experiments demonstrate that TextManiA is powerful under data scarcity, for both class-imbalanced and evenly distributed samples. We also show compatibility with label mix-based approaches on evenly distributed scarce data.
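The core vector-augmentation idea of perturbing a visual feature along a text-derived attribute direction can be sketched as follows. The toy embedding table stands in for a real text encoder, and the phrase names, vectors, and scale are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy text embeddings standing in for a real text encoder
# (purely illustrative values; a real system would encode the phrases).
text_emb = {
    "cat":     np.array([1.0, 0.0, 0.0]),
    "red cat": np.array([1.0, 0.5, 0.0]),
}

def textmania_augment(visual_feat, base_phrase, attr_phrase, scale=1.0):
    """Perturb a visual feature along the attribute direction: the
    difference between an attribute-prefixed phrase and the base phrase."""
    direction = text_emb[attr_phrase] - text_emb[base_phrase]
    return visual_feat + scale * direction

v = np.array([0.2, 0.2, 0.2])
aug = textmania_augment(v, "cat", "red cat", scale=0.5)
```

Because the perturbation direction comes from text rather than from other image samples, it remains available even for classes with very few images, which is the motivation stated above.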
Title: Sound Source Localization is All about Cross-Modal Alignment
Authors: Arda Senocak (KAIST), Hyeonggon Ryu (KAIST), Junsik Kim (Harvard Univ.), Tae-Hyun Oh (POSTECH), Hanspeter Pfister (Harvard Univ.), Joon Son Chung (KAIST)
Humans can easily perceive the direction of sound sources in a visual scene, an ability termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior work and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding is important for handling semantically mismatched audio-visual events, e.g., silent objects or off-screen sounds. To account for this, we propose a cross-modal alignment task jointly with sound source localization, to better learn the interaction between the audio and visual modalities. Thereby, we achieve high localization performance with strong cross-modal semantic understanding. Our method outperforms state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary to achieve genuine sound source localization.
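A common way to learn the cross-modal alignment described above is a symmetric contrastive (InfoNCE-style) objective between audio and visual embeddings. The sketch below is a generic version of that idea under the assumption that matched audio-visual pairs share a batch row index; it is not the paper's exact objective or architecture.

```python
import numpy as np

def cross_modal_alignment_loss(audio, visual, tau=0.07):
    """Symmetric InfoNCE-style loss between L2-normalized audio and visual
    embeddings of shape (batch, dim); row i of each modality is a matched pair.
    Generic contrastive sketch, not the paper's exact formulation."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / tau                       # pairwise similarities
    idx = np.arange(len(a))

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)     # numerically stable softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()            # matched pairs on the diagonal

    return 0.5 * (ce(logits) + ce(logits.T))     # audio→visual and visual→audio

# Perfectly aligned pairs give a near-zero loss; mismatched pairs a large one.
audio = np.eye(2)
aligned = cross_modal_alignment_loss(audio, audio)
shuffled = cross_modal_alignment_loss(audio, audio[::-1])
```

Minimizing such a loss pulls each audio embedding toward its paired image and away from the rest of the batch, which is one way to obtain the semantic matching needed to reject silent objects and off-screen sounds.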