[ICCVw’23] Four papers have been accepted

2023/08/22 06:52

AI3DCC: AI for 3D Content Creation

Title: Text-driven Human Avatar Generation by Neural Re-parameterized Texture Optimization
Authors: Kim Youwang (POSTECH), Tae-Hyun Oh (POSTECH)
We present TexAvatar, a text-driven human texture generation system for creative human avatar synthesis. Despite the huge progress in text-driven human avatar generation, efficiently modeling high-quality human appearance remains challenging. With our proposed neural re-parameterized texture optimization, TexAvatar generates a high-quality UV texture in 30 minutes, given only a text description. The generated UV texture can be easily superimposed on animatable human meshes without further processing. This is distinctive in that prior works generate volumetric textured avatars that require cumbersome rigging processes to animate. We demonstrate that TexAvatar produces human avatars of favorable quality at faster speed than recent competing methods.
Title: Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective
Authors: Moon Ye-Bin*, Nam Hyeon-Woo*, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
We live in a vast ocean of data, and deep neural networks are no exception. However, this data exhibits an inherent imbalance, which poses a risk of deep neural networks producing biased predictions, with potentially severe ethical and social consequences. Given the remarkable advances of recent diffusion models in generating high-quality images, we believe that generative models are a promising approach to address these challenges. In this work, we propose a strong baseline, SYNAuG, that utilizes synthetic data as a preliminary step before employing task-specific algorithms for data imbalance problems. This simple approach yields impressive performance improvements on data imbalance benchmarks such as CIFAR100-LT and ImageNet100-LT. While we do not claim that our approach is a complete solution to data imbalance, we argue that supplementing existing data with synthetic data is a crucial preliminary step in addressing data imbalance concerns. Note that this research is a work in progress.
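The core idea above, topping up minority classes with generated samples so the training distribution is uniform before any task-specific algorithm runs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `synthetic_fill_counts` is a hypothetical helper, and in practice the added samples would come from a diffusion model conditioned on each class.

```python
from collections import Counter

def synthetic_fill_counts(labels):
    """For each class, return how many synthetic samples are needed
    to uniformize the class distribution to the majority-class count.
    `labels` is a list of class ids for the real training set."""
    counts = Counter(labels)
    target = max(counts.values())  # match the largest class
    return {cls: target - n for cls, n in counts.items()}

# Example: a long-tailed toy dataset (100 / 10 / 3 samples per class)
labels = [0] * 100 + [1] * 10 + [2] * 3
print(synthetic_fill_counts(labels))  # {0: 0, 1: 90, 2: 97}
```

A generator would then be asked for exactly these per-class counts, after which standard long-tailed recognition methods can be applied on the balanced set.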
Title: Multimodal Laughter Reasoning with Language Models
Authors: Lee Hyun*, Kim Sung-Bin*, Seungju Han, Youngjae Yu, Tae-Hyun Oh
Laughter is a common expressive behavior in social interactions between people. While understanding laughter is essential for building social intelligence in machines, it is challenging for machines to grasp the rationale behind it. In this work, we introduce Laugh Reasoning, a new task that asks why a particular video induces laughter, accompanied by a new dataset and benchmark designed for this task. Our dataset comprises video clips, their multimodal attributes including visual, semantic, and acoustic features from the video, and language descriptions of why people laugh. We build our dataset by leveraging large language models' general knowledge and incorporating it into human consensus. Our benchmark provides a baseline for the laugh reasoning task with language models, and by investigating the effect of multimodal information, we substantiate the significance of our dataset.
Title: TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation
Authors: Moon Ye-Bin (POSTECH), Jisoo Kim (Columbia University), Hongyeob Kim (Sungkyunkwan University), Kilho Son (Microsoft Azure), Tae-Hyun Oh (POSTECH)
We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of class distribution. TextManiA augments visual data with intra-class semantic perturbations by exploiting easy-to-understand, visually mimetic words, i.e., attributes. This work builds on an interesting hypothesis: general language models, e.g., BERT and GPT, encompass visual information to some extent, even without training on visual data. Given this hypothesis, TextManiA transfers pre-trained text representations from a well-established large language encoder into the target visual feature space being learned. Our extensive analysis suggests that the language encoder indeed encodes visual information that is at least useful for augmenting visual representations. Our experiments demonstrate that TextManiA is particularly powerful in scarce-sample regimes, both with class imbalance and with even distribution. We also show compatibility with label mix-based approaches on evenly distributed scarce data.
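The attribute-driven perturbation described above can be pictured with the following sketch: embed a plain class prompt and an attribute-modified prompt, and nudge a visual feature along their difference. This is a minimal illustration under loud assumptions: `embed_text` is a hypothetical stand-in for a real pre-trained language encoder, the text and visual feature dimensions are simply assumed to match, and none of these names come from the paper.

```python
import numpy as np

def embed_text(prompt, dim=512):
    # Hypothetical stand-in for a pre-trained language encoder
    # (e.g., a BERT/GPT sentence embedding): a fixed pseudo-random
    # map so that the same prompt always yields the same vector.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def textmania_perturb(visual_feat, class_name, attribute, alpha=0.1):
    """Intra-class semantic perturbation: add the (scaled, normalized)
    difference between an attribute-modified prompt embedding and the
    plain class prompt embedding to the visual feature."""
    base = embed_text(f"a photo of a {class_name}", dim=visual_feat.shape[0])
    attr = embed_text(f"a photo of a {attribute} {class_name}",
                      dim=visual_feat.shape[0])
    delta = attr - base
    delta = delta / (np.linalg.norm(delta) + 1e-8)  # unit direction
    return visual_feat + alpha * delta

# Usage: perturb one visual feature toward the "striped" attribute.
feat = np.zeros(512)
aug = textmania_perturb(feat, "cat", "striped")
```

In the actual method the transferred text representation is aligned with the visual feature space being learned, rather than added directly as here.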