Publication

Information about the computer vision and machine learning academic field:

* Top-tier conferences: CVPR, ICCV, ECCV, NeurIPS (formerly NIPS), ICML, and ICLR are considered highly prestigious, top-tier conferences with greater impact than most SCI journals. According to Google Scholar Metrics, all of these conferences are listed within the top 100 across all academic fields. Among them, CVPR ranks 5th across all academic fields (1st: Nature, 3rd: Science), and its acceptance rate for poster presentations is about 20%, i.e., highly competitive. Oral presentations at these conferences have acceptance rates below 4%.
* Top-tier journals: IEEE TPAMI and IJCV have the highest impact factors across all computer science categories. As of 2020, the impact factor of TPAMI is 17.861, the second highest of all IEEE publications.
* International Journal: 15
* International Conference: 42
* Domestic Publication: 6
* Thesis: 2
* Other Int'l Conference: 0
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
TPAMI
2021
[Abstract]
Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, and localize the sound source, only by observing them as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To achieve this goal, we develop a two-stream network structure that handles each modality with an attention mechanism for sound source localization. The network naturally reveals localized responses in the scene without human annotation. In addition, we develop a new sound source dataset for performance evaluation. However, our empirical evaluation shows that the unsupervised method produces false conclusions in some cases. We show that these false conclusions cannot be corrected without human prior knowledge, owing to the well-known mismatch between correlation and causality. To address this issue, we extend our network to supervised and semi-supervised settings via a simple modification enabled by the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings on cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360° videos.
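The core attention idea in this paper (correlating an audio embedding with per-location visual features to reveal a localization map) can be sketched as follows. This is an illustrative numpy sketch under assumed shapes, not the paper's actual network; the function name `localize` and the softmax normalization are assumptions for illustration.

```python
import numpy as np

def localize(visual_feats, audio_emb):
    """Hypothetical sketch: cosine-similarity attention between an audio
    embedding and per-location visual features yields a localization map."""
    H, W, C = visual_feats.shape
    v = visual_feats.reshape(-1, C)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)  # unit-normalize
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    scores = v @ a                                  # (H*W,) cosine similarities
    att = np.exp(scores) / np.exp(scores).sum()     # softmax over locations
    return att.reshape(H, W)                        # spatial attention map
```

Locations whose visual features align with the sound embedding receive high attention, which is what the network exposes as the localized response.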
Globally Optimal Inlier Set Maximization for Atlanta World Understanding
Kyungdon Joo, Tae-Hyun Oh, In So Kweon, Jean-Charles Bazin
TPAMI
2020
[Abstract]
In this work, we describe man-made structures via an appropriate structure assumption, called the Atlanta world assumption, which contains a vertical direction (typically the gravity direction) and a set of horizontal directions orthogonal to the vertical direction. Contrary to the commonly used Manhattan world assumption, the horizontal directions in Atlanta world are not necessarily orthogonal to each other. While Atlanta world can encompass a wider range of scenes, this makes the search space much larger and the problem more challenging. Our input data is a set of surface normals, for example, acquired from RGB-D cameras or 3D laser scanners, as well as lines from calibrated images. Given this input data, we propose the first globally optimal method of inlier set maximization for Atlanta direction estimation. We define a novel search space for Atlanta world, as well as its parametrization, and solve this challenging problem using a branch-and-bound (BnB) framework. To alleviate the computational bottleneck in BnB, i.e., the bound computation, we present two bound computation strategies: rectangular bound and slice bound in an efficient measurement domain, i.e., the extended Gaussian image (EGI). In addition, we propose an efficient two-stage method which automatically estimates the number of horizontal directions of a scene. Experimental results with synthetic and real-world datasets have successfully confirmed the validity of our approach.
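The inlier-set objective that BnB maximizes can be illustrated with a minimal sketch: given surface normals and a candidate vertical direction, a normal is an inlier if it is nearly parallel to the vertical (floors/ceilings) or nearly orthogonal to it (walls). This is only the counting step, with an assumed angular threshold; the paper's BnB search and bound strategies are not shown.

```python
import numpy as np

def count_inliers(normals, v, thresh_deg=5.0):
    """Count unit surface normals consistent with candidate vertical v:
    inliers are within thresh_deg of parallel OR orthogonal to v."""
    v = v / np.linalg.norm(v)
    cos = np.abs(normals @ v)          # |cos(angle between normal and v)|
    t = np.deg2rad(thresh_deg)
    parallel = cos >= np.cos(t)        # floor/ceiling normals, near-parallel
    ortho = cos <= np.sin(t)           # wall normals, near-orthogonal
    return int(np.count_nonzero(parallel | ortho))
```

BnB explores the space of candidate directions and uses upper/lower bounds on this count over direction intervals to prune regions that cannot contain the global maximum.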
Fast Randomized Singular Value Thresholding for Low-rank Optimization
Tae-Hyun Oh, Yasuyuki Matsushita, Yu-Wing Tai, In So Kweon
TPAMI
2018
[Abstract]
Rank minimization can be converted into tractable surrogate problems, such as Nuclear Norm Minimization (NNM) and Weighted NNM (WNNM). The problems related to NNM, or WNNM, can be solved iteratively by applying a closed-form proximal operator, called Singular Value Thresholding (SVT), or Weighted SVT, but they suffer from the high computational cost of Singular Value Decomposition (SVD) at each iteration. We propose a fast and accurate approximation method for SVT, which we call fast randomized SVT (FRSVT), with which we avoid direct computation of SVD. The key idea is to extract an approximate basis for the range of the matrix from its compressed matrix. Given the basis, we compute partial singular values of the original matrix from the small factored matrix. In addition, by developing a range propagation method, our method further speeds up the extraction of the approximate basis at each iteration. Our theoretical analysis shows the relationship between the approximation bound of SVD and its effect on NNM via SVT. Along with the analysis, our empirical results quantitatively and qualitatively show that our approximation rarely harms the convergence of the host algorithms. We assess the efficiency and accuracy of the proposed method on various computer vision problems, e.g., subspace clustering, weather artifact removal, and simultaneous multi-image alignment and rectification.
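The key idea of the abstract, extracting an approximate range basis from a compressed matrix and then thresholding singular values of the small factored matrix, can be sketched in a few lines of numpy. This is a generic randomized-SVT sketch with assumed parameters (target rank `k`, oversampling `p`), not the paper's FRSVT with range propagation.

```python
import numpy as np

def fast_randomized_svt(Y, tau, k, p=5):
    """Sketch: approximate SVT(Y, tau) without a full SVD.
    1) Compress Y with a Gaussian test matrix and orthonormalize to get
       an approximate basis Q for range(Y).
    2) SVD the small factored matrix B = Q^T Y.
    3) Soft-threshold the singular values by tau and reconstruct."""
    m, n = Y.shape
    G = np.random.randn(n, k + p)            # Gaussian test matrix
    Q, _ = np.linalg.qr(Y @ G)               # approximate range basis, m x (k+p)
    U_b, s, Vt = np.linalg.svd(Q.T @ Y, full_matrices=False)  # small SVD
    s_shrunk = np.maximum(s - tau, 0.0)      # soft thresholding
    return (Q @ U_b) * s_shrunk @ Vt         # approximate SVT result
```

When the input is (numerically) low-rank with rank at most `k`, the randomized basis captures the range exactly with high probability, so the result matches exact SVT; for general matrices it is an approximation whose quality the paper analyzes.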
Partial Sum Minimization of Singular Values in Robust PCA: Algorithm and Applications
Tae-Hyun Oh, Yu-Wing Tai, Jean-Charles Bazin, Hyeongwoo Kim, In So Kweon
TPAMI
2016
[Abstract]
Robust Principal Component Analysis (RPCA) via rank minimization is a powerful tool for recovering the underlying low-rank structure of clean data corrupted with sparse noise/outliers. In many low-level vision problems, not only is it known that the underlying structure of clean data is low-rank, but the exact rank of the clean data is also known. Yet, when applying conventional rank minimization to those problems, the objective function is formulated in a way that does not fully utilize a priori target rank information about the problems. This observation motivates us to investigate whether there is a better alternative solution when using rank minimization. In this paper, instead of minimizing the nuclear norm, we propose to minimize the partial sum of singular values, which implicitly encourages the target rank constraint. Our experimental analyses show that, when the number of samples is deficient, our approach leads to a higher success rate than conventional rank minimization, while the solutions obtained by the two approaches are almost identical when the number of samples is more than sufficient. We apply our approach to various low-level vision problems, e.g., high dynamic range imaging, motion edge detection, photometric stereo, image alignment and recovery, and show that our results outperform those obtained by the conventional nuclear norm rank minimization method.
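Minimizing the partial sum of singular values leads to a proximal step that, unlike standard SVT, leaves the largest `target_rank` singular values untouched and shrinks only the tail. The sketch below shows that assumed form of the partial thresholding step in numpy; the function name and interface are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def partial_svt(Y, tau, target_rank):
    """Sketch of a partial singular value thresholding step: keep the top
    `target_rank` singular values intact, soft-threshold the remainder,
    so only the residual tail beyond the target rank is penalized."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)  # s sorted descending
    s_out = s.copy()
    s_out[target_rank:] = np.maximum(s[target_rank:] - tau, 0.0)
    return U * s_out @ Vt
```

Because the leading singular values are never shrunk, the known target rank is preserved exactly while the tail, which should be zero for clean data, is driven toward zero.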
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning
Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon
CVPR
2019
[Abstract]
We introduce dense relational captioning, a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in a visual scene. Relational captioning provides explicit descriptions of each relationship between object combinations. This framework is advantageous in both the diversity and the amount of information, leading to a comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, the part-of-speech (POS, i.e., subject-object-predicate categories) can serve as valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to predict the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one responsible for each POS category, trained jointly to predict the correct caption and POS for each word. In addition, we found that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. We demonstrate that our proposed model can generate more diverse and richer captions, via extensive experimental analysis on large-scale datasets and several metrics. We additionally extend our analysis to an ablation study and to applications on holistic image captioning, scene graph generation, and retrieval tasks.
Contextually Customized Video Summaries via Natural Language
Jinsoo Choi, Tae-Hyun Oh, In So Kweon
WACV
2018
Learning to Localize Sound Source in Visual Scenes
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
CVPR
2018
Globally Optimal Inlier Set Maximization for Atlanta Frame Estimation
Kyungdon Joo, Tae-Hyun Oh, In So Kweon, Jean-Charles Bazin
CVPR
2018