A collaborative work with KETI and the University of Birmingham, UK, has been accepted to Pattern Recognition (PR, IF 7.5)!
A Unified Framework for Unsupervised Action Learning via Global-to-local Motion Transformer
Pattern Recognition, 2024 (IF 7.5)
Boeun Kim, Jungho Kim, Hyung Jin Chang, Tae-Hyun Oh
Abstract
Human action recognition remains challenging due to the inherent complexity arising from the combination of diverse granularities of semantics, ranging from the local motion of body joints to high-level relationships across multiple people. To learn this multi-level characteristic of human action in an unsupervised manner, we propose a novel pretraining strategy along with a transformer-based model architecture named GL-Transformer++. Prior methods in unsupervised action recognition and unsupervised group activity recognition (GAR) have shown limitations, often capturing only a partial scope of the action, such as the local movements of each individual or the broader context of the overall motion. To tackle this problem, we introduce a novel pretraining strategy named multi-interval pose displacement prediction (MPDP) that enables the model to learn the diverse extents of the action. In the architectural aspect, we incorporate the global and local attention (GLA) mechanism within the transformer blocks to learn local dynamics between joints, the global context of each individual, and high-level interpersonal relationships in both the spatial and temporal dimensions. The proposed method is a unified approach that demonstrates efficacy in both action recognition and GAR. In particular, our method presents a new and strong baseline, surpassing the current SOTA GAR method by significant margins: 29.6% on Volleyball, and 60.3% and 59.9% on the xsub and xset settings of the Mutual NTU dataset, respectively.
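To give a concrete feel for the MPDP objective, here is a minimal sketch of how multi-interval displacement targets could be constructed. All names, the tensor layout, and the interval set are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-interval pose displacement targets (MPDP-style).
# The function name, (T, J, 3) layout, and interval set are assumptions
# for illustration, not taken from the paper.
import torch

def displacement_targets(poses: torch.Tensor, intervals=(1, 5, 10)):
    """Build per-interval joint displacement targets.

    poses: (T, J, 3) joint positions over T frames for one person.
    Returns {k: (T - k, J, 3) displacements poses[t + k] - poses[t]}.
    Short intervals expose local joint dynamics; long intervals expose
    the broader extent of the motion.
    """
    return {k: poses[k:] - poses[:-k] for k in intervals}

# Usage: a random skeleton sequence of 60 frames with 25 joints.
targets = displacement_targets(torch.randn(60, 25, 3))
assert targets[5].shape == (55, 25, 3)
```

Under this reading, pretraining would regress displacements at several temporal intervals from the same input sequence, so a single objective spans both fine-grained joint motion and the long-range context of the action.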
The pre-print will be posted soon.