Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

  • Sijie Song,
  • Junliang Xing,
  • Wenjun Zeng,
  • Jiaying Liu

International Conference on Multimedia & Expo (ICME)

Published by IEEE


This paper presents a new framework for action recognition from multi-modal data. A skeleton-indexed feature learning procedure is developed to exploit detailed local features from RGB and optical-flow videos. In particular, the proposed framework is built upon a deep Convolutional Network (ConvNet) and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). A skeleton-indexed transform layer is designed to automatically extract visual features around key joints, and a part-aggregated pooling is developed to uniformly regulate the visual features from different body parts and actors. In addition, several fusion schemes are explored to take advantage of the multi-modal data. The proposed deep architecture is end-to-end trainable and can better incorporate different modalities to learn effective feature representations. Quantitative experimental results on two datasets, the NTU RGB+D dataset and the MSR dataset, demonstrate the excellent performance of our scheme over other state-of-the-art methods. To our knowledge, the performance obtained by the proposed framework is currently the best on the challenging NTU RGB+D dataset.
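
To make the described pipeline concrete, below is a minimal PyTorch sketch, not the authors' implementation, of how a skeleton-indexed transform might sample ConvNet features at 2D joint locations, apply part-aggregated pooling over body-part groups, and feed per-frame features to an LSTM classifier. The module names, joint-to-part grouping, bilinear sampling via grid_sample, and all dimensions (channel width, hidden size, 60 classes as in NTU RGB+D) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical grouping of joint indices into body parts (illustrative only).
PART_GROUPS = [
    [0, 1, 2, 3],    # torso / head
    [4, 5, 6],       # left arm
    [7, 8, 9],       # right arm
    [10, 11, 12],    # left leg
    [13, 14, 15],    # right leg
]

class SkeletonIndexedPooling(nn.Module):
    """Sample ConvNet features at joint locations, then max-pool within each body part."""

    def forward(self, feat_map, joints):
        # feat_map: (B, C, H, W) features from an RGB or optical-flow frame
        # joints:   (B, J, 2)    joint (x, y) coordinates normalized to [0, 1]
        grid = joints * 2.0 - 1.0              # grid_sample expects coordinates in [-1, 1]
        grid = grid.unsqueeze(2)               # (B, J, 1, 2)
        sampled = F.grid_sample(feat_map, grid, mode='bilinear', align_corners=False)
        sampled = sampled.squeeze(-1)          # (B, C, J) per-joint local features
        # Part-aggregated pooling: max over the joints belonging to each part.
        parts = [sampled[:, :, idx].max(dim=2).values for idx in PART_GROUPS]
        return torch.cat(parts, dim=1)         # (B, C * num_parts)

class SkeletonIndexedRNN(nn.Module):
    """Per-frame skeleton-indexed features fed to an LSTM for action classification."""

    def __init__(self, channels=256, num_classes=60):
        super().__init__()
        self.pool = SkeletonIndexedPooling()
        self.lstm = nn.LSTM(channels * len(PART_GROUPS), 512, batch_first=True)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, feat_maps, joints):
        # feat_maps: (B, T, C, H, W) per-frame ConvNet features; joints: (B, T, J, 2)
        frames = [self.pool(feat_maps[:, t], joints[:, t]) for t in range(feat_maps.size(1))]
        seq = torch.stack(frames, dim=1)       # (B, T, C * num_parts)
        out, _ = self.lstm(seq)
        return self.classifier(out[:, -1])     # class logits from the last time step
```

In this sketch, fusion of the RGB and optical-flow streams could be realized by running one such branch per modality and combining either the pooled features or the resulting logits; the paper explores several such fusion schemes.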